WO2021070004A1 - Object segmentation in video stream - Google Patents
- Publication number
- WO2021070004A1 (PCT/IB2020/059092)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- video stream
- object segmentation
- segmentation mask
- sequence
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/119—Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/215—Motion-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/137—Motion inside a coding unit, e.g. average field, frame or block difference
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/23—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with coding of regions that are present throughout a whole video segment, e.g. sprites, background or mosaic
Definitions
- the invention relates to the field of computer image processing.
- ROI video object segmentation is a widely used technique that enables video and image editors to separate the foreground of a scene (a scene may be either a video clip or a still image) from the original scene’s background, and treat the foreground as a standalone visual layer.
- creators can place the segmented objects in a different context and create a different visual meaning than the one concluded from the original video/image (e.g., placing a person who was originally filmed in the street in a totally different location, like the surface of the moon).
- ROI video object segmentation is mostly done using machine learning algorithms which learn the various shape variants of the object to be segmented, and then detect that object in every frame of a given video stream.
- the segmentation processing for each frame may require a large amount of computing resources, especially in cases where the segmentation is done on a real-time video feed and the requirement is to draw the segmented objects on a new background with minimum delay. These requirements may result in a quality reduction of the composed (foreground on new background) visual outcome, such as a lower video frame rate and lower resolution.
- a method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
- a system comprising at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
- a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
- FIG. 1 shows an exemplary system for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention.
- FIG. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention.
- FIG. 3 is a schematic illustration of an iterative process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention.
- Described herein are a system, method, and computer program product for automated real-time object segmentation in a video stream.
- the present disclosure provides for continuous real-time object segmentation in a video stream, using a reduced amount of computing resources, and as such, may be particularly suited for mobile and similar applications with relatively modest computing power.
- segmentation refers to the process of dividing a digital image into groups of pixels, based on some criteria (e.g., color, texture, etc.).
- object segmentation refers to a segmentation where each group of pixels is associated with one or more specific objects detected in the image. Object segmentation may identify a contiguous subset of the image, for example, a bounding box enclosing an area of the image containing the object, or a mask outlining the actual object and defining which pixels in the region are included in the object instance.
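The relation between the two region types can be illustrated with a short sketch: given a binary object mask, the enclosing bounding box is simply the extent of the mask's foreground pixels. The helper name `mask_to_bbox` is illustrative, not part of the disclosure:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Derive the enclosing bounding box (x0, y0, x1, y1) of a binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no object pixels in this region
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A 6x6 frame with a 2x3 object instance
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 1:4] = True
print(mask_to_bbox(mask))  # (1, 2, 3, 3)
```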
- Video object segmentation is the task of separating a foreground object from its original background in a video sequence.
- Video object segmentation may be used in video analysis, editing, surveillance, and various commercial and consumer-oriented applications.
- real-time automatic video object segmentation entails performing a refined object segmentation in each frame of the stream, and as such requires considerable computing resources.
- Performing real-time video object segmentation on, e.g., mobile or similar platforms with limited computing power may result in relatively poor results in terms of quality and speed.
- typical video stream frame rates are, e.g., between 15 and 60 frames per second (fps).
- processing time for a full refined object segmentation in a single frame may take between 20 and 50 ms, which exceeds the per-frame budget of such streams and thus cannot be accomplished in real time.
- the present disclosure provides for a hybrid approach which combines (i) selective full ‘ground-truth’ refined object segmentation in frames at regular intervals, and (ii) pixel motion estimation for intermediate frames of the interval, using, e.g., an optical flow and/or similar techniques.
- the process then outputs an estimated object mask for a specified number of subsequent frames, based on the ground truth segmentation as modified by the cumulative estimated pixel movement.
- the estimated output is reset at the next ground-truth segmented frame, to avoid accumulating estimation errors. Because pixel motion estimation is faster and less resource-intensive than full ground-truth segmentation, the overall process can output a continuous real-time segmented stream using a relatively modest computing platform, such as those used in mobile devices.
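The resource argument can be made concrete with illustrative timings (assumed figures, not values from the disclosure): when a full segmentation costs far more than a flow estimate, the amortized per-frame cost of the hybrid schedule stays within a real-time budget:

```python
# Illustrative timings (assumptions for the sketch, not disclosed figures):
seg_ms = 40.0    # one full 'ground-truth' segmentation
flow_ms = 5.0    # one per-pair optical-flow estimate
n = 4            # frames per sequence

# One full segmentation plus n-1 flow estimates, spread over n frames:
amortized = (seg_ms + (n - 1) * flow_ms) / n
print(amortized)  # 13.75 ms/frame, under the ~16.7 ms budget of a 60 fps stream
```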
- the present disclosure provides for a parallelized process which is particularly suited for platforms combining a processing unit with a dedicated acceleration processing component, e.g., platforms combining a central processing unit (CPU) and a graphics processing unit (GPU). Accordingly, selective full segmentation can be performed by, e.g., the GPU at specified intervals, which may be dictated by available computing resources. Simultaneously, an optical flow estimation process can run in parallel on the CPU, thereby increasing overall performance of the system. In more advanced chipsets, a digital signal processor (DSP), which is designed to run neural networks, may take over the full segmentation, while the GPU may process the parallelized computer vision (e.g., optical flow) algorithms.
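The split can be sketched with a background worker standing in for the accelerator: the expensive segmentation of a sequence's first frame is submitted asynchronously while flow estimation for the remaining frames proceeds on the calling thread. Both function bodies are hypothetical stand-ins; a real system would dispatch the heavy call to the GPU or DSP:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def slow_full_segmentation(frame):
    # Stand-in for the GPU/DSP neural-network pass (hypothetical).
    return frame > 0

def fast_flow(prev, curr):
    # Stand-in for the CPU optical-flow pass (hypothetical).
    return np.zeros(prev.shape + (2,))

executor = ThreadPoolExecutor(max_workers=1)
frames = [np.ones((2, 2)) for _ in range(4)]

# Kick off the full segmentation of the sequence's first frame asynchronously...
future_mask = executor.submit(slow_full_segmentation, frames[0])

# ...while optical flow for the remaining frame pairs runs on the main thread.
flows = [fast_flow(a, b) for a, b in zip(frames, frames[1:])]

mask = future_mask.result()  # join: ground-truth mask ready for the next sequence
```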
- the present disclosure is based on dividing a video stream into consecutive sequences of frames, e.g., between 2 and 10 frames each.
- the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, desired speed and quality outcomes, etc.
- the number of frames in a sequence may be dictated, e.g., by the computing power and processing times of the associated computing platform on which the process is to be performed. In some embodiments, the number of frames in a sequence may be dynamically adjusted, based, at least in part, on instant response times of the computing platform.
- a first sequence may comprise, e.g., 6 frames, based on frame segmentation processing time of, e.g., 80ms, and
- a subsequent sequence may comprise, e.g., 3 frames, when instant processing times may have reduced to, e.g., 40ms.
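The dynamic adjustment described above can be sketched as a simple rule, assuming the goal is that one full segmentation completes within a sequence of n frames at the stream's frame rate (the function name and the min/max clamp are assumptions for the sketch):

```python
import math

def frames_per_sequence(seg_time_ms: float, fps: float,
                        min_n: int = 2, max_n: int = 10) -> int:
    """Pick n so one full segmentation fits within a sequence of n frames."""
    frame_interval_ms = 1000.0 / fps
    n = math.ceil(seg_time_ms / frame_interval_ms)
    return max(min_n, min(max_n, n))  # clamp to the 2..10 range of the disclosure

print(frames_per_sequence(80, 60))  # 5: an 80 ms pass spans five ~16.7 ms frames
print(frames_per_sequence(40, 60))  # 3: a faster pass allows shorter sequences
```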
- the present disclosure provides for detecting and segmenting one or more objects in a first frame of the sequence, to generate one or more masks of the objects in the video.
- accurate object segmentation may be performed using a trained neural network, which determines a mask of the object to be segmented.
- the present disclosure then estimates pixel motion throughout subsequent frames in the sequence, wherein object location is estimated for each frame in the sequence by modifying the initial segmentation based on the cumulative estimated pixel motion from the last refined segmentation. The present disclosure then repeats this process with respect to the next sequence in the stream.
- the parallelized process continuously provides for both the selective full segmentation at intervals and the per-frame pixel motion estimation.
- the present disclosure may utilize a forward-propagation process. Accordingly, in some embodiments, for a current video sequence (of, e.g., between 2 and 10 frames), the present disclosure may generate a corresponding object-segmented output stream which obtains a segmentation mask generated for a first frame in the immediately-preceding sequence, and modify it frame-to-frame based on the accumulated pixel motion estimates at each point in the sequence. This process then iterates for each subsequent sequence, using the fully-segmented first frame from the immediately-preceding sequence to return to ground truth, and modifying it with current pixel motion estimations.
- a first sequence in a stream may be used as a ‘buffer’ sequence, which will not be outputted, to allow for the time lag in generating the initial full segmentation.
- Fig. 1 illustrates an exemplary segmentation system 100 for automated real time object segmentation in a video stream, in accordance with some embodiments of the present invention.
- Segmentation system 100 as described herein is only an exemplary embodiment of the present invention, and in practice may have more or fewer components than shown, may combine two or more of the components, or may have a different configuration or arrangement of the components.
- the various components of segmentation system 100 may be implemented in hardware, software, or a combination of both hardware and software.
- segmentation system 100 may comprise a dedicated hardware device, or may form an addition to or extension of an existing medical device, such as a colposcope.
- segmentation system 100 may comprise a hardware processor 110 and memory storage device 114.
- segmentation system 100 may store in a non-volatile memory thereof, such as storage device 114, software instructions or components configured to operate a processing unit (also “hardware processor,” “CPU,” or simply “processor”), such as hardware processor 110.
- the software components may include an operating system, including various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitating communication between various hardware and software components.
- system 100 may comprise one or more graphic processing units (GPUs).
- GPUs graphic processing units
- hardware processor 110 comprises, e.g., a CPU, a GPU, and/or a DSP.
- the GPU and the CPU may be separated, or the GPU may be integrated on the CPU, and the communication portion may be separated from or integrated on the CPU or the GPU or the like.
- non-transient computer-readable storage device 114 (which may include one or more computer readable storage mediums) is used for storing, retrieving, comparing, and/or annotating captured frames.
- Image frames may be stored on storage device 114 based on one or more attributes, or tags, such as a time stamp, a user-entered label, or the result of an applied image processing method indicating the association of the frames, to name a few.
- the software instructions and/or components operating hardware processor 110 may include instructions for receiving and analyzing multiple frames captured by a suitable imaging device.
- hardware processor 110 may comprise image processing module 111 and neural network module 112
- Image processing module 111 receives a video stream and applies one or more image processing algorithms thereto.
- image processing module 111 comprises one or more algorithms configured to perform object detection, classification, segmentation, and/or any other similar operation, using any suitable image processing algorithm, technique, and/or feature extraction process.
- the incoming video streams may come from various imaging devices.
- the video streams received by the image processing module 111 may vary in resolution, frame rate (e.g., between 15 and 60 frames per second), format, and protocol according to the characteristics and purpose of their respective source device.
- the image processing module 111 can route video streams through various processing functions, or to an output circuit that sends the processed video stream for presentation, e.g., on a display, to a recording system, across a network, or to another logical destination.
- the image processing module 111 may apply video stream processing algorithms alone or in combination.
- Image processing module 111 may also facilitate logging or recording operations with respect to a video stream. Some or all of the functionality of the image processing module 111 may be facilitated through a video stream recording system or a video stream processing system.
- system 100 may be configured to obtain an object segmentation result of a specified frame, e.g., a first frame, in a first sequence of the video stream.
- system 100 is configured to perform object detection and semantic segmentation by processing an image to generate an output which defines at least some of (i) regions in the image that depict an instance of a respective object, (ii) a respective object type of the object instance (e.g., vehicle, cat, person, and the like) depicted in each region, and (iii) a respective segmentation of the object instance depicted in each region.
- segmentation system 100 is configured to process an image to generate object segmentation data.
- the object segmentation data defines a respective segmentation of the object instance depicted in each region.
- a segmentation of the object instance depicted in a region defines whether each pixel in the region is included in the object instance.
- segmentation system 100 generates the object detection data and the object segmentation data using neural network module 112 that can be trained using machine learning training techniques.
- Neural network module 112 may comprise a convolutional neural network (i.e., one which includes one or more convolutional neural network layers), and can be implemented to embody any appropriate convolutional neural network architecture, e.g., U-Net, Mask R-CNN, DeepLab, and the like. See, e.g., Olaf Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597v1 [cs.CV], 18 May 2015.
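Whatever the architecture, the network's per-pixel foreground scores are typically reduced to a binary mask by thresholding. A minimal post-processing sketch follows; the 4x4 logit map is fabricated purely for illustration and does not come from any particular network:

```python
import numpy as np

def logits_to_mask(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert per-pixel foreground logits (network output) to a binary mask."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: logits -> probabilities
    return probs >= threshold

# Hypothetical output map: positive logits where the object instance was detected.
logits = np.array([[-3., -3., -3., -3.],
                   [-3.,  2.,  2., -3.],
                   [-3.,  2.,  2., -3.],
                   [-3., -3., -3., -3.]])
mask = logits_to_mask(logits)
print(mask.sum())  # 4 foreground pixels
```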
- neural network module 112 may include an input layer followed by a sequence of shared convolutional neural network layers.
- the output of the final shared convolutional neural network layer may be provided to a sequence of one or more additional neural network layers that are configured to generate the object detection data. However, other appropriate neural network processes may also be used.
- the output of the final shared convolutional neural network layers may be provided to a different sequence of one or more additional neural network layers.
- system 100 may also be configured to estimate motion between image frames, i.e., determine motion vectors that describe the transformation from one 2D image to another (usually, from adjacent frames in a video sequence). Motion estimation may be defined as the process of finding corresponding points between two images (e.g., video frames), wherein the points that correspond to each other in two views of a scene or object may be considered to be the same point in that scene or on that object.
- the present disclosure may apply optical flow and/or another similar computer vision technique or algorithm to estimate motion between frames. See, e.g., Farnebäck G. (2003) Two-Frame Motion Estimation Based on Polynomial Expansion. In: Bigun J., Gustavsson T. (eds) Image Analysis. SCIA 2003. Lecture Notes in Computer Science, vol 2749. Springer, Berlin, Heidelberg.
- optical flow may be defined as the velocity field which warps one image into another (usually very similar) image.
- an optical flow estimate comprises an estimate of a translation that describes any motion of a pixel from a position in one image to a position in a subsequent image.
- optical flow estimation returns, with respect to each pixel and/or group of pixels, a change in coordinates (x, y) of the pixel.
- pixel motion between pairs of images may be estimated using additional and/or other methods.
- system 100 may also compute cumulative pixel coordinate difference acquired over several pairs of consecutive images.
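Accumulating the per-pixel displacement over several consecutive pairs reduces to summing the flow fields. In practice each field might come from a dense optical flow routine such as OpenCV's `calcOpticalFlowFarneback`; constant stand-in fields suffice to illustrate the accumulation:

```python
import numpy as np

def accumulate_flow(flows):
    """Sum per-pixel (dx, dy) displacement fields over consecutive frame pairs."""
    total = np.zeros_like(flows[0])
    for f in flows:
        total = total + f
    return total

# Three flow fields for a 2x2 image; each moves every pixel by (+1, 0).
flows = [np.full((2, 2, 2), [1.0, 0.0]) for _ in range(3)]
total = accumulate_flow(flows)
print(total[0, 0])  # [3. 0.] -- net displacement after three frame pairs
```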
- Fig. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream.
- segmentation system 100 in Fig. 1 may be configured to receive a video stream depicting, e.g., one or more objects, such as humans, pets, inanimate objects, and the like.
- system 100 may be configured to divide the video stream into sequences of specified number of frames, e.g., between 2 and 10 frames each.
- the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, chipset performance, desired speed and quality outcomes, etc.
- system 100 may perform an iterative process comprising:
- Fig. 3 is a schematic illustration of the iterative process of step 206.
- the present disclosure may utilize a forward-propagation process, wherein each outputted frame is based on:
- system 100 may receive a current sequence comprising n frames, e.g., sequence j comprising 4 frames (frames j+0 through j+3). System 100 may then perform the following sub-steps of iterative step 206 with respect to the current sequence:
- Step 206a: Generate a full ground-truth object segmentation for one or more objects in frame j+0 of the current sequence j, to be propagated forward for use by the next sequence k;
- Step 206b: Estimate an object mask in frame j+0, by modifying the refined segmentation of frame i+0 received from the previous sequence i, incorporating all cumulative pixel motion estimates from frame i+0 through frame i+3 (i.e., Σ(Δx, Δy) of the optical flows between (i+0 → i+1), (i+1 → i+2), and (i+2 → i+3));
- Step 206c: Continuously estimate pixel motion between each sequential pair of frames in the current sequence j (i.e., (j+0 → j+1), (j+1 → j+2), and (j+2 → j+3)); and
- Step 206d: Use the corresponding pixel motion estimates from step 206c to generate estimated object masks for each current frame (j+1 through j+3).
- the full object segmentation mask generated in step 206a above is propagated forward, to be used as the basis for estimating a mask for frame k + 0 in subsequent sequence k.
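Sub-steps 206a through 206d can be sketched as bookkeeping across two sequences. All function names are illustrative stand-ins (the disclosure specifies no code); the zero-flow stand-in and the nearest-pixel warp merely keep the sketch self-contained:

```python
import numpy as np

def full_segmentation(frame):
    # Stand-in for the refined 'ground-truth' pass of step 206a (hypothetical).
    return frame > 0

def estimate_flow(prev, curr):
    # Stand-in for a per-pair optical-flow estimate; (dy, dx) per pixel.
    return np.zeros(prev.shape + (2,))

def shift_mask(mask, total_flow):
    """Warp a mask by a (rounded) accumulated displacement field."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    ny = np.clip(ys + np.round(total_flow[ys, xs, 0]).astype(int), 0, h - 1)
    nx = np.clip(xs + np.round(total_flow[ys, xs, 1]).astype(int), 0, w - 1)
    out[ny, nx] = True
    return out

def process_sequence(frames, prev_ref_mask, prev_seq_flow):
    """Steps 206a-206d for one sequence: returns this sequence's output masks,
    plus the (reference mask, flow sum) to propagate to the next sequence."""
    ref_mask = full_segmentation(frames[0])              # 206a, for the next sequence
    masks = [shift_mask(prev_ref_mask, prev_seq_flow)]   # 206b, mask for frame j+0
    seq_flow = np.zeros(frames[0].shape + (2,))
    for prev, curr in zip(frames, frames[1:]):
        seq_flow = seq_flow + estimate_flow(prev, curr)  # 206c, per-pair motion
        masks.append(shift_mask(prev_ref_mask, prev_seq_flow + seq_flow))  # 206d
    return masks, ref_mask, seq_flow
```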
- system 100 may be configured to output object segmentation results in the current frame sequence j, e.g., to a media device to be displayed on a display monitor, a device screen, etc.
- the output stream may be directed to another computing platform for continued processing and/or manipulation.
- step 208 continuously outputs the results of iterative step 206, to generate a continuous real-time output stream of the input video with object segmentation results.
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non- exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transitory (i.e., non-volatile) medium.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Abstract
A method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
Description
OBJECT SEGMENTATION IN VIDEO STREAM
FIELD OF INVENTION
[0001] The invention relates to the field of computer image processing.
BACKGROUND OF THE INVENTION
[0002] Automatic region-of-interest (ROI) video object segmentation is a widely used technique that enables video and image editors to separate the foreground of a scene (a scene may be either a video clip or a still image) from the original scene’s background, and treat the foreground as a standalone visual layer. By modifying or replacing the background, creators can place the segmented objects in a different context and create a visual meaning different from that of the original video/image (e.g., placing a person who was originally filmed in the street in a totally different location, such as the surface of the moon).
[0003] ROI video object segmentation is mostly done using machine learning algorithms which learn the various shape variants of the object to be segmented, and then detect this object in every frame of a given video stream. The segmentation processing for each frame may require a large amount of computing resources, especially in cases where the segmentation is done on a real-time video feed and the requirement is to draw the segmented objects on a new background with minimal delay. These requirements may result in quality reduction of the composed (foreground on new background) visual outcome, such as a lower video frame rate and lower resolution.
[0004] The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.
SUMMARY OF THE INVENTION
[0005] The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.
[0006] There is provided, in an embodiment, a method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0007] There is also provided, in an embodiment, a system comprising at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0008] There is further provided, in an embodiment, a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0009] In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.
[0011] Fig. 1 shows an exemplary system for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention;
[0012] Fig. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention; and
[0013] Fig. 3 is a schematic illustration of an iterative process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention.
DETAILED DESCRIPTION
[0014] Described herein are a system, method, and computer program product for automated real-time object segmentation in a video stream. In some embodiments, the present disclosure provides for continuous real-time object segmentation in a video stream, using a reduced amount of computing resources, and as such, may be particularly suited for mobile and similar applications with relatively modest computing power.
[0015] In the context of computer image processing, the term “segmentation,” as used herein, refers to the process of dividing a digital image into groups of pixels, based on some criteria (e.g., color, texture, etc.). The term “object segmentation” refers to a segmentation where each group of pixels is associated with one or more specific objects detected in the image. Object segmentation may identify a contiguous subset of the image, for example, a bounding box enclosing an area of the image containing the object, or a mask outlining the actual object and defining which pixels in the region are included in the object instance.
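To make the two representations concrete, the following sketch (assuming NumPy; the helper names are hypothetical, not from the disclosure) derives a binary object mask from a per-pixel foreground-score map, and then the tightest bounding box enclosing that mask:

```python
import numpy as np

def mask_from_scores(scores, threshold=0.5):
    # Per-pixel decision: a pixel is included in the object instance
    # when its foreground score clears the threshold.
    return (scores >= threshold).astype(np.uint8)

def mask_to_bbox(mask):
    # Tightest box (y0, x0, y1, x1) enclosing all mask pixels,
    # i.e., the coarser "contiguous subset" representation.
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1
```

A mask preserves the exact pixel membership of the instance, while the box is cheaper to store and compare; segmentation pipelines commonly carry both.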
[0016] Video object segmentation is the task of separating a foreground object from its original background in a video sequence. Video object segmentation may be used in video analysis, editing, surveillance, and various commercial and consumer-oriented applications. However, real-time automatic video object segmentation entails performing a refined object segmentation in each frame of the stream, and as such requires considerable computing resources. Performing real-time video object segmentation on, e.g., mobile or similar platforms with limited computing power may result in relatively poor results in terms of quality and speed.
[0017] For example, typical video stream frame rates are, e.g., between 15 and 60 frames-per-second (fps). However, processing time for a full refined object segmentation in a single frame, using a typical mobile device chipset, may take between 20-50ms, and thus cannot be accomplished in real-time.
[0018] Accordingly, in some embodiments, the present disclosure provides for a hybrid approach which combines (i) selective full ‘ground-truth’ refined object segmentation in frames at regular intervals, and (ii) pixel motion estimation for the intermediate frames of the interval, using, e.g., optical flow and/or similar techniques. The process then outputs an estimated object mask for a specified number of subsequent frames, based on the ground-truth segmentation as modified by the cumulative estimated pixel movement. The estimated output is reset at the next ground-truth segmented frame, to avoid accumulating estimation errors. Because pixel motion estimation is faster and less resource-intensive than full ground-truth segmentation, the overall process can output a continuous real-time segmented stream using a relatively modest computing platform, such as those used in mobile devices.
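A rough sketch of this hybrid loop follows (in NumPy; `full_segment` and `estimate_flow` are hypothetical placeholders standing in for a segmentation network and an optical-flow routine, neither of which is specified here):

```python
import numpy as np

def warp_mask(mask, flow):
    # Shift each mask pixel by its estimated (dy, dx) motion. `flow` has
    # shape (H, W, 2), as a dense optical-flow routine would return.
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    ny = np.clip((ys + flow[ys, xs, 0]).round().astype(int), 0, h - 1)
    nx = np.clip((xs + flow[ys, xs, 1]).round().astype(int), 0, w - 1)
    out = np.zeros_like(mask)
    out[ny, nx] = 1
    return out

def segment_stream(frames, full_segment, estimate_flow, n=4):
    # Hybrid loop: expensive 'ground-truth' segmentation once every n
    # frames, cheap flow-based mask propagation in between. The mask is
    # reset at each full segmentation, so estimation errors do not
    # accumulate past one sequence.
    masks = []
    for i, frame in enumerate(frames):
        if i % n == 0:
            mask = full_segment(frame)
        else:
            flow = estimate_flow(frames[i - 1], frame)
            mask = warp_mask(mask, flow)
        masks.append(mask)
    return masks
```

Note this simple version segments the first frame of each sequence synchronously; the forward-propagation variant described further below hides that latency by reusing the previous sequence's mask.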
[0019] In some embodiments, the present disclosure provides for a parallelized process which is particularly suited for platforms combining a processing unit with a dedicated acceleration processing component, e.g., platforms combining a central processing unit (CPU) and a graphic processing unit (GPU). Accordingly, selective full segmentation can be performed by, e.g., the GPU at specified intervals, which may be dictated by available computing resources. Simultaneously, an optical flow estimation process can run in parallel on the CPU, thereby increasing the overall performance of the system. In more advanced chipsets, a digital signal processor (DSP), which is designed to run neural networks, may handle the full segmentation, while the GPU processes the parallelized computer vision (e.g., optical flow) algorithms.
[0020] In some embodiments, the present disclosure is based on dividing a video stream into consecutive sequences of frames, e.g., between 2 and 10 frames each.
[0021] In some embodiments, the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, desired speed and quality outcomes, etc.
[0022] In some embodiments, the number of frames in a sequence may be dictated, e.g., by the computing power and processing times of the associated computing platform on which the process is to be performed. In some embodiments, the number of frames in a sequence may be dynamically adjusted, based, at least in part, on the instant response times of the computing platform. Thus, for example:
• A first sequence may comprise, e.g., 6 frames, based on frame segmentation processing time of, e.g., 80ms, and
• a subsequent sequence may comprise, e.g., 3 frames, when instant processing times may have reduced to, e.g., 40ms.
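One illustrative heuristic for such dynamic adjustment (an assumption for illustration only; the disclosure does not specify a formula, and the 2-10 frame bounds are taken from the range given above) is to pick the smallest sequence length whose wall-clock duration covers one full segmentation:

```python
import math

def sequence_length(seg_time_ms, frame_period_ms, n_min=2, n_max=10):
    # Choose n so that one full segmentation (seg_time_ms) finishes
    # within the n-frame sequence it serves; frame_period_ms is
    # 1000 / fps. Clamp to the disclosure's 2-10 frame range.
    n = math.ceil(seg_time_ms / frame_period_ms)
    return max(n_min, min(n, n_max))
```

At 60 fps (a frame period of about 16.7 ms), a 40 ms segmentation time yields 3-frame sequences, consistent with the second bullet above; as segmentation slows, the sequence lengthens so the propagation never outruns the ground truth.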
[0023] In some embodiments, for each sequence of frames, the present disclosure provides for detecting and segmenting one or more objects in a first frame of the sequence, to generate one or more masks of the objects in the video. In some embodiments, accurate object segmentation may be performed using a trained neural network, which determines a mask of the object to be segmented.
[0024] In some embodiments, the present disclosure then estimates pixel motion throughout subsequent frames in the sequence, wherein object location is estimated for each frame in the sequence by modifying the initial segmentation based on the cumulative estimated pixel motion from the last refined segmentation. The present disclosure then repeats this process with respect to the next sequence in the stream.
[0025] Thus, in some embodiments, the parallelized process continuously provides for:
• Refined ‘ground-truth’ object segmentation on selected frames at specified intervals (which may be dynamically adjusted throughout the video stream), and
• continuous optical flow process which estimates pixel motion frame-to-frame throughout the video stream.
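One possible shape for this parallel scheme is sketched below, with a Python thread standing in for dedicated GPU/DSP execution (an illustration only; a production system would use the chipset's own command queues and synchronization primitives rather than host threads):

```python
import threading

def run_pipeline(frames, full_segment, estimate_flow, n=4):
    # Launch the expensive segmentation on worker threads (the 'GPU'),
    # while frame-to-frame flow runs on the calling thread (the 'CPU').
    seg_results = {}
    threads = []
    flows = {}

    def seg_worker(idx, frame):
        seg_results[idx] = full_segment(frame)

    for i, frame in enumerate(frames):
        if i % n == 0:  # first frame of each sequence gets full segmentation
            t = threading.Thread(target=seg_worker, args=(i, frame))
            t.start()
            threads.append(t)
        if i > 0:       # flow is computed for every consecutive pair
            flows[i] = estimate_flow(frames[i - 1], frames[i])

    for t in threads:
        t.join()
    return seg_results, flows
```

Because the two workloads touch different frames' data, they can proceed without contention; the downstream mask-propagation step then combines a finished segmentation with the accumulated flows.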
[0026] In some embodiments, in order to output a continuous stream of object-segmented video in real-time, while allowing for the time-lag caused by performing full ground-truth segmentation on an initial frame in the stream, the present disclosure may utilize a forward-propagation process. Accordingly, in some embodiments, for a current video sequence (of, e.g., between 2 and 10 frames), the present disclosure may generate a corresponding object-segmented output stream which obtains a segmentation mask generated for a first frame in the immediately-preceding sequence, and modifies it frame-to-frame based on the accumulated pixel motion estimates at each point in the sequence. This process then iterates for each subsequent sequence, using the fully-segmented first frame from the immediately-preceding sequence to return to ground-truth, and modifying it with current pixel motion estimations.
[0027] In some embodiments, a first sequence in a stream may be used as a ‘buffer’ sequence, which will not be outputted, to allow for the time lag in generating the initial full segmentation.
[0028] Fig. 1 illustrates an exemplary segmentation system 100 for automated real-time object segmentation in a video stream, in accordance with some embodiments of the present invention. Segmentation system 100 as described herein is only an exemplary embodiment of the present invention, and in practice may have more or fewer components than shown, may combine two or more of the components, or may have a different configuration or arrangement of the components. The various components of segmentation system 100 may be implemented in hardware, software, or a combination of both hardware and software. In various embodiments, segmentation system 100 may comprise a dedicated hardware device, or may form an addition to or extension of an existing medical device, such as a colposcope.
[0029] In some embodiments, segmentation system 100 may comprise a hardware processor 110 and a memory storage device 114. In some embodiments, segmentation system 100 may store in a non-volatile memory thereof, such as storage device 114, software instructions or components configured to operate a processing unit (also "hardware processor," "CPU," or simply "processor"), such as hardware processor 110. In some embodiments, the software components may include an operating system, including various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitating communication between various hardware and
software components. In some embodiments, system 100 may comprise one or more graphic processing units (GPUs). In some embodiments, hardware processor 110 comprises, e.g., a CPU, a GPU, and/or a DSP. In some embodiments, the GPU and the CPU may be separated, or the GPU may be integrated on the CPU, and the communication portion may be separated from or integrated on the CPU or the GPU or the like.
[0030] In some embodiments, non-transient computer-readable storage device 114 (which may include one or more computer readable storage mediums) is used for storing, retrieving, comparing, and/or annotating captured frames. Image frames may be stored on storage device 114 based on one or more attributes, or tags, such as a time stamp, a user-entered label, or the result of an applied image processing method indicating the association of the frames, to name a few.
[0031] The software instructions and/or components operating hardware processor 110 may include instructions for receiving and analyzing multiple frames captured by a suitable imaging device. For example, hardware processor 110 may comprise image processing module 111 and neural network module 112. Image processing module 111 receives a video stream and applies one or more image processing algorithms thereto. In some embodiments, image processing module 111 comprises one or more algorithms configured to perform object detection, classification, segmentation, and/or any other similar operation, using any suitable image processing algorithm, technique, and/or feature extraction process. The incoming video streams may come from various imaging devices. The video streams received by the image processing module 111 may vary in resolution, frame rate (e.g., between 15 and 60 frames per second), format, and protocol according to the characteristics and purpose of their respective source device. Depending on the embodiment, the image processing module 111 can route video streams through various processing functions, or to an output circuit that sends the processed video stream for presentation, e.g., on a display, to a recording system, across a network, or to another logical destination. The image processing module 111 may apply video stream processing algorithms alone or in combination. Image processing module 111 may also facilitate logging or recording operations with respect to a video stream. Some or all of the functionality of the image processing module 111 may be facilitated through a video stream recording system or a video stream processing system.
[0032] In some embodiments, system 100 may be configured to obtain an object segmentation result of a specified frame, e.g., a first frame, in a first sequence of the video stream. In some embodiments, system 100 is configured to perform object detection and semantic segmentation by processing an image to generate an output which defines at least some of (i) regions in the image that depict an instance of a respective object, (ii) a respective object type of the object instance (e.g., vehicle, cat, person, and the like) depicted in each region, and (iii) a respective segmentation of the object instance depicted in each region. In some embodiments, segmentation system 100 is configured to process an image to generate object segmentation data. The object segmentation data defines a respective segmentation of the object instance depicted in each region. A segmentation of the object instance depicted in a region defines whether each pixel in the region is included in the object instance. In some embodiments, segmentation system 100 generates the object detection data and the object segmentation data using neural network module 112, which can be trained using machine learning training techniques.
[0033] Neural network module 112 may comprise a convolutional neural network (i.e., one which includes one or more convolutional neural network layers), and can be implemented to embody any appropriate convolutional neural network architecture, e.g., U-Net, Mask R-CNN, DeepLab, and the like. See, e.g., Olaf Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597v1 [cs.CV] 18 May 2015. In a particular example, neural network module 112 may include an input layer followed by a sequence of shared convolutional neural network layers. The output of the final shared convolutional neural network layer may be provided to a sequence of one or more additional neural network layers that are configured to generate the object detection data. However, other appropriate neural network architectures may also be used. The output of the final shared convolutional neural network layers may be provided to a different sequence of one or more additional neural network layers.
[0034] In some embodiments, system 100 may also be configured to estimate motion between image frames, i.e., determine motion vectors that describe the transformation from one 2D image to another (usually, from adjacent frames in a video sequence). Motion estimation may be defined as the process of finding corresponding points between two images (e.g., video frames), wherein the points that correspond to each other in two views of a scene or object may be considered to be the same point in that scene or on that object. In some embodiments, the present disclosure may apply optical flow and/or another similar computer vision technique or algorithm to estimate motion between frames. See, e.g., Farnebäck G. (2003) Two-Frame Motion Estimation Based on Polynomial Expansion. In: Bigun J., Gustavsson T. (eds) Image Analysis. SCIA 2003. Lecture Notes in Computer Science, vol 2749. Springer, Berlin, Heidelberg.
[0035] For consecutive image sequences such as found in video presentations, optical flow may be defined as the velocity field which warps one image into another (usually very similar) image. In some embodiments, an optical flow estimate comprises an estimate of a translation that describes any motion of a pixel from a position in one image to a position in a subsequent image. In some embodiments, optical flow estimation returns, with respect to each pixel and/or group of pixels, a change in the coordinates (x, y) of the pixel. In some embodiments, pixel motion between pairs of images may be estimated using additional and/or other methods. In some embodiments, system 100 may also compute the cumulative pixel coordinate difference acquired over several pairs of consecutive images.
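A naive way to compute such a cumulative coordinate difference is to sum the per-pixel displacement fields of consecutive pairs, as sketched below in NumPy. This is a simplification: a strict composition would re-sample each later field at the pixel's moved position, but over a short sequence of small motions a plain sum keeps the estimate cheap:

```python
import numpy as np

def accumulate_flow(flows):
    # Sum per-pixel (dy, dx) displacements over consecutive frame pairs.
    # Each element of `flows` has shape (H, W, 2), e.g., as returned by
    # a dense optical-flow routine for one pair of adjacent frames.
    total = np.zeros_like(flows[0])
    for f in flows:
        total += f
    return total
```

The resulting field gives, for every pixel of the reference frame, its total estimated displacement since the last ground-truth segmentation.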
[0036] Fig. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream.
[0037] At step 202, in some embodiments, segmentation system 100 in Fig. 1 may be configured to receive a video stream depicting, e.g., one or more objects, such as humans, pets, inanimate objects, and the like.
[0038] At step 204, in some embodiments, system 100 may be configured to divide the video stream into sequences of specified number of frames, e.g., between 2 and 10 frames each. In some embodiments, the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, chipset performance, desired speed and quality outcomes, etc.
[0039] At step 206, system 100 may perform an iterative process comprising:
• Full ground truth object segmentation on a first frame in each interval, and
• continuous optical flow process to estimate pixel motion frame-to-frame throughout the video stream.
[0040] Fig. 3 is a schematic illustration of the iterative process of step 206. In some embodiments, in order to output a continuous stream of object-segmented video in real time (i.e., outputting between 15-60 frames per second), while allowing for the time-lag
caused by performing full ground-truth segmentation on an initial frame in the stream (e.g., 20-50ms), the present disclosure may utilize a forward-propagation process, wherein each outputted frame is based on:
(i) a ‘ground-truth’ segmentation mask generated for the first frame in the immediately-preceding sequence, as
(ii) modified by the cumulative optical flow calculated as of the last segmented mask.
[0041] Accordingly, as illustrated in Fig. 3, in some embodiments, system 100 may receive a current sequence comprising n frames, e.g., sequence j comprising 4 frames (frame j + 0 through frame j + 3). System 100 may then perform the following sub-steps of iterative step 206 with respect to the current sequence:
(i) Step 206a: Generate a full ground-truth object segmentation for one or more objects in frame j + 0 of the current sequence j, to be propagated forward for use by the next sequence k;
(ii) Step 206b: Estimate an object mask in frame j + 0, by modifying the refined segmentation of frame i + 0 received from the previous sequence i, by incorporating all cumulative pixel motion estimates from frame i + 0 to frame i + 3 (i.e., Σ(ΔX, ΔY) of the optical flows between ([i + 0] - [i + 1]), ([i + 1] - [i + 2]), and ([i + 2] - [i + 3]));
(iii) Step 206c: Continuously estimate pixel motion between each sequential pair of frames in current sequence j (i.e., ([j + 0] - [j + 1]), ([j + 1] - [j + 2]), and ([j + 2] - [j + 3])); and
(iv) Step 206d: Use the corresponding pixel motion estimates from step 206c to generate estimated object masks for each current frame ([j + 1] - [j + 3]).
[0042] Accordingly, with continued reference to Fig. 3, in some embodiments, the following forms the basis for generating an estimated object mask for each frame in the current sequence j:
• Estimated mask for frame j + 0 = frame i + 0 mask + cumulative pixel motion ([i + 0] - [i + 3]).
• Estimated mask for frame j + 1 = estimated mask for frame j + 0 + pixel motion ([j + 0] - [j + 1]).
• Estimated mask for frame j + 2 = estimated mask for frame j + 1 + pixel motion ([j + 1] - [j + 2]).
• Estimated mask for frame j + 3 = estimated mask for frame j + 2 + pixel motion ([j + 2] - [j + 3]).
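The recurrence above may be sketched as follows (a schematic only; `warp` stands in for any routine that displaces a mask by a flow field, and the names are hypothetical, not from the disclosure):

```python
def propagate_masks(ref_mask, cum_flow_prev, pair_flows, warp):
    # First frame of sequence j: the reference mask from sequence i,
    # shifted by the motion accumulated over the previous sequence.
    masks = [warp(ref_mask, cum_flow_prev)]
    # Each later frame: the predecessor's estimated mask, shifted by
    # one more pair-wise flow estimate.
    for flow in pair_flows:
        masks.append(warp(masks[-1], flow))
    return masks
```

Because each sequence restarts from a freshly propagated ground-truth mask, per-frame warping errors accumulate only within one short sequence before being discarded.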
[0043] In some embodiments, the full object segmentation mask generated in step 206a above is propagated forward, to be used as the basis for estimating a mask for frame k + 0 in subsequent sequence k.
[0044] In some embodiments, at step 208, system 100 may be configured to output the object segmentation results in the current frame sequence j, e.g., to a media device to be displayed on a display monitor, a device screen, etc. In some embodiments, the output stream may be directed to another computing platform for continued processing and/or manipulation. In some embodiments, step 208 continuously outputs the results of iterative step 206, to generate a continuous real-time output stream of the input video with object segmentation results.
[0045] The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
[0046] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non- exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage
medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transient (i.e., non-volatile) medium.
[0047] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0048] Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the
computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
[0049] Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0050] These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0051] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0052] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions
for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0053] The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims
1. A method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and
(ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
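The method of claim 1 can be illustrated with a minimal sketch. The warping function below is a pure-NumPy stand-in chosen for illustration; the claims do not prescribe a particular optical-flow or warping implementation. Each estimated mask is obtained by sampling the reference mask through the running sum of per-frame pixel motion:

```python
import numpy as np

def warp_mask(mask, acc_flow):
    """Warp a binary mask by an accumulated per-pixel flow field (dy, dx)."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Backward mapping: each output pixel samples the reference mask at
    # its own position minus the accumulated displacement.
    src_y = np.clip(np.round(ys - acc_flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - acc_flow[..., 1]).astype(int), 0, w - 1)
    return mask[src_y, src_x]

# Reference mask: a 2x2 object near the top-left of an 8x8 frame.
ref_mask = np.zeros((8, 8), dtype=np.uint8)
ref_mask[1:3, 1:3] = 1

# Per-frame pixel motion: the whole scene shifts 1 px right each frame.
frame_flow = np.zeros((8, 8, 2))
frame_flow[..., 1] = 1.0

acc = np.zeros_like(frame_flow)
masks = []
for _ in range(3):            # three frames following the reference frame
    acc += frame_flow         # running sum of per-frame motion calculations
    masks.append(warp_mask(ref_mask, acc))

# After three frames the estimated object has moved three pixels right.
```

In a real pipeline the per-frame flow field would come from an optical flow method (claim 2) rather than being constant, but the accumulate-then-warp structure is the same.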
2. The method of claim 1, wherein said calculating is performed using an optical flow method.
3. The method of any one of claims 1-2, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
4. The method of any one of claims 1-3, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
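Claims 3 and 4 tie n to the latency of generating the reference mask. One plausible reading, sketched below with illustrative numbers (the function name and values are assumptions, not from the claims): choose n so that one sequence of frames spans at least the time the segmenter needs to produce the next reference mask, and re-derive it whenever the measured latency changes.

```python
import math

def frames_per_sequence(mask_latency_s, frame_interval_s):
    """Choose n so one sequence spans at least one reference-mask computation."""
    return max(1, math.ceil(mask_latency_s / frame_interval_s))

# e.g. a segmentation network taking ~95 ms on a 30 fps stream
n = frames_per_sequence(0.095, 1 / 30)     # -> 3

# Claim 4: dynamically adjust n as the measured latency changes.
n = frames_per_sequence(0.150, 1 / 30)     # latency grew, so n grows too
```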
5. The method of any one of claims 1-4,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask generated for a previous frame in said current sequence with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said
pixel motion calculations accumulated from said reference frame through said previous frame.
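The incremental form of claim 5 can be sketched with a uniform integer shift standing in for per-pixel motion (a simplifying assumption for brevity): warping the previous frame's estimate by only the newest motion yields the same mask as warping the reference mask by the full accumulated motion, as in claim 1.

```python
import numpy as np

def shift_mask(mask, dx):
    """Shift a binary mask dx pixels to the right (uniform-motion stand-in)."""
    out = np.zeros_like(mask)
    out[:, dx:] = mask[:, :mask.shape[1] - dx]
    return out

ref = np.zeros((4, 8), dtype=np.uint8)
ref[1:3, 0:2] = 1                 # reference mask from the preceding sequence

prev_est = shift_mask(ref, 2)     # (ii) reference + motion through previous frame
cur_est = shift_mask(prev_est, 1) # (i) previous estimate + motion prev -> current
direct = shift_mask(ref, 3)       # claim 1 form: reference + full accumulation
```

Composing the two partial warps reproduces the direct warp, which is why the per-frame update avoids re-accumulating motion from the reference frame on every iteration.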
6. The method of any one of claims 1-5, wherein said reference object segmentation mask is generated by applying a trained neural network.
7. The method of any one of claims 1-6, wherein said reference object segmentation mask comprises one or more objects in said video stream.
8. The method of any one of claims 1-7, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
9. The method of any one of claims 1-8, wherein said reference frame is a first frame in said preceding sequence.
10. The method of any one of claims 1-9, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
11. The method of any one of claims 1-10, further comprising outputting said estimated segmentation masks in real time.
12. The method of any one of claims 1-11, wherein said method is performed on a computing device, and wherein said calculating and said obtaining are each performed concurrently using two corresponding processing modules of said computing device.
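The concurrent arrangement of claim 12 can be sketched with two threads (a simplification: the claim's processing modules could equally be, say, a GPU running the segmenter alongside a CPU computing flow). A worker produces reference masks in the background while the main loop keeps estimating masks from whichever reference is currently available; all names below are illustrative.

```python
import queue
import threading
import time

def slow_reference_mask(frame_idx):
    """Stand-in for a slow segmenter (e.g. the neural network of claim 6)."""
    time.sleep(0.005)                    # simulated inference latency
    return frame_idx, f"mask({frame_idx})"

def segmentation_worker(requests, results):
    # Runs on its own processing module: computes reference masks
    # concurrently with the per-frame estimation loop below.
    for frame_idx in iter(requests.get, None):
        results.put(slow_reference_mask(frame_idx))

requests, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=segmentation_worker, args=(requests, results))
worker.start()

n = 3                                    # frames per sequence
reference = (0, "mask(0)")               # assume an initial reference mask exists
estimates = []
for frame_idx in range(9):
    if frame_idx % n == 0:
        requests.put(frame_idx)          # kick off the next reference mask...
    estimates.append((frame_idx, reference[1]))   # ...while still estimating
    try:                                 # adopt a newer reference once ready
        reference = results.get_nowait()
    except queue.Empty:
        pass

requests.put(None)                       # sentinel: shut the worker down
worker.join()
```

The estimation loop never blocks on the segmenter, which is what allows the estimated masks to be output in real time (claim 11) even when the reference mask takes several frame intervals to compute.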
13. A system comprising: at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and
(ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
14. The system of claim 13, wherein said calculating is performed using an optical flow method.
15. The system of any one of claims 13-14, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
16. The system of any one of claims 13-15, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
17. The system of any one of claims 13-16,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask generated for a previous frame in said current sequence with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said previous frame.
18. The system of any one of claims 13-17, wherein said reference object segmentation mask is generated by applying a trained neural network.
19. The system of any one of claims 13-18, wherein said reference object segmentation mask comprises one or more objects in said video stream.
20. The system of any one of claims 13-19, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
21. The system of any one of claims 13-20, wherein said reference frame is a first frame in said preceding sequence.
22. The system of any one of claims 13-21, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
23. The system of any one of claims 13-22, wherein said program instructions are further executable to output said estimated segmentation masks in real time.
24. The system of any one of claims 13-23, wherein said at least one hardware processor comprises at least two processing modules, and wherein said calculating and said obtaining are each performed concurrently using a different one of said at least two processing modules.
25. A computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and
(ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
26. The computer program product of claim 25, wherein said calculating is performed using an optical flow method.
27. The computer program product of any one of claims 25-26, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
28. The computer program product of any one of claims 25-27, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
29. The computer program product of any one of claims 25-28,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask generated for a previous frame in said current sequence with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said previous frame.
30. The computer program product of any one of claims 25-29, wherein said reference object segmentation mask is generated by applying a trained neural network.
31. The computer program product of any one of claims 25-30, wherein said reference object segmentation mask comprises one or more objects in said video stream.
32. The computer program product of any one of claims 25-31, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
33. The computer program product of any one of claims 25-32, wherein said reference frame is a first frame in said preceding sequence.
34. The computer program product of any one of claims 25-33, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
35. The computer program product of any one of claims 25-34, wherein said program instructions are further executable to output said estimated segmentation masks in real time.
36. The computer program product of any one of claims 25-35, wherein said at least one hardware processor comprises at least two processing modules, and wherein said calculating and said obtaining are each performed concurrently using a different one of said at least two processing modules.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962912202P | 2019-10-08 | 2019-10-08 | |
US62/912,202 | 2019-10-08 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021070004A1 true WO2021070004A1 (en) | 2021-04-15 |
Family
ID=75436801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2020/059092 WO2021070004A1 (en) | 2019-10-08 | 2020-09-29 | Object segmentation in video stream |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021070004A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130121572A1 (en) * | 2010-01-27 | 2013-05-16 | Sylvain Paris | Methods and Apparatus for Tone Mapping High Dynamic Range Images |
US20150334398A1 (en) * | 2014-05-15 | 2015-11-19 | Daniel Socek | Content adaptive background foreground segmentation for video coding |
US20160093336A1 (en) * | 2014-07-07 | 2016-03-31 | Google Inc. | Method and System for Non-Causal Zone Search in Video Monitoring |
US20170337693A1 (en) * | 2016-05-23 | 2017-11-23 | Intel Corporation | Method and system of real-time image segmentation for image processing |
WO2018128741A1 (en) * | 2017-01-06 | 2018-07-12 | Board Of Regents, The University Of Texas System | Segmenting generic foreground objects in images and videos |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN113570610B (en) * | 2021-07-26 | 2022-05-13 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN113923493A (en) * | 2021-09-29 | 2022-01-11 | 北京奇艺世纪科技有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113923493B (en) * | 2021-09-29 | 2023-06-16 | 北京奇艺世纪科技有限公司 | Video processing method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6837158B2 (en) | Video identification and training methods, equipment, electronic devices and media | |
US10412462B2 (en) | Video frame rate conversion using streamed metadata | |
US20180285689A1 (en) | Rgb-d scene labeling with multimodal recurrent neural networks | |
JP2021507388A (en) | Instance segmentation methods and devices, electronics, programs and media | |
US10872251B2 (en) | Automated annotation techniques | |
US11222409B2 (en) | Image/video deblurring using convolutional neural networks with applications to SFM/SLAM with blurred images/videos | |
WO2021070004A1 (en) | Object segmentation in video stream | |
US11910001B2 (en) | Real-time image generation in moving scenes | |
CN110472599B (en) | Object quantity determination method and device, storage medium and electronic equipment | |
US11599974B2 (en) | Joint rolling shutter correction and image deblurring | |
JP2013537654A (en) | Method and system for semantic label propagation | |
US10878850B2 (en) | Method and apparatus for visualizing information of a digital video stream | |
CN111601013B (en) | Method and apparatus for processing video frames | |
EP3739503B1 (en) | Video processing | |
US11902571B2 (en) | Region of interest (ROI)-based upscaling for video conferences | |
Kryjak et al. | Real-time implementation of foreground object detection from a moving camera using the vibe algorithm | |
WO2023105800A1 (en) | Object detection device, object detection method, and object detection system | |
CN107025433B (en) | Video event human concept learning method and device | |
US10628913B2 (en) | Optimal data sampling for image analysis | |
Chae et al. | Siamevent: Event-based object tracking via edge-aware similarity learning with siamese networks | |
CN111277863B (en) | Optical flow frame interpolation method and device | |
US20230325964A1 (en) | Systems and methods for generating and running computer vision pipelines for processing of images and/or video | |
CN117237648B (en) | Training method, device and equipment of semantic segmentation model based on context awareness | |
EP2858034A1 (en) | Method and apparatus for generating temporally consistent depth maps | |
Archana et al. | Abnormal Frame Extraction and Object Tracking Hybrid Machine Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20874904 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.07.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20874904 Country of ref document: EP Kind code of ref document: A1 |