WO2021070004A1 - Object segmentation in video stream - Google Patents

Object segmentation in video stream

Info

Publication number
WO2021070004A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
video stream
object segmentation
segmentation mask
sequence
Application number
PCT/IB2020/059092
Other languages
French (fr)
Inventor
Eran Sela
Koby MALUK
Original Assignee
Spectalix Ltd.
Application filed by Spectalix Ltd.
Publication of WO2021070004A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/136 Incoming video signal characteristics or properties
    • H04N 19/137 Motion inside a coding unit, e.g. average field, frame or block difference
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N 19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/20 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N 19/23 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with coding of regions that are present throughout a whole video segment, e.g. sprites, background or mosaic


Abstract

A method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.

Description

OBJECT SEGMENTATION IN VIDEO STREAM
FIELD OF INVENTION
[0001] The invention relates to the field of computer image processing.
BACKGROUND OF THE INVENTION
[0002] Automatic region-of-interest (ROI) video object segmentation is a widely used technique that enables video and image editors to separate the foreground of a scene (a scene may be either a video clip or a still image) from the original scene’s background, and treat the foreground as a standalone visual layer. By modifying or replacing the background, creators can place the segmented objects in a different context and create a different visual meaning than the one conveyed by the original video/image (e.g., placing a person who was originally filmed in the street in a totally different location, such as the surface of the moon).
[0003] ROI video object segmentation is mostly done using machine learning algorithms which learn the various shape variants of the object to be segmented, and then detect this object in every frame of a given video stream. The segmentation processing for each frame may require a large amount of computing resources, especially in cases where the segmentation is done on a real-time video feed and the requirement is to draw the segmented objects on a new background with minimum delay. These requirements may result in a quality reduction of the composed (foreground on new background) visual outcome, such as a lower video frame rate and lower resolution.
[0004] The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.
SUMMARY OF THE INVENTION
[0005] The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.
[0006] There is provided, in an embodiment, a method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0007] There is also provided, in an embodiment, a system comprising at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0008] There is further provided, in an embodiment, a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream: (i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
[0009] In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.
[0011] Fig. 1 shows an exemplary system for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention;
[0012] Fig. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention; and
[0013] Fig. 3 is a schematic illustration of an iterative process for automated real-time object segmentation in a video stream, according to exemplary embodiments of the present invention.
DETAILED DESCRIPTION
[0014] Described herein are a system, method, and computer program product for automated real-time object segmentation in a video stream. In some embodiments, the present disclosure provides for continuous real-time object segmentation in a video stream, using a reduced amount of computing resources, and as such, may be particularly suited for mobile and similar applications with relatively modest computing power.
[0015] In the context of computer image processing, the term “segmentation,” as used herein, refers to the process of dividing a digital image into groups of pixels, based on some criteria (e.g., color, texture, etc.). The term “object segmentation” refers to a segmentation where each group of pixels is associated with one or more specific objects detected in the image. Object segmentation may identify a contiguous subset of the image, for example, a bounding box enclosing an area of the image containing the object, or a mask outlining the actual object and defining which pixels in the region are included in the object instance.
[0016] Video object segmentation is the task of separating a foreground object from its original background in a video sequence. Video object segmentation may be used in video analysis, editing, surveillance, and various commercial and consumer-oriented applications. However, real-time automatic video object segmentation entails performing a refined object segmentation in each frame of the stream, and as such requires considerable computing resources. Performing real-time video object segmentation on, e.g., mobile or similar platforms with limited computing power may result in relatively poor results in terms of quality and speed.
[0017] For example, typical video stream frame rates are, e.g., between 15 and 60 frames per second (fps). However, processing time for a full refined object segmentation in a single frame, using a typical mobile device chipset, may be between 20 and 50 ms, and thus full segmentation of every frame cannot be accomplished in real time.
[0018] Accordingly, in some embodiments, the present disclosure provides for a hybrid approach which combines (i) selective full ‘ground-truth’ refined object segmentation in frames at regular intervals, and (ii) pixel motion estimation for the intermediate frames of the interval, using, e.g., optical flow and/or similar techniques. The process then outputs an estimated object mask for a specified number of subsequent frames, based on the ground-truth segmentation as modified by the cumulative estimated pixel movement. The estimated output is reset at the next ground-truth segmented frame, to avoid accumulating estimation errors. Because pixel motion estimation is faster and less resource-intensive than full ground-truth segmentation, the overall process can output a continuous real-time segmented stream using a relatively modest computing platform, such as those used in mobile devices.
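By way of non-limiting illustration, the following is a minimal sketch of the hybrid loop just described, assuming OpenCV’s Farnebäck optical flow as the motion estimator and an arbitrary `segment` callable standing in for the refined segmentation model (both are illustrative choices, not prescribed by this disclosure); the backward-warping approximation in `warp_mask` is likewise an assumption:

```python
import cv2
import numpy as np

def warp_mask(mask, flow):
    """Backward-warp a mask by a dense per-pixel (dx, dy) flow field."""
    ys, xs = np.indices(mask.shape[:2]).astype(np.float32)
    # Each output pixel samples the position it is estimated to have come from.
    return cv2.remap(mask, xs - flow[..., 0], ys - flow[..., 1],
                     cv2.INTER_NEAREST)

def hybrid_masks(frames, segment, n=4):
    """Yield an estimated object mask for every frame: a full 'ground-truth'
    segmentation every n-th frame, flow-based propagation in between."""
    prev_gray, mask = None, None
    for idx, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if idx % n == 0:
            mask = segment(frame)  # slow, refined segmentation; resets drift
        else:
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            mask = warp_mask(mask, flow)  # fast per-frame propagation
        prev_gray = gray
        yield mask
```

Note that this simplified sketch segments the reference frame synchronously; the forward-propagation variant described below instead keeps propagating the previous sequence’s mask while the new ground-truth segmentation is being computed.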
[0019] In some embodiments, the present disclosure provides for a parallelized process which is particularly suited for platforms combining a processing unit with a dedicated acceleration processing component, e.g., platforms combining a central processing unit (CPU) and a graphics processing unit (GPU). Accordingly, selective full segmentation can be performed by, e.g., the GPU at specified intervals, which may be dictated by available computing resources. Simultaneously, an optical flow estimation process can run in parallel on the CPU, thereby increasing overall performance of the system. In more advanced chipsets, a digital signal processor (DSP), which is designed to run neural networks, may take over the full segmentation, while the GPU may process the parallelized computer vision (e.g., optical flow) algorithms.
[0020] In some embodiments, the present disclosure is based on dividing a video stream into consecutive sequences of frames, e.g., between 2 and 10 frames each.
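A hedged sketch of the division of labor described in paragraph [0019], under the same assumptions as above (`segment` wraps whatever accelerator-backed model is available; the single worker thread stands in for GPU/DSP dispatch):

```python
import cv2
from concurrent.futures import ThreadPoolExecutor

def run_split(frames, segment, n=4):
    """Submit full segmentation asynchronously every n-th frame, while
    Farneback optical flow keeps running on the CPU for every frame pair."""
    seg_futures, flows, prev_gray = [], [], None
    with ThreadPoolExecutor(max_workers=1) as accel:  # e.g., GPU/DSP-backed
        for idx, frame in enumerate(frames):
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if idx % n == 0:
                # Non-blocking: the flow loop below continues meanwhile.
                seg_futures.append(accel.submit(segment, frame))
            if prev_gray is not None:
                flows.append(cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0))
            prev_gray = gray
    return [f.result() for f in seg_futures], flows
```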
[0021] In some embodiments, the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, desired speed and quality outcomes, etc.
[0022] In some embodiments, the number of frames in a sequence may be dictated, e.g., by the computing power and processing times of the associated computing platform on which the process is to be performed. In some embodiments, the number of frames in a sequence may be dynamically adjusted, based, at least in part, on instant response times of the computing platform. Thus, for example (one possible adjustment policy is sketched after the list below):
• A first sequence may comprise, e.g., 6 frames, based on a frame segmentation processing time of, e.g., 80 ms; and
• a subsequent sequence may comprise, e.g., 3 frames, when instant processing times have dropped to, e.g., 40 ms.
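One hypothetical adjustment policy consistent with the above (the function and its clamping range are illustrative; the disclosure does not prescribe a formula) is to pick the smallest sequence length whose duration covers one measured segmentation time:

```python
def sequence_length(seg_time_ms: float, fps: float) -> int:
    """Pick n so that one full segmentation finishes within its own sequence."""
    frame_interval_ms = 1000.0 / fps
    n = int(seg_time_ms / frame_interval_ms) + 1
    return max(2, min(10, n))  # clamp to the 2-10 frame range noted above

# At 60 fps: sequence_length(40, 60) -> 3, matching the second bullet above.
```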
[0023] In some embodiments, for each sequence of frames, the present disclosure provides for detecting and segmenting one or more objects in a first frame of the sequence, to generate one or more masks of the objects in the video. In some embodiments, accurate object segmentation may be performed using a trained neural network, which determines a mask of the object to be segmented.
[0024] In some embodiments, the present disclosure then estimates pixel motion throughout subsequent frames in the sequence, wherein object location is estimated for each frame in the sequence by modifying the initial segmentation based on the cumulative estimated pixel motion from the last refined segmentation. The present disclosure then repeats this process with respect to the next sequence in the stream.
[0025] Thus, in some embodiments, the parallelized process continuously provides for:
• Refined ‘ground-truth’ object segmentation on selected frames at specified intervals (which may be dynamically adjusted throughout the video stream), and
• a continuous optical flow process which estimates pixel motion frame-to-frame throughout the video stream.
[0026] In some embodiments, in order to output a continuous stream of object-segmented video in real time, while allowing for the time lag caused by performing full ground-truth segmentation on an initial frame in the stream, the present disclosure may utilize a forward-propagation process. Accordingly, in some embodiments, for a current video sequence (of, e.g., between 2 and 10 frames), the present disclosure may generate a corresponding object-segmented output stream which obtains a segmentation mask generated for a first frame in the immediately-preceding sequence, and modify it frame-to-frame based on the accumulated pixel motion estimates at each point in the sequence. This process then iterates for each subsequent sequence, using the fully-segmented first frame from the immediately-preceding sequence to return to ground truth, and modifying it with current pixel motion estimates.
[0027] In some embodiments, a first sequence in a stream may be used as a ‘buffer’ sequence, which will not be outputted, to allow for the time lag in generating the initial full segmentation.
[0028] Fig. 1 illustrates an exemplary segmentation system 100 for automated real-time object segmentation in a video stream, in accordance with some embodiments of the present invention. Segmentation system 100 as described herein is only an exemplary embodiment of the present invention, and in practice may have more or fewer components than shown, may combine two or more of the components, or may have a different configuration or arrangement of the components. The various components of segmentation system 100 may be implemented in hardware, software, or a combination of both hardware and software. In various embodiments, segmentation system 100 may comprise a dedicated hardware device, or may form an addition to or extension of an existing medical device, such as a colposcope.
[0029] In some embodiments, segmentation system 100 may comprise a hardware processor 110 and a memory storage device 114. In some embodiments, segmentation system 100 may store in a non-volatile memory thereof, such as storage device 114, software instructions or components configured to operate a processing unit (also "hardware processor," "CPU," or simply "processor"), such as hardware processor 110. In some embodiments, the software components may include an operating system, including various software components and/or drivers for controlling and managing general system tasks (e.g., memory management, storage device control, power management, etc.) and facilitating communication between various hardware and software components. In some embodiments, system 100 may comprise one or more graphics processing units (GPUs). In some embodiments, hardware processor 110 comprises, e.g., a CPU, a GPU, and/or a DSP. In some embodiments, the GPU and the CPU may be separate, or the GPU may be integrated on the CPU, and the communication portion may be separated from or integrated on the CPU or the GPU, or the like.
[0030] In some embodiments, non-transient computer-readable storage device 114 (which may include one or more computer readable storage mediums) is used for storing, retrieving, comparing, and/or annotating captured frames. Image frames may be stored on storage device 114 based on one or more attributes, or tags, such as a time stamp, a user-entered label, or the result of an applied image processing method indicating the association of the frames, to name a few.
[0031] The software instructions and/or components operating hardware processor 110 may include instructions for receiving and analyzing multiple frames captured by a suitable imaging device. For example, hardware processor 110 may comprise an image processing module 111 and a neural network module 112. Image processing module 111 receives a video stream and applies one or more image processing algorithms thereto. In some embodiments, image processing module 111 comprises one or more algorithms configured to perform object detection, classification, segmentation, and/or any other similar operation, using any suitable image processing algorithm, technique, and/or feature extraction process. The incoming video streams may come from various imaging devices. The video streams received by the image processing module 111 may vary in resolution, frame rate (e.g., between 15 and 60 frames per second), format, and protocol according to the characteristics and purpose of their respective source device. Depending on the embodiment, the image processing module 111 can route video streams through various processing functions, or to an output circuit that sends the processed video stream for presentation, e.g., on a display, to a recording system, across a network, or to another logical destination. The image processing module 111 may apply video stream processing algorithms alone or in combination. Image processing module 111 may also facilitate logging or recording operations with respect to a video stream. Some or all of the functionality of the image processing module 111 may be facilitated through a video stream recording system or a video stream processing system.
[0032] In some embodiments, system 100 may be configured to obtain an object segmentation result of a specified frame, e.g., a first frame, in a first sequence of the video stream. In some embodiments, system 100 is configured to perform object detection and semantic segmentation by processing an image to generate an output which defines at least some of (i) regions in the image that depict an instance of a respective object, (ii) a respective object type of the object instance (e.g., vehicle, cat, person, and the like) depicted in each region, and (iii) a respective segmentation of the object instance depicted in each region. In some embodiments, segmentation system 100 is configured to process an image to generate object segmentation data. The object segmentation data defines a respective segmentation of the object instance depicted in each region. A segmentation of the object instance depicted in a region defines whether each pixel in the region is included in the object instance. In some embodiments, segmentation system 100 generates the object detection data and the object segmentation data using neural network module 112, which can be trained using machine learning training techniques.
[0033] Neural network module 112 may comprise a convolutional neural network (i.e., which includes one or more convolutional neural network layers), and can be implemented to embody any appropriate convolutional neural network architecture, e.g., U-Net, Mask R-CNN, DeepLab, and the like. See, e.g., Olaf Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597v1 [cs.CV], 18 May 2015. In a particular example, neural network module 112 may include an input layer followed by a sequence of shared convolutional neural network layers. The output of the final shared convolutional neural network layer may be provided to a sequence of one or more additional neural network layers that are configured to generate the object detection data; however, other appropriate neural network architectures may also be used. The output of the final shared convolutional neural network layers may be provided to a different sequence of one or more additional neural network layers.
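As a concrete, non-prescriptive example of such a network, an off-the-shelf DeepLabV3 model (one of the architecture families named above) can produce the per-frame ground-truth mask; the torchvision model, the preprocessing, and the ‘person’ target class below are all illustrative assumptions rather than anything fixed by this disclosure:

```python
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment_person(frame_rgb):
    """Return a binary uint8 mask of the 'person' class (PASCAL VOC index 15)."""
    with torch.no_grad():
        logits = model(preprocess(frame_rgb).unsqueeze(0))["out"][0]
    return (logits.argmax(0) == 15).numpy().astype("uint8") * 255
```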
[0034] In some embodiments, system 100 may also be configured to estimate motion between image frames, i.e., determine motion vectors that describe the transformation from one 2D image to another (usually, from adjacent frames in a video sequence). Motion estimation may be defined as the process of finding corresponding points between two images (e.g., video frames), wherein the points that correspond to each other in two views of a scene or object may be considered to be the same point in that scene or on that object. In some embodiments, the present disclosure may apply optical flow and/or another similar computer vision technique or algorithm to estimate motion between frames. See, e.g., Farnebäck G. (2003) Two-Frame Motion Estimation Based on Polynomial Expansion. In: Bigun J., Gustavsson T. (eds) Image Analysis. SCIA 2003. Lecture Notes in Computer Science, vol 2749. Springer, Berlin, Heidelberg.
[0035] For consecutive image sequences such as found in video presentations, optical flow may be defined as the velocity field which warps one image into another (usually very similar) image. In some embodiments, an optical flow estimate comprises an estimate of a translation that describes any motion of a pixel from a position in one image to a position in a subsequent image. In some embodiments, optical flow estimation returns, with respect to each pixel and/or group of pixels, a change in coordinates (x, y) of the pixel. In some embodiments, pixel motion between pairs of images may be estimated using additional and/or other methods. In some embodiments, system 100 may also compute the cumulative pixel coordinate difference accumulated over several pairs of consecutive images.
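A minimal sketch of such a cumulative computation, using the cited Farnebäck method as implemented in OpenCV (an illustrative choice):

```python
import cv2

def cumulative_flow(gray_frames):
    """Sum per-pixel (dx, dy) displacements over consecutive frame pairs."""
    total = None
    for prev, nxt in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        total = flow if total is None else total + flow
    return total  # shape (H, W, 2): accumulated (dx, dy) per pixel
```

Summing flows at fixed pixel coordinates is an approximation to true flow composition (strictly, each later flow should be sampled at the displaced positions); the periodic reset to a ground-truth mask described above bounds the error this introduces.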
[0036] Fig. 2 is a flowchart detailing the functional steps in a process for automated real-time object segmentation in a video stream.
[0037] At step 202, in some embodiments, segmentation system 100 in Fig. 1 may be configured to receive a video stream depicting, e.g., one or more objects, such as humans, pets, inanimate objects, and the like.
[0038] At step 204, in some embodiments, system 100 may be configured to divide the video stream into sequences of a specified number of frames, e.g., between 2 and 10 frames each. In some embodiments, the number of frames in each sequence may be determined by a combination of parameters, including, but not limited to, type and architecture of the computing platform, chipset performance, desired speed and quality outcomes, etc.
[0039] At step 206, system 100 may perform an iterative process comprising:
• full ground-truth object segmentation on a first frame in each interval, and
• a continuous optical flow process to estimate pixel motion frame-to-frame throughout the video stream.
[0040] Fig. 3 is a schematic illustration of the iterative process of step 206. In some embodiments, in order to output a continuous stream of object-segmented video in real time (i.e., outputting between 15 and 60 frames per second), while allowing for the time lag caused by performing full ground-truth segmentation on an initial frame in the stream (e.g., 20-50 ms), the present disclosure may utilize a forward-propagation process, wherein each outputted frame is based on:
(i) a ‘ground-truth’ segmentation mask generated for the first frame in the immediately-preceding sequence, as
(ii) modified by the cumulative optical flow calculated as of the last segmented mask.
[0041] Accordingly, as illustrated in Fig. 3, in some embodiments, system 100 may receive a current sequence comprising n frames, e.g., sequence j comprising 4 frames (frame j + 0 through frame j + 3). System 100 may then perform the following sub-steps of iterative step 206 with respect to the current sequence:
(i) Step 206a: Generate a full ground-truth object segmentation for one or more objects in frame j + 0 of the current sequence j, to be propagated forward for use by the next sequence k;
(ii) Step 206b: Estimate an object mask in frame j + 0 by modifying a refined segmentation of frame i + 0, received from the previous sequence i, to incorporate all cumulative pixel motion estimates from frame i + 0 to frame i + 3 (i.e., Σ(Δx, Δy) of the optical flows between ([i + 0] → [i + 1]), ([i + 1] → [i + 2]), and ([i + 2] → [i + 3]));
(iii) Step 206c: Continuously estimate pixel motion between each sequential pair of frames in current sequence j (i.e., ([j + 0] → [j + 1]), ([j + 1] → [j + 2]), and ([j + 2] → [j + 3])); and
(iv) Step 206d: Use the corresponding pixel motion estimates from step 206c to generate estimated object masks for each current frame ([j + 1] through [j + 3]).
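These four sub-steps can be summarized in the following hedged Python sketch of a single iteration; segment_ground_truth, flow_fn, and warp_fn are hypothetical stand-ins (e.g., a trained segmentation network, the Farnebäck estimator sketched earlier, and the mask-warping helper sketched further below), and in practice step 206a would run concurrently with the optical-flow updates rather than sequentially as shown.

```python
# Illustrative sketch of one iteration of step 206 over sequence j.
import numpy as np

def process_sequence(seq_j, gt_mask_i0, cumulative_flow_i,
                     segment_ground_truth, flow_fn, warp_fn):
    # Step 206a: full ground-truth segmentation of frame j+0,
    # propagated forward for use by the next sequence k.
    gt_mask_j0 = segment_ground_truth(seq_j[0])

    # Step 206b: estimate the frame j+0 mask by warping the previous
    # sequence's ground-truth mask with the flow accumulated over it.
    est_mask = warp_fn(gt_mask_i0, cumulative_flow_i)
    outputs = [est_mask]

    # Steps 206c-206d: flow between each sequential frame pair warps the
    # previous estimate, and is also accumulated for sequence k's step 206b.
    cumulative_flow_j = np.zeros_like(cumulative_flow_i)
    for prev_f, curr_f in zip(seq_j, seq_j[1:]):
        flow = flow_fn(prev_f, curr_f)
        cumulative_flow_j += flow
        est_mask = warp_fn(est_mask, flow)
        outputs.append(est_mask)

    return outputs, gt_mask_j0, cumulative_flow_j
```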
[0042] Accordingly, with continued reference to Fig. 3, in some embodiments, the following forms the basis for generating an estimated object mask for each frame in the current sequence j:
• Estimated mask for frame j + 0 = frame i + 0 mask + cumulative pixel motion ([i + 0] → [i + 3]).
• Estimated mask for frame j + 1 = estimated mask for frame j + 0 + pixel motion ([j + 0] → [j + 1]).
• Estimated mask for frame j + 2 = estimated mask for frame j + 1 + pixel motion ([j + 1] → [j + 2]).
• Estimated mask for frame j + 3 = estimated mask for frame j + 2 + pixel motion ([j + 2] → [j + 3]).
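The "+ pixel motion" operation above is left open by the disclosure; one plausible realization (an assumption made for illustration, not the patent's stated method) is to displace each mask pixel by its (Δx, Δy) flow vector via inverse remapping:

```python
# Illustrative mask warp: output pixel (x, y) samples the input mask at
# (x - dx, y - dy), using the forward flow as a first-order backward map.
import cv2
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary HxW mask by an HxWx2 (dx, dy) flow field."""
    h, w = mask.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    # Nearest-neighbor interpolation keeps a binary mask binary.
    return cv2.remap(mask, map_x, map_y, interpolation=cv2.INTER_NEAREST)
```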
[0043] In some embodiments, the full object segmentation mask generated in step 206a above is propagated forward, to be used as the basis for estimating a mask for frame k + 0 in subsequent sequence k.
[0044] In some embodiments, at step 208, system 100 may be configured to output object segmentation results in the current frame sequence j, e.g., to a media device to be displayed on a display monitor, a device screen, etc. In some embodiments, the output stream may be directed to another computing platform for continued processing and/or manipulation. In some embodiments, step 208 continuously outputs the results of iterative step 206, to generate a continuous real time output stream of the input video with object segmentation results.
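A short, assumed driver loop shows how the sketches above could be chained to continuously emit segmented frames; the emit callback standing in for the display or downstream computing platform is hypothetical.

```python
# Illustrative real-time driver: iterate sequences, carrying the
# ground-truth mask and cumulative flow forward from sequence to sequence.
import numpy as np

def run_pipeline(frames, emit, segment_ground_truth, flow_fn, warp_fn, n=4):
    prev_gt_mask = None
    prev_cum_flow = None
    for seq in split_into_sequences(frames, n):
        if prev_gt_mask is None:
            # First sequence: no predecessor to propagate from, so
            # bootstrap with a ground-truth mask and zero cumulative flow.
            prev_gt_mask = segment_ground_truth(seq[0])
            prev_cum_flow = np.zeros((*seq[0].shape[:2], 2), np.float32)
        masks, prev_gt_mask, prev_cum_flow = process_sequence(
            seq, prev_gt_mask, prev_cum_flow,
            segment_ground_truth, flow_fn, warp_fn)
        for frame, mask in zip(seq, masks):
            emit(frame, mask)
```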
[0045] The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
[0046] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transient (i.e., non-volatile) medium.
[0047] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0048] Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
[0049] Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0050] These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0051] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0052] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0053] The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

What is claimed is:
1. A method comprising: receiving a video stream; continuously calculating pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtaining a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and
(ii) estimating an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
2. The method of claim 1, wherein said calculating is performed using an optical flow method.
3. The method of any one of claims 1-2, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
4. The method of any one of claims 1-3, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
5. The method of any one of claims 1-4,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask, generated for a previous frame in said current sequence, with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said previous frame.
6. The method of any one of claims 1-5, wherein said reference object segmentation mask is generated by applying a trained neural network.
7. The method of any one of claims 1-6, wherein said reference object segmentation mask comprises one or more objects in said video stream.
8. The method of any one of claims 1-7, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
9. The method of any one of claims 1-8, wherein said reference frame is a first frame in said preceding sequence.
10. The method of any one of claims 1-9, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
11. The method of any one of claims 1-10, further comprising outputting said estimated segmentation masks in real time.
12. The method of any one of claims 1-11, wherein said method is performed on a computing device, and wherein said calculating and said obtaining are each performed concurrently using two corresponding processing modules of said computing device.
13. A system comprising: at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and (ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
14. The system of claim 13, wherein said calculating is performed using an optical flow method.
15. The system of any one of claims 13-14, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
16. The system of any one of claims 13-15, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
17. The system of any one of claims 13-16,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask, generated for a previous frame in said current sequence, with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said previous frame.
18. The system of any one of claims 13-17, wherein said reference object segmentation mask is generated by applying a trained neural network.
19. The system of any one of claims 13-18, wherein said reference object segmentation mask comprises one or more objects in said video stream.
20. The system of any one of claims 13-19, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
21. The system of any one of claims 13-20, wherein said reference frame is a first frame in said preceding sequence.
22. The system of any one of claims 13-21, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
23. The system of any one of claims 13-22, wherein said program instructions are further executable to output said estimated segmentation masks in real time.
24. The system of any one of claims 13-23, wherein said at least one hardware processor comprises at least two processing modules, and wherein said calculating and said obtaining are each performed concurrently using a different one of said at least two processing modules.
25. A computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive a video stream; continuously calculate pixel motion for each frame in said video stream, relative to an immediately-preceding frame in said video stream; and with respect to each current sequence of n frames in said video stream:
(i) obtain a reference object segmentation mask of a reference frame in a preceding sequence in said video stream (reference mask), and
(ii) estimate an object segmentation mask for each current frame in said current sequence, by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said current frame.
26. The computer program product of claim 25, wherein said calculating is performed using an optical flow method.
27. The computer program product of any one of claims 25-26, wherein n is determined based, at least in part, on a processing time associated with generating said reference object segmentation mask.
28. The computer program product of any one of claims 25-27, wherein n is dynamically adjusted for at least some of said sequences in said video stream, based, at least in part, on changes in said processing time.
29. The computer program product of any one of claims 25-28,
(i) wherein said estimating comprises estimating an object segmentation mask for said current frame by modifying an estimated object segmentation mask, generated for a previous frame in said current sequence, with a sum of said pixel motion calculations accumulated from said previous frame through said current frame; and
(ii) wherein said estimated object segmentation mask is generated by modifying the reference object segmentation mask with a sum of said pixel motion calculations accumulated from said reference frame through said previous frame.
30. The computer program product of any one of claims 25-29, wherein said reference object segmentation mask is generated by applying a trained neural network.
31. The computer program product of any one of claims 25-30, wherein said reference object segmentation mask comprises one or more objects in said video stream.
32. The computer program product of any one of claims 25-31, wherein said preceding sequence is an immediately-preceding sequence in said video stream.
33. The computer program product of any one of claims 25-32, wherein said reference frame is a first frame in said preceding sequence.
34. The computer program product of any one of claims 25-33, wherein said obtaining and said estimating are repeated iteratively for each consecutive sequence in said video stream.
35. The computer program product of any one of claims 25-34, wherein said program instructions are further executable to output said estimated segmentation masks in real time.
36. The computer program product of any one of claims 25-35, wherein said at least one hardware processor comprises at least two processing modules, and wherein said calculating and said obtaining are each performed concurrently using a different one of said at least two processing modules.
PCT/IB2020/059092 2019-10-08 2020-09-29 Object segmentation in video stream WO2021070004A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962912202P 2019-10-08 2019-10-08
US62/912,202 2019-10-08

Publications (1)

Publication Number Publication Date
WO2021070004A1

Family

ID=75436801

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/059092 WO2021070004A1 (en) 2019-10-08 2020-09-29 Object segmentation in video stream

Country Status (1)

Country Link
WO (1) WO2021070004A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130121572A1 (en) * 2010-01-27 2013-05-16 Sylvain Paris Methods and Apparatus for Tone Mapping High Dynamic Range Images
US20150334398A1 (en) * 2014-05-15 2015-11-19 Daniel Socek Content adaptive background foreground segmentation for video coding
US20160093336A1 (en) * 2014-07-07 2016-03-31 Google Inc. Method and System for Non-Causal Zone Search in Video Monitoring
US20170337693A1 (en) * 2016-05-23 2017-11-23 Intel Corporation Method and system of real-time image segmentation for image processing
WO2018128741A1 (en) * 2017-01-06 2018-07-12 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model
CN113570610B (en) * 2021-07-26 2022-05-13 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model
CN113923493A (en) * 2021-09-29 2022-01-11 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium
CN113923493B (en) * 2021-09-29 2023-06-16 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.07.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20874904

Country of ref document: EP

Kind code of ref document: A1