WO2020043296A1 - Device and method for separating a picture into foreground and background using deep learning - Google Patents

Device and method for separating a picture into foreground and background using deep learning

Info

Publication number
WO2020043296A1
WO2020043296A1 (PCT/EP2018/073381; EP2018073381W)
Authority
WO
WIPO (PCT)
Prior art keywords
map
resolution
picture
decoder
activation
Prior art date
Application number
PCT/EP2018/073381
Other languages
English (en)
Inventor
Thai V HOANG
Markus Brenner
Hongbin Wang
Jian Tang
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2018/073381 priority Critical patent/WO2020043296A1/fr
Priority to CN201880097060.6A priority patent/CN112639830A/zh
Publication of WO2020043296A1 publication Critical patent/WO2020043296A1/fr

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • Embodiments of the present invention relate to the task of separating a picture, e.g. a picture of a video, particularly of a surveillance video, into foreground and background. Specifically, they aim to separate moving foreground objects from a static background scene.
  • the present invention presents a device and method, which employ a Convolutional Neural Network (CNN), i.e. perform the separation based on deep learning.
  • Segmentation is a key component of conventional video analytics, and is used to extract moving foreground objects from static background scenes.
  • segmentation can be seen as the grouping of pixels of the picture into regions that represent moving objects. It is essential to achieve high segmentation accuracy, since segmentation is one of the first steps in many processing pipelines. Current segmentation techniques do not deliver satisfactory results for a range of different camera recording conditions while maintaining the low computational complexity required for real-time processing.
  • Since surveillance cameras have a stationary position most of the time, they record the same background scene with moving foreground objects of interest. Conventional approaches usually exploit this assumption, and extract the moving objects from each picture of the video by removing the stationary regions of the picture.
  • Background subtraction is one such conventional approach, and is based on the "difference" between a current picture and a reference picture, often called "background model". Variants of this approach depend on how the background model is constructed and how the "difference" operation is defined at the pixel level. For example, the background model can be estimated as the median at each pixel location of all the pictures (or frames) in a sliding window close to the current picture (or current frame). And the "difference" can be defined as the difference in pixel intensity between the current picture and the background model at each pixel location.
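  • As an illustration of this median-based background model and pixel-level difference, a minimal Python/NumPy sketch is given below; the array layout, window handling and threshold value are assumptions for the sketch, not taken from this application.

```python
import numpy as np

def median_background_model(frames: np.ndarray) -> np.ndarray:
    """Estimate the background model as the per-pixel median over a sliding
    window of frames.

    frames: array of shape (N, H, W, 3) holding the N most recent RGB frames.
    Returns an (H, W, 3) background model image.
    """
    return np.median(frames, axis=0).astype(frames.dtype)

def difference_mask(frame: np.ndarray, background: np.ndarray, threshold: int = 30) -> np.ndarray:
    """Classic background subtraction: per-pixel intensity difference between the
    current frame and the background model, followed by thresholding.
    Returns a boolean (H, W) foreground mask."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16)).sum(axis=-1)
    return diff > threshold
```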
  • Although some background subtraction techniques are relatively fast, and a number of them have been widely used for surveillance video analytics, these techniques have several limitations:
  • Color similarity between the foreground and background regions may create holes or even break the foreground masks into disconnected blobs.
  • Semantic/instance segmentation techniques in computer vision have improved substantially in recent years, due to the use of deep learning with ever-increasing computing resources and training data. Semantic segmentation is the process of associating each pixel of a picture with a class label (such as "person", "vehicle", "bicycle", "tree", ...), while instance segmentation combines object detection with semantic segmentation to label each object instance using a unique instance label. Even though these techniques are starting to be used in advanced perception systems, such as autonomous vehicles, they are not designed explicitly for surveillance video applications, and thus do not exploit in their algorithmic formulation the fact that surveillance cameras are stationary. The performance of these techniques in foreground object extraction is thus sub-optimal.
  • "Background subtraction/semantic segmentation combination" is a hybrid technique, which consists of leveraging object-level semantics to address some challenging scenarios for background subtraction. More precisely, the technique combines a "probability map" output from semantic segmentation with the output of a background subtraction technique, in order to reduce false positives.
  • this hybrid technique has to run two models separately in parallel, and does not directly use deep learning in an end-to-end solution for extracting foreground moving objects.
  • “Deep learning-based background subtraction” is a relatively new technique. There are some conventional approaches that use deep neural networks to solve the foreground object extraction from surveillance videos. Characteristic features and disadvantages of these approaches using deep learning are summarized below:
  • embodiments of the present invention aim to improve the conventional approaches.
  • An objective is to provide a segmentation technique which delivers high performance and is robust to different recording conditions.
  • embodiments of the invention aim to provide a lightweight device and a method for separating pictures of surveillance videos in an improved manner.
  • Embodiments of the present invention are defined in the enclosed independent claims.
  • Advantageous implementations of the present invention are further defined in the dependent claims.
  • embodiments of the present invention propose a segmentation technique, which is developed specifically for surveillance videos. Contrary to conventional semantic segmentation, which assigns a category label to each pixel of a picture to indicate the object or stuff category the pixel belongs to, the proposed segmentation technique is able to assign a binary value to each pixel to indicate whether the pixel belongs to a moving object, or in other words, whether it belongs to a foreground object of interest or to the background.
  • the output is, for example, a binary map indicating for each pixel of the picture (or image or frame) whether it is associated with a foreground object of interest or not (e.g. with the background).
  • the CNN input is a picture and a background model image, preferably in high-resolution RGB (6 channels in total).
  • the CNN output is a 1-channel probability map of the same resolution as the input, i.e. as the picture.
  • the probability value at each pixel location indicates the confidence that the pixel of the picture belongs to a foreground moving object. Then, preferably, a decision on foreground/background per pixel is forced by thresholding to obtain a binary map.
  • the CNN has an encoder-decoder architecture for multi-resolution feature extraction in the encoder, and multi-resolution foreground activation maps in the decoder.
  • multiple skip connections are provided from the encoder to the decoder to help restore fine boundary details of the activation maps at multi-resolution levels.
  • the training of the CNN optimizes the multi-resolution binary cross entropy loss by using multi-resolution activation maps.
  • the appearance of holes in each activation map or even breakdown of an activation map into disconnected regions can be avoided.
  • a first aspect of the invention provides a device for separating a picture into foreground and background configured to employ a CNN, to receive as an input the picture and a background model image, generate a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generate a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and output a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.
  • The "picture" may be a still image or may be a picture of a video. That is, the picture may be one picture in a sequence of pictures which forms a video. Accordingly, the picture may also be a frame of a video.
  • the video may specifically be a surveillance video, which is typically taken by a stationary surveillance video camera.
  • Each value of a "feature map" indicates whether one or more determined features are present at one of multiple different regions of the input.
  • the resolution of feature maps is gradually reduced so that longer range information is more easily captured in the deeper feature map.
  • Each value of an "activation map" indicates a confidence that one of multiple different regions of a corresponding feature map is associated with a foreground object or with background. That is, the activation maps may be considered to represent foreground masks.
  • the resolution of the activation maps is gradually increased so that object details are better recovered in the deeper activation map.
  • A "probability map" can be obtained by applying a sigmoid function to a 1-channel "activation map".
  • the probability map is also 1-channel, i.e. a 1-channel probability map, and its values are, e.g., in the range [0, 1].
  • the device is configured to threshold the output 1-channel probability map to get a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture is associated with a foreground object or with a background object.
  • the device can easily produce an accurate separation of the picture into foreground and background.
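  • A minimal PyTorch sketch of this sigmoid-and-threshold post-processing step is given below; the tensor layout and the threshold value of 0.5 are illustrative assumptions, not prescribed by this text.

```python
import torch

def probability_and_binary_mask(activation: torch.Tensor, threshold: float = 0.5):
    """activation: 1-channel activation map of shape (N, 1, H, W), same resolution
    as the input picture.

    Returns the 1-channel probability map (values in [0, 1]) and the binary mask
    obtained by thresholding it (True = foreground, False = background)."""
    probability_map = torch.sigmoid(activation)   # confidence of "foreground" per pixel
    binary_mask = probability_map > threshold     # forced foreground/background decision
    return probability_map, binary_mask
```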
  • the input includes a 3-channel [particularly RGB] high-resolution background model image and a 3-channel [particularly RGB] picture of similar resolution.
  • the device does not need to perform a greyscale conversion, i.e. more information can be used for the separation of the picture.
  • the CNN includes an encoder-decoder architecture, the encoder is configured to generate the plurality of feature maps of different resolution, and the decoder is configured to generate the plurality of activation maps of different resolution.
  • This structure allows a particular high performance and accurate separation of the picture into foreground and background.
  • the CNN comprises an encoder with a plurality of consecutive encoder layers and a decoder with a plurality of consecutive decoder layers, the encoder is configured to generate one of the plurality of feature maps per encoder layer, wherein the first encoder layer is configured to generate and downsample a feature map from the received input, and each further encoder layer is configured to generate and downsample a further feature map based on the feature map generated by the previous encoder layer, and the decoder is configured to generate one of the plurality of activation maps per decoder layer, wherein the first decoder layer is configured to upsample the feature map generated by the last encoder layer and generate an activation map based on the upsampled feature map, and each further decoder layer is configured to upsample the activation map generated by the previous decoder layer and generate a further activation map based on the upsampled activation map.
  • each encoder layer contains at least one convolutional filter configured to operate on respectively the input or the feature map of the previous encoder layer, in order to generate a feature map
  • each decoder layer contains at least one convolutional filter configured to operate on respectively the feature map of the last encoder layer or the activation map of the previous decoder layer, in order to generate an activation map.
  • each encoder layer is configured to reduce the resolution of the feature map by performing a strided convolution or pooling operation
  • each decoder layer is configured to increase the resolution of the feature map of the last encoder layer or of the activation map generated by the previous decoder layer by performing a transposed convolution or unpooling operation.
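  • For illustration only, the two resolution-changing operations mentioned above could look as follows in PyTorch; the channel counts and kernel sizes are assumptions for the sketch.

```python
import torch.nn as nn

# Encoder side: a strided convolution halves the spatial resolution of a feature map
# (a pooling layer such as nn.MaxPool2d(2) would be the alternative mentioned above).
downsample = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

# Decoder side: a transposed convolution doubles the spatial resolution of an activation map
# (an unpooling layer such as nn.MaxUnpool2d(2) would be the alternative mentioned above).
upsample = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=2, stride=2)
```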
  • the CNN further comprises a plurality of skip connections, each skip connection connects one of the further encoder layers, which is configured to generate a feature map of a certain resolution, with one of the further decoder layers, which is configured to generate an activation map of a same resolution or the most similar resolution, and said further decoder layer is configured to generate the activation map based on the activation map of the previous decoder layer and the feature map generated by the encoder layer to which it is connected via the skip connection.
  • a further activation map may be generated based on a concatenation of the feature map from an encoder layer with similar resolution (due to the skip connection) and an upsampled activation map generated by previous decoder layer.
  • the skip connections are beneficial for restoring fine boundary details in the multi-resolution activation maps.
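  • A minimal sketch of such a skip connection is given below, assuming the encoder feature map and the upsampled activation map have matching spatial resolution; the module and parameter names are illustrative, not taken from this application.

```python
import torch
import torch.nn as nn

class SkipDecoderLayer(nn.Module):
    """Decoder layer that fuses the upsampled activation map of the previous layer
    with the skip-connected encoder feature map of the same (or most similar)
    resolution via channel-wise concatenation."""

    def __init__(self, act_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(act_channels, act_channels, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(act_channels + skip_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, activation_map, encoder_feature_map):
        upsampled = self.up(activation_map)                          # double the resolution
        fused = torch.cat([upsampled, encoder_feature_map], dim=1)   # skip connection
        return self.relu(self.conv(fused))                           # new activation map
```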
  • each of the further encoder layers is configured to generate a feature map including more channels than included in the feature map generated by the previous encoder layer.
  • each of the further decoder layers is configured to generate an activation map including fewer channels than included in the activation map generated by the previous decoder layer.
  • each decoder layer is further configured to output a 1-channel activation map estimation
  • the device is configured to calculate a multi-resolution loss based on all the output 1-channel activation map estimations of the decoder layers, and upsample each 1-channel activation map estimation and use it as input to the next decoder layer.
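  • A sketch of how each decoder layer could expose a 1-channel activation map estimation and pass an upsampled version of it to the next decoder layer is given below; it is a simplified reading of this implementation form, with the 1x1 projection and bilinear upsampling being assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskEstimator(nn.Module):
    """Projects a multi-channel activation map to a 1-channel activation map
    estimation at the current decoder layer's resolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, activation_map):
        estimation = self.proj(activation_map)          # (N, 1, h, w) estimation used for the loss
        # Upsampled estimation that can be fed as an extra input to the next decoder layer.
        upsampled = F.interpolate(estimation, scale_factor=2, mode="bilinear", align_corners=False)
        return estimation, upsampled
```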
  • a second aspect of the invention provides a hardware implementation of a CNN configured to receive as an input a picture and a background model image, generate a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generate a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and output a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.
  • a third aspect of the invention provides a method of employing a CNN for separating a picture into foreground and background, the method comprising receiving as an input the picture and a background model image, generating a plurality of feature maps of different resolution based on the input, wherein the resolution of feature maps is gradually reduced, generating a plurality of activation maps of different resolution based on the plurality of feature maps of different resolution, wherein the resolution of activation maps is gradually increased, and outputting a 1-channel probability map having the same resolution as the picture, wherein each pixel of the output 1-channel probability map corresponds to a pixel of the picture and indicates a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.
  • the method comprises thresholding the output 1-channel probability map to get a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture is associated with a foreground object or with a background object.
  • the input includes a 3-channel [particularly RGB] high-resolution background model image and a 3-channel [particularly RGB] picture.
  • the employed CNN includes an encoder-decoder architecture, the encoder generates the plurality of feature maps of different resolution, and the decoder generates the plurality of activation maps of different resolution.
  • the employed CNN comprises an encoder with a plurality of consecutive encoder layers and a decoder with a plurality of consecutive decoder layers, the encoder generates one of the plurality of feature maps per encoder layer, wherein the first encoder layer generates and downsamples a feature map from the received input, and each further encoder layer generates and downsamples a further feature map based on the feature map generated by the previous encoder layer, and the decoder generates one of the plurality of activation maps per decoder layer, wherein the first decoder layer upsamples the feature map generated by the last encoder layer and generates an activation map based on the upsampled feature map, and each further decoder layer upsamples the activation map generated by the previous decoder layer and generates a further activation map based on the upsampled activation map.
  • each encoder layer contains at least one convolutional filter operating on respectively the input or the feature map of the previous encoder layer, in order to generate a feature map
  • each decoder layer contains at least one convolutional filter operating on respectively the feature map of the last encoder layer or the activation map of the previous decoder layer, in order to generate an activation map.
  • each encoder layer reduces the resolution of the feature map by performing a strided convolution or pooling operation
  • each decoder layer increases the resolution of the feature map of the last encoder layer or of the activation map generated by the previous decoder layer by performing a transposed convolution or unpooling operation.
  • the CNN further comprises a plurality of skip connections, each skip connection connects one of the further encoder layers, which generates a feature map of a certain resolution, with one of the further decoder layers, which generates an activation map of a same resolution or the most similar resolution, and said further decoder layer generates the activation map based on the activation map of the previous decoder layer and the feature map generated by the encoder layer to which it is connected via the skip connection.
  • each of the further encoder layers generates a feature map including more channels than included in the feature map generated by the previous encoder layer.
  • each of the further decoder layers generates an activation map including fewer channels than included in the activation map generated by the previous decoder layer.
  • each decoder layer further outputs a 1-channel activation map estimation
  • the method comprises calculating a multi-resolution loss based on all the output 1-channel activation map estimations of the decoder layers, and upsampling each 1-channel activation map estimation and using it as input to the next decoder layer.
  • a fourth aspect of the invention provides a computer program product comprising program code for performing, when implemented on a processor, a method according to the third aspect or any of its implementation forms.
  • a fifth aspect of the invention provides a computer comprising at least one memory and at least one processor, which are configured to store and execute program code to perform the method according to the fourth aspect or any of its implementation forms.
  • Not scene-specific, i.e. a single model is pre-trained to work on all scenes.
  • No picture resizing required, i.e. pictures or videos input to the CNN can have a resolution similar to the resolution used for video recording, for high segmentation performance. For example, if a video is recorded at 1920x1080, the CNN will accept input data of size 1920x1080.
  • Lightweight, because the CNN architecture is carefully designed so that it can perform foreground object extraction on 1920x1080 videos in real time using a common GPU.
  • FIG. 1 shows a device according to an embodiment of the invention.
  • FIG. 2 shows a device according to an embodiment of the invention.
  • FIG. 3 shows an encoder of a CNN of a device according to an embodiment of the invention.
  • FIG. 4 shows a decoder of a CNN of a device according to an embodiment of the invention.
  • FIG. 5 shows a method according to an embodiment of the invention.
  • FIG. 6 shows a comparison between results obtained by the invention and results obtained by conventional background subtraction techniques.
  • FIG. 1 shows a device 100 according to an embodiment of the invention.
  • the device 100 is configured to separate a picture 101 into foreground and background, e.g. into moving objects and a static scene.
  • the device 100 of FIG. 1 is configured to employ a CNN (CNN model, CNN architecture), i.e. is configured to perform the separation of the picture 101 by using deep learning.
  • the device 100 may be an image processor, computer, microprocessor, or the like, or multiples thereof or any combination thereof, that implements the CNN.
  • the CNN is configured to receive the picture 101 and a background model image 102 as an input 101, 102.
  • the background model image 102 may be an image of the scene which is monitored by a surveillance video camera also providing the picture, and which is taken in advance (or at some determined time) without any (moving) foreground objects, or it can be estimated as the median at each pixel location of all the pictures (or frames) in a sliding window close to the current picture (or current frame).
  • the picture 101 may be a still picture or may be one of a sequence of pictures, e.g. pictures of a video, as provided for instance by said surveillance video camera.
  • the CNN is configured to generate a plurality of feature maps 103 (indicated in FIG. 1 by dotted rectangles) of different resolution (indicated by the different sizes of the dotted rectangles) based on the input 101, 102.
  • the resolution of the feature maps 103 is gradually reduced, i.e. each further generated feature map 103 has a lower resolution than the previous one.
  • the CNN is configured to generate a plurality of activation maps 104 (indicated by dashed rectangles in FIG. 1) of different resolution (indicated by the different sizes of the dashed rectangles) based on the plurality of feature maps 103 of different resolution.
  • the resolution of the activation maps 104 is gradually increased, i.e. each further generated activation map 104 has a higher resolution than the previous one.
  • the CNN is finally configured to output a 1-channel probability map 105 having the same resolution as the picture 101.
  • Each pixel of the output 1-channel probability map 105 corresponds to a pixel of the picture 101 and indicates a probability that the corresponding pixel of the picture 101 is associated with a foreground object or with a background object.
  • the CNN may apply a sigmoid function to an activation map having 1 channel.
  • the device 100 may be further configured to threshold the output 1-channel probability map 105 to obtain a binary mask, wherein each pixel of the binary mask indicates whether the corresponding pixel of the picture 101 is associated with a foreground object or with a background object.
  • the probability per pixel of the probability map is compared with a probability threshold, and the pixel is, e.g., attributed to the background if its probability value is below the threshold, and to the foreground if its probability value is above the threshold.
  • the thresholding can also be done by another device receiving the 1-channel probability map 105 from the CNN.
  • binary masks of different resolution are beneficial, since they can be compared with the ground-truth data of different resolution.
  • In the inference stage of the CNN, only the binary mask calculated from the output 1-channel probability map may be used.
  • FIG. 2 shows a device 100 according to an embodiment of the invention, which builds on the device 100 shown in FIG. 1. Accordingly, the device 100 of FIG. 2 includes all elements of the device 100 of FIG. 1, wherein identical elements are labelled with the same reference signs and function likewise.
  • the CNN of the device 100 includes an encoder-decoder architecture, i.e. it includes an encoder 200 and a decoder 210.
  • the encoder 200 is configured to generate the plurality of feature maps 103, while the decoder 210 is configured to generate the plurality of activation maps 104 and the 1-channel probability map 105, respectively.
  • the encoder 200 comprises a plurality of consecutive encoder layers 201a, 201b, namely a first encoder layer 201a and at least one further encoder layer 201b.
  • the encoder 200 is configured to generate one of the plurality of feature maps 103 per encoder layer 201a, 201b, wherein each feature map 103 has a different resolution.
  • the decoder 210 comprises a plurality of consecutive decoder layers 211a, 211b, namely a first decoder layer 211a and at least one further decoder layer 211b.
  • the decoder 210 is configured to generate one of the plurality of activation maps 104 per decoder layer 211a, 211b, wherein each activation map 104 has a different resolution.
  • the last decoder layer 211b specifically generates the 1-channel probability map 105 from the activation map 104 it generates.
  • the CNN has further a plurality of skip connections 202.
  • Each skip connection 202 connects one of the further encoder layers 201b, which is configured to generate a feature map 103 of a certain resolution, with one of the decoder layers 211a and 211b, which is configured to generate an activation map 104 of a same resolution or at least of the most similar resolution.
  • Each decoder layer 211a, 211b is further configured to output a 1-channel activation map estimation 212.
  • These estimations 212 can be used by the device 100 to calculate a multi-resolution loss based on all the output 1-channel activation map estimations 212 of the decoder layers 211a, 211b, in order to recover spatial information lost during downsampling.
  • the CNN of the device 100 is further configured to upsample each 1-channel activation map estimation 212 and to then use it as input to the next decoder layer 211b.
  • the 1-channel probability map 105 output by the last decoder layer 211b may correspond to the 1-channel activation map estimation 212 output from that layer 211b.
  • the encoder 200 is now exemplarily described in more detail.
  • the encoder 200 may be defined as a sequence of convolution filtering interleaved with downsampling operations, to obtain feature maps 103 of the picture 101 and background model 102, respectively.
  • Each encoder layer 201a, 201b may contain:
  • a feature map 103 output from each layer 201a, 201b can be described as a set of activations, which represents semantic information for a region of the input 101, 102 at that layer's resolution.
  • the encoder 200 can be seen as a multi-resolution feature extractor.
  • Feature maps 103 output from encoder layers 201a, 201b can be used as input to the decoder 210 by means of the skip connections 202, in order to reconstruct activation maps 104 of the foreground moving objects at multiple scales at the decoder 210.
  • the encoder 200 may accept 6-channel input data (3 channels for the picture 101 and 3 channels for background model image 102) of original picture size, i.e. no picture resizing to a fixed size value is required.
  • This input 101, 102 will go through a sequence of 5 encoder layers 201a, 201b (1-5).
  • the number of channels in the feature maps 103 (indicated by the dashed rectangles) output from these layers 201a, 201b increases as the data goes deeper into the encoder 200 (e.g., in FIG. 3 the first encoder layer 201a outputs a feature map 103 with 64 channels, whereas the fifth encoder layer 201b outputs a feature map 103 with 512 channels).
  • the spatial resolution of the feature maps 103 decreases by a factor of 1/2 after each encoder layer 201a, 201b, reaching a downsampling factor of 1/32 at the end of the encoder 200.
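  • Under these assumptions (6-channel input, 5 encoder layers, channel widths growing from 64 to 512, each layer halving the spatial resolution for an overall factor of 1/32), a schematic PyTorch encoder might look like the sketch below; the exact filter counts, kernel sizes and non-linearities of the described network are not reproduced.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Multi-resolution feature extractor: 5 layers, each halving the spatial
    resolution (overall factor 1/32) while increasing the number of channels."""

    def __init__(self):
        super().__init__()
        widths = [6, 64, 128, 256, 512, 512]   # 6 input channels: RGB picture + RGB background model
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(widths[i], widths[i + 1], kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(5)
        )

    def forward(self, picture, background):
        x = torch.cat([picture, background], dim=1)   # (N, 6, H, W), no resizing of the picture
        feature_maps = []
        for layer in self.layers:
            x = layer(x)
            feature_maps.append(x)                    # resolutions H/2, H/4, ..., H/32
        return feature_maps                           # consumed by the decoder via skip connections
```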
  • the decoder 210 is now exemplarily described in more detail.
  • the decoder 210 may be defined as a sequence of convolution filtering interleaved with upsampling operations, to obtain activation maps 104 of foreground moving objects.
  • Each decoder layer 211a, 211b may contain:
  • Upsampling may be performed based on, e.g., a transposed convolution or unpooling operation.
  • An activation map 104 output of each decoder layer 211a, 211b may be described as an estimation of binary masks at the current decoder layer's resolution.
  • a multi-layer decoder 210 produces multi-resolution estimations of binary masks of foreground moving objects.
  • the decoder 210 receives multi-resolution feature maps 103 from the encoder 200, particularly by means of the skip connections 202 between encoder-decoder layers of similar resolution.
  • the previous decoder layer's activation map 104, concatenated with the corresponding encoder feature map 103, will go through a sequence of 4 decoder layers 211a, 211b (1-4).
  • the number of channels in the activation maps 104 output from these layers 211a, 211b decreases as the data gets nearer to the end of the decoder 210 (e.g., the first decoder layer 211a outputs an activation map 104 of 256 channels, whereas the fourth (last) decoder layer 211b generates an activation map 104 of 1 channel and outputs a probability map 105 having 1 channel).
  • the spatial resolution of the activation maps 104 increases by a factor of 2 after each decoder layer 211a, 211b, reaching 1/4 at the end of the decoder 210 and before a final bilinear interpolation with a resizing factor of 4 and a sigmoid function to obtain the 1-channel probability map 105.
  • Each decoder layer 211a, 211b may additionally have a mask-estimator that produces a 1-channel activation map estimation 212 at that layer's resolution.
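  • Based on the figures above (4 decoder layers, channel counts decreasing from 256 towards 1, upsampling by a factor of 2 per layer, skip connections to encoder feature maps of similar resolution, a 1-channel mask-estimator per layer, and a final bilinear resize plus sigmoid), a schematic decoder could be sketched as follows; it is an illustration under those assumptions rather than the exact architecture, and the final resize is done to the picture size so the sketch stays consistent with the Encoder sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """4 decoder layers: upsample, concatenate the skip-connected encoder feature
    map, convolve; each layer also emits a 1-channel activation map estimation."""

    def __init__(self):
        super().__init__()
        in_ch   = [512, 256, 128, 64]   # channels of the activation map entering each layer
        skip_ch = [512, 256, 128, 64]   # channels of the matching encoder feature maps
        out_ch  = [256, 128, 64, 1]     # the last layer yields a 1-channel activation map
        self.up = nn.ModuleList(nn.ConvTranspose2d(c, c, kernel_size=2, stride=2) for c in in_ch)
        self.conv = nn.ModuleList(
            nn.Conv2d(i + s, o, kernel_size=3, padding=1) for i, s, o in zip(in_ch, skip_ch, out_ch)
        )
        self.mask_estimators = nn.ModuleList(nn.Conv2d(o, 1, kernel_size=1) for o in out_ch)

    def forward(self, feature_maps, picture_size):
        # feature_maps: encoder outputs ordered from highest to lowest resolution (see Encoder sketch).
        x = feature_maps[-1]
        estimations = []
        for k in range(4):
            x = self.up[k](x)                                        # double the resolution
            skip = feature_maps[-(k + 2)]                            # encoder map of similar resolution
            x = F.relu(self.conv[k](torch.cat([x, skip], dim=1)))    # fuse and filter
            estimations.append(self.mask_estimators[k](x))           # 1-channel estimation per layer
        # Final bilinear resize and sigmoid give the full-resolution 1-channel probability map
        # (the description uses a fixed resizing factor of 4; resizing to the picture size is
        # used here to keep the sketch self-contained).
        probability_map = torch.sigmoid(
            F.interpolate(estimations[-1], size=picture_size, mode="bilinear", align_corners=False)
        )
        return probability_map, estimations
```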
  • the skip connections 202 are used to bring feature maps 103 from encoder layers 201a, 201b to the corresponding decoder layers 211a, 211b of same or similar resolution.
  • Multi-resolution loss may further be calculated and used to enforce the generation of activation maps 104 at multi-resolution.
  • the loss at each resolution is first computed as the binary cross-entropy between the estimation of the binary mask at that resolution and a downsampled version of the expected binary mask (ground truth).
  • The final loss used for training, i.e. for updating the values of the encoder's and decoder's convolution filters, is computed from the losses at all resolutions.
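  • A sketch of such a multi-resolution loss is given below, assuming the per-layer 1-channel activation map estimations are logits and that the final loss is the (unweighted) sum of the per-resolution binary cross-entropy terms; the exact combination used in the described training is an assumption here.

```python
import torch.nn.functional as F

def multi_resolution_loss(estimations, ground_truth_mask):
    """estimations: list of 1-channel activation map estimations (logits), one per
    decoder layer, at increasing resolutions.
    ground_truth_mask: full-resolution expected binary mask of shape (N, 1, H, W),
    with float values in {0, 1}.
    """
    total = 0.0
    for est in estimations:
        # Downsample the expected binary mask (ground truth) to this estimation's resolution.
        target = F.interpolate(ground_truth_mask, size=est.shape[-2:], mode="nearest")
        # Binary cross-entropy at this resolution (sigmoid folded into the loss for stability).
        total = total + F.binary_cross_entropy_with_logits(est, target)
    return total  # combination of the per-resolution losses (plain sum assumed here)
```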
  • Stochastic gradient descent supervised training may be used to determine the convolution filters, so that the device 100 can perform optimally on certain datasets. For each mini-batch of size k of the training dataset, a supervised update of the filter weights is performed (a generic sketch is given below).
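  • The enumerated per-mini-batch steps are not reproduced in this text; a generic sketch of one supervised stochastic gradient descent update, reusing the multi-resolution loss sketched above and assuming a combined encoder-decoder model, could look as follows.

```python
import torch

def train_step(model, optimizer, pictures, backgrounds, ground_truth_masks):
    """One supervised update on a mini-batch of size k.

    model: network returning (probability_map, estimations), e.g. the Encoder and
    Decoder sketches above wired together.
    pictures, backgrounds: (k, 3, H, W) tensors; ground_truth_masks: (k, 1, H, W).
    """
    model.train()
    optimizer.zero_grad()
    _, estimations = model(pictures, backgrounds)
    loss = multi_resolution_loss(estimations, ground_truth_masks)  # see the sketch above
    loss.backward()      # backpropagation through the decoder's and encoder's convolution filters
    optimizer.step()     # stochastic gradient descent update of the filter weights
    return loss.item()

# Illustrative optimizer setup (hyperparameter values are assumptions):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```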
  • FIG. 5 shows a method 500 according to an embodiment of the invention.
  • the method 500 employs a CNN and is used for separating a picture 101 into foreground and background.
  • the method 500 may be carried out by the device 100 shown in FIG. 1 or FIG. 2.
  • the method 500 comprises a step 501 of receiving as an input 101, 102 the picture 101 and a background model image 102. Further, a step 502 of generating a plurality of feature maps 103 of different resolution based on the input 101, 102, wherein the resolution of feature maps 103 is gradually reduced. Further, a step 503 of generating a plurality of activation maps 104 of different resolution based on the plurality of feature maps 103 of different resolution, wherein the resolution of activation maps 104 is gradually increased. Further, a step 504 of outputting a 1 -channel probability map 105 having the same resolution as the picture 101. Each pixel of the output 1 -channel probability map 105 corresponds to a pixel of the picture 101 and indicates a probability that the corresponding pixel of the picture 101 is associated with a foreground object or with a background object.
  • FIG. 6 shows example foreground object extraction results from a surveillance video frame (original picture shown top-left of FIG. 6) using the device 100 of the present invention (BackgroundNet, shown top-right of FIG. 6) and two implementations of conventional background subtraction techniques (CNT and MOG2, shown bottom-left and bottom-right, respectively, in FIG. 6). It can be seen that BackgroundNet provides segmentation results with much less noise and fewer disconnected regions.
  • Embodiments of the invention may be implemented in hardware, software or any combination thereof.
  • Embodiments of the invention, e.g. the device and/or the hardware implementation, may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, etc., or any combinations thereof.
  • Embodiments may comprise computer program products comprising program code for performing, when implemented on a processor, any of the methods described herein.
  • Further embodiments may comprise at least one memory and at least one processor, which are configured to store and execute program code to perform any of the methods described herein.
  • embodiments may comprise a device configured to store instructions for software in a suitable, non-transitory computer-readable storage medium and to execute the instructions in hardware using one or more processors to perform any of the methods described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention relate to the field of separating pictures, in particular pictures of a surveillance video, into foreground and background. The invention relates in particular to a device and a method employing a convolutional neural network (CNN), i.e. based on deep learning. The CNN is configured to receive as input the picture and a background model image. The CNN is configured to generate feature maps of different resolution based on the input, the resolution of the feature maps being gradually reduced. Based on the feature maps, the CNN is configured to generate activation maps of different resolution, the resolution of the activation maps being gradually increased. Furthermore, the CNN is configured to output a 1-channel probability map having the same resolution as the picture, each pixel of the output 1-channel probability map corresponding to a pixel of the picture and indicating a probability that the corresponding pixel of the picture is associated with a foreground object or with a background object.
PCT/EP2018/073381 2018-08-30 2018-08-30 Device and method for separating a picture into foreground and background using deep learning WO2020043296A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2018/073381 WO2020043296A1 (fr) 2018-08-30 2018-08-30 Device and method for separating a picture into foreground and background using deep learning
CN201880097060.6A CN112639830A (zh) 2018-08-30 2018-08-30 Device and method for separating a picture into foreground and background using deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/073381 WO2020043296A1 (fr) 2018-08-30 2018-08-30 Device and method for separating a picture into foreground and background using deep learning

Publications (1)

Publication Number Publication Date
WO2020043296A1 (fr)

Family

ID=63491609

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/073381 WO2020043296A1 (fr) 2018-08-30 2018-08-30 Device and method for separating a picture into foreground and background using deep learning

Country Status (2)

Country Link
CN (1) CN112639830A (fr)
WO (1) WO2020043296A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539403A (zh) * 2020-07-13 2020-08-14 航天宏图信息技术股份有限公司 Agricultural greenhouse identification method and apparatus, and electronic device
CN111582449A (zh) * 2020-05-07 2020-08-25 广州视源电子科技股份有限公司 Training method, apparatus and device for a target domain detection network, and storage medium
WO2021183215A1 (fr) * 2020-03-13 2021-09-16 Microsoft Technology Licensing, Llc System and method for improving convolutional neural network-based machine learning models
US11445198B2 (en) * 2020-09-29 2022-09-13 Tencent America LLC Multi-quality video super resolution with micro-structured masks
CN116206114A (zh) * 2023-04-28 2023-06-02 成都云栈科技有限公司 Method and apparatus for portrait extraction against a complex background

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154518A (zh) * 2017-12-11 2018-06-12 广州华多网络科技有限公司 Image processing method and apparatus, storage medium and electronic device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108010034A (zh) * 2016-11-02 2018-05-08 广州图普网络科技有限公司 Commodity image segmentation method and apparatus
CN106651886A (zh) * 2017-01-03 2017-05-10 北京工业大学 Cloud image segmentation method based on superpixel clustering-optimized CNN
CN108416751A (zh) * 2018-03-08 2018-08-17 深圳市唯特视科技有限公司 Novel-view image synthesis method based on a depth-assisted full-resolution network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154518A (zh) * 2017-12-11 2018-06-12 广州华多网络科技有限公司 Image processing method and apparatus, storage medium and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIM KYUNGSUN ET AL: "Background subtraction using encoder-decoder structured convolutional neural network", 2017 14TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS), IEEE, 29 August 2017 (2017-08-29), pages 1 - 6, XP033233444, DOI: 10.1109/AVSS.2017.8078547 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021183215A1 (fr) * 2020-03-13 2021-09-16 Microsoft Technology Licensing, Llc System and method for improving convolutional neural network-based machine learning models
US11436491B2 (en) 2020-03-13 2022-09-06 Microsoft Technology Licensing, Llc System and method for improving convolutional neural network-based machine learning models
CN111582449A (zh) * 2020-05-07 2020-08-25 广州视源电子科技股份有限公司 Training method, apparatus and device for a target domain detection network, and storage medium
CN111539403A (zh) * 2020-07-13 2020-08-14 航天宏图信息技术股份有限公司 Agricultural greenhouse identification method and apparatus, and electronic device
US11445198B2 (en) * 2020-09-29 2022-09-13 Tencent America LLC Multi-quality video super resolution with micro-structured masks
CN116206114A (zh) * 2023-04-28 2023-06-02 成都云栈科技有限公司 Method and apparatus for portrait extraction against a complex background
CN116206114B (zh) * 2023-04-28 2023-08-01 成都云栈科技有限公司 Method and apparatus for portrait extraction against a complex background

Also Published As

Publication number Publication date
CN112639830A (zh) 2021-04-09

Similar Documents

Publication Publication Date Title
CN108256562B (zh) Salient object detection method and system based on weakly supervised spatio-temporal cascaded neural networks
WO2020043296A1 (fr) Device and method for separating a picture into foreground and background using deep learning
US20210326650A1 (en) Device for generating prediction image on basis of generator including concentration layer, and control method therefor
US11393100B2 (en) Automatically generating a trimap segmentation for a digital image by utilizing a trimap generation neural network
US11651477B2 (en) Generating an image mask for a digital image by utilizing a multi-branch masking pipeline with neural networks
CN111968064B (zh) Image processing method and apparatus, electronic device and storage medium
CN111914654B (zh) Text layout analysis method, apparatus, device and medium
Ma et al. Fusioncount: Efficient crowd counting via multiscale feature fusion
CN111079613B (zh) Posture recognition method and apparatus, electronic device and storage medium
CN111932480A (zh) Deblurring video restoration method, apparatus, terminal device and storage medium
Onim et al. Blpnet: A new dnn model and bengali ocr engine for automatic licence plate recognition
CN116129291A (zh) Image target recognition method and apparatus for UAV-based livestock farming
Patil et al. Multi-frame recurrent adversarial network for moving object segmentation
CN116994000A (zh) Part edge feature extraction method and apparatus, electronic device and storage medium
Angelo A novel approach on object detection and tracking using adaptive background subtraction method
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
Jameel et al. Gait recognition based on deep learning
Huang et al. A content-adaptive resizing framework for boosting computation speed of background modeling methods
WO2023155305A1 (fr) Image reconstruction method and apparatus, electronic device, and storage medium
US20230135978A1 (en) Generating alpha mattes for digital images utilizing a transformer-based encoder-decoder
US20230342986A1 (en) Autoencoder-based segmentation mask generation in an alpha channel
Lim et al. LAU-Net: A low light image enhancer with attention and resizing mechanisms
CN115272906A (zh) Point rendering-based video background portrait segmentation model and algorithm
Haris et al. An efficient super resolution based on image dimensionality reduction using accumulative intensity gradient
Yin et al. Deep motion boundary detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18765407

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18765407

Country of ref document: EP

Kind code of ref document: A1