CN107624189B - Method and apparatus for generating a predictive model

Publication number: CN107624189B
Authority: CN (China)
Prior art keywords: image, training, neural network, convolutional neural, density distribution
Legal status: Active (assumed, not a legal conclusion)
Application number: CN201580080145.XA
Other languages: Chinese (zh)
Other versions: CN107624189A
Inventors: 王晓刚 (Xiaogang Wang), 张聪 (Cong Zhang), 李鸿升 (Hongsheng Li)
Assignee (current and original): Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd
Publication of application CN107624189A and grant CN107624189B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

A method for generating a prediction model to predict crowd density distribution and people count in an image frame is disclosed, comprising: training a convolutional neural network (CNN) by inputting one or more crowd image patches from frames in a training set, each crowd image patch having a predetermined ground-truth density distribution and people count; sampling frames from a target scene image set and receiving training images, with their determined ground-truth density distributions and counts, from the training set; acquiring similar image data from the received training frames for each sampled target image frame to overcome the scene gap between the target scene image set and the training images; and fine-tuning the CNN by inputting the similar image data to determine a prediction model for predicting the crowd density map and people count in an image frame.

Description

Method and apparatus for generating a predictive model
Technical Field
The present application relates to a device and method for generating a prediction model to predict crowd density distribution and people count in an image frame.
Background
Counting pedestrians in video is in strong demand for video surveillance and has therefore attracted much attention. People counting is a challenging task due to severe occlusion, perspective distortion of the scene, and the diversity of crowd distributions. Since pedestrian detection and tracking are difficult in crowded scenes, most state-of-the-art methods are based on regression, where the goal is to learn a mapping between low-level features and crowd counts. However, these works are scene-specific, i.e. a crowd count model learned for a particular scene can only be applied to that same scene. The model must be retrained with new annotations to account for unseen scenes or changed scene layouts.
There have been many efforts to count pedestrians by detection or trajectory clustering. For the people counting problem, however, these methods are limited by severe occlusion between people. Many methods instead predict the global count using regressors trained with low-level features. These methods are more suitable for crowded environments and are more computationally efficient.
Counting by global regression ignores the spatial information of the pedestrians. Lempitsky et al. presented an object counting method using pixel-level object density map regression. Following this work, Fiaschi et al. used random forests to regress the object density and improve training efficiency. Besides taking spatial information into account, another advantage of density-regression-based approaches is that they can estimate the count of objects in any region of the image. Building on this advantage, an interactive object counting system was introduced that visualizes region counts to help the user efficiently provide relevance feedback. Rodriguez et al. used density map estimation to improve human head detection results. These methods are scene-specific and are not applicable to cross-scene counting.
Much work has adopted deep learning for various surveillance applications, such as person re-identification, pedestrian detection, tracking, crowd behavior analysis, and crowd segmentation. Their success benefits from the discriminative power of deep models. Sermanet et al. showed that, for many applications, features extracted from deep models are more effective than hand-crafted features. However, a deep model for crowd counting had not been developed.
With many large-scale and well-labeled datasets now publicly available, non-parametric, data-driven approaches have been proposed. Such methods scale up easily because they require no training. They transfer labels from training images to test images by retrieving the most similar training images and matching them to the test images. Liu et al. proposed a non-parametric image analysis method that seeks a dense deformation field between images.
Disclosure of Invention
The present disclosure addresses the problem of crowd density and count estimation, with the goal of automatically estimating the density map and/or the people count in a given surveillance video frame.
The present application provides a cross-scene density and count estimation system. Even when the target scene is absent from the training set, the system can still estimate the density map and people count of that scene.
In one aspect, an apparatus for generating a prediction model to predict a crowd density map and count is disclosed, comprising a density map creation unit, a CNN generation unit, a similar data acquisition unit, and a model fine-tuning unit. The density map creation unit is configured to approximate a perspective map of each training scene from a training set (with pedestrian head labels indicating the head position of each person in a region of interest (ROI)) and to create ground-truth density maps and counts for the training set based on the labels and the perspective map. The density map represents the crowd distribution in each frame, and the integral of the density map is equal to the total number of pedestrians. The CNN generation unit is configured to construct and initialize a crowd convolutional neural network and to train it by inputting crowd image patches sampled from the training set together with the corresponding ground-truth density maps and counts. The similar data acquisition unit is configured to: receive sample frames from the target scene and samples from the training set with the ground-truth density maps and counts created by the CNN generation unit; and acquire similar data from the training set for each target scene to overcome the scene gap. The model fine-tuning unit is configured to receive the acquired similar data and construct a second CNN initialized from the trained first CNN, and is further configured to fine-tune the initialized second CNN with the similar data so that the second CNN can predict the density map and pedestrian count in a region of interest of a video frame to be detected.
In an aspect of the application, a method for generating a prediction model to predict a population density distribution and a person count in an image frame is disclosed, the method comprising:
training a CNN by inputting one or more crowd image patches of a frame in a training set, each crowd image patch having a predetermined true density distribution and people count in the inputted crowd image patches;
sampling frames from a set of target scene images and receiving training images having the determined truth density distribution and counts from the training set;
acquiring similar image data from the received training frames for each sampled target image frame to overcome a scene gap between the target scene image set and the training images; and
fine-tuning the CNN by inputting the similar image data to the CNN to determine a prediction model for predicting the crowd density map and people count in an image frame.
In a further aspect of the application, an apparatus for generating a prediction model to predict a population density distribution and a person count in an image frame is disclosed, the apparatus comprising:
a CNN training unit that trains a CNN by inputting one or more crowd image patches of frames in a training set, each crowd image patch having a predetermined truth density distribution and people count;
a similar data acquisition unit that samples frames from the target scene image set and receives training images having the determined truth density distribution and counts from the training set, and acquires similar image data from the received training frames for each sampled target image frame to overcome a scene gap between the target scene image set and the training images; and
a model fine-tuning unit that fine-tunes the CNN by inputting the similar image data to the CNN to determine a prediction model for predicting the crowd density map and people count in an image frame.
In a further aspect of the application, a system for generating a prediction model to predict a population density distribution and a person count in an image frame is disclosed, the system comprising:
a memory storing executable components; and
a processor electrically coupled to the memory to execute executable components to perform operations of the system, wherein the executable components comprise:
a CNN training component for training a CNN by inputting one or more crowd image patches of frames in a training set, each of the crowd image patches having a predetermined truth density distribution and people count;
a similar data acquisition component that samples frames from the target scene image set, receives training images having the determined truth density distribution and counts from the training set, and acquires similar image data from the received training frames for each sampled target image frame to overcome a scene gap between the target scene image set and the training images; and
a model fine-tuning component that fine-tunes the CNN by inputting the similar image data to the CNN to determine a prediction model for predicting the crowd density map and people count in an image frame.
The claimed solution provides at least one of the following advantages:
a multitasking system-which can estimate crowd density maps and counts together. The number of counts can be calculated by integration of the density map. Two related tasks may also help each other to get a better solution for our training model.
Cross-scene capability - the target scene requires no additional pedestrian labels in the framework for cross-scene counting.
No crowd segmentation required - the system does not rely on crowd foreground segmentation as pre-processing. The crowd texture is captured by the model whether or not the crowd is moving, and the system can obtain reasonable estimation results.
The following description and the annexed drawings set forth certain illustrative aspects of the disclosure. These aspects are indicative, however, of but a few of the various ways in which the principles of the disclosure may be employed. Other aspects of the disclosure will become apparent from the following detailed description of the disclosure when considered in conjunction with the accompanying drawings.
Drawings
Illustrative, non-limiting embodiments of the invention are described below with reference to the accompanying drawings. The figures are illustrative and are generally not drawn to exact scale. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
Fig. 1 is a schematic diagram illustrating a block diagram of an apparatus 1000 for generating a predictive model to predict crowd density maps and counts according to one embodiment of the present application.
Fig. 2 is a schematic diagram illustrating the flow of the device 1000 generating a prediction model to predict crowd density distribution and people count in an image frame according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating a flow process of the density map creating unit 10 according to an embodiment of the present application.
Fig. 4 is a diagram illustrating a flow process of a CNN training unit according to an embodiment of the present application.
Fig. 5 is a schematic diagram illustrating an overview of the crowd CNN model with switchable objectives according to one embodiment of the present application.
Fig. 6 is a schematic diagram illustrating a flow of similar data acquisition according to another embodiment of the present application.
FIG. 7 is a schematic diagram illustrating a system for generating a predictive model in which the functionality of the present invention is implemented in software, according to one embodiment of the present application.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover all alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Fig. 1 is a schematic diagram illustrating a block diagram of an apparatus 1000 for generating a predictive model to predict crowd density maps and counts according to one embodiment of the present application. As shown, the apparatus 1000 may include a density map creation unit 10, a CNN generation unit 20, a similar data acquisition unit 30, and a model fine-tuning unit 40.
Fig. 2 is a general schematic diagram illustrating a flow process 2000 of an apparatus 1000 according to one embodiment of the present application. In step s201, the truth density map creation unit 10 operates to select image patches from one or more training image frames in the training set and to determine a true crowd distribution in the selected image patches and a true total pedestrian count in the selected image patches. At step s202, the CNN training unit 20 operates to train the CNN by inputting one or more crowd image patches of frames in the training set, wherein each crowd image patch has a predetermined truth density distribution and person count among the input crowd image patches. In step s203, the similar data obtaining unit 30 operates to sample frames from the target scene image set, and receive training images having the determined distribution of the truth density and the count/number from the training set, and obtain similar image data from the received training frames for each sampled target image frame, so as to overcome the scene gap between the target scene image set and the training images. At step s204, the model fine-tuning unit 40 operates to fine-tune the CNN by inputting similar image data to the CNN to determine a prediction model for predicting the crowd density map and the people count in the image frame. The cooperation of the density map creation unit 10, the CNN generation unit 20, the similar data acquisition unit 30, and the model fine-tuning unit 40 will be discussed in detail below.
1) Density map creation unit 10
The initial input to the apparatus 1000 (i.e. to the density map creation unit 10) is a training set containing a certain number of video frames captured from various surveillance cameras, with pedestrian head labels. The density map creation unit 10 operates to output a density map and a count for each video frame based on the input training set.
Fig. 3 is a schematic diagram illustrating the flow process of the density map creation unit 10 according to an embodiment of the present application. In step s301, the density map creation unit 10 operates to approximate the perspective map/distribution of each training scene/frame from the training set. Pedestrian heads are labeled to indicate the head position of each person in the region of interest of each training frame. Given the head positions, the spatial position and body shape of each pedestrian can be located in each frame. At step s302, a truth density map/distribution is created based on the spatial positions of the pedestrians, their body shapes, and the perspective deformation of the image, to determine the true pedestrian/crowd density in each frame and to estimate the people count of the crowd in each frame of the training set. Specifically, the truth density map/distribution represents the crowd distribution in each frame, and the integral of the density map/distribution is equal to the total number of pedestrians.
In particular, the main goal of the crowd CNN model to be discussed later is to learn the mapping F: X → D, where X is the set of low-level features extracted from a training image and D is the crowd density map/distribution of the image. Assuming the position of each pedestrian is labeled, the density map/distribution is created based on the spatial positions of the pedestrians, their body shapes, and the perspective deformation of the image. Image patches randomly selected from the training images are treated as training samples, and the density maps/distributions of the corresponding image patches are treated as the ground truth for the crowd CNN model, which is discussed further later. As an auxiliary target, the total crowd count in a selected training image patch is calculated by integrating the density map/distribution. It should be noted that the total count will be a fractional number rather than an integer.
In the prior art, the density map regression ground truth has been defined as a sum of Gaussian kernels centered on the locations of objects. Such a density map/distribution is suitable for characterizing the density distribution of circular objects such as cells and bacteria. However, this assumption may fail for pedestrian crowds, where the camera is generally not in a bird's-eye view. Pedestrians in typical surveillance video have three distinct characteristics: 1) pedestrian images in surveillance video have different scales due to perspective deformation; 2) the shape of a pedestrian is closer to an ellipse than a circle; 3) because occlusion is severe, the head and shoulders are the important cues for determining whether a pedestrian is present at a given location, while the rest of the body is unreliable for marking a person. In view of these properties, the crowd density map/distribution is created as a combination of several distributions with perspective normalization.
Perspective normalization is necessary to estimate the pedestrian scale. For each scene, several adult pedestrians are randomly selected and labeled from head to foot. Assuming the average height of an adult is 175 cm (for example), the perspective map M can be approximated by linear regression. The pixel value in the perspective map, M(p), represents the number of pixels in the image that correspond to a fixed distance (e.g. 1 meter) at that position in the actual scene. If a pedestrian is labeled with height H pixels, the perspective value at the center position of the pedestrian is M(p) = H/1.75; the perspective values are then linearly interpolated in the vertical and horizontal directions, respectively, to obtain the entire perspective map. After the perspective map/distribution and the center position Ph of each pedestrian's head in the region of interest (ROI) are obtained, the crowd density map/distribution is created according to the following rule:
D(p) = Σ_{P∈Ph} (1/‖Z‖) [ N_h(p; Ph, σ_h) + N_b(p; Pb, Σ) ]    (1)
the population density distribution kernel contains two terms: a normalized 2-dimensional gaussian kernel Nh as the head part and a bivariate normal distribution Nb as the body part. Here, Pb is the position of the pedestrian's body, which is estimated from the head position and the perspective value. To optimally represent the pedestrian contour, the variance is set to
Figure BDA0001472847700000073
Figure BDA0001472847700000074
(for the term Nh, and Nx ═ 0.2M (p)),
Figure BDA0001472847700000072
(for the term Nb). To ensure that the integral of all density values in the density map/distribution is equal to the total population in the original image, the overall distribution is normalized by Z.
In short, for each person with a labeled head position, a body-shape density kernel (hereinafter "kernel") is determined as described in equation (1). The kernels of all labeled persons are combined (summed) to form the truth density map/distribution of each frame. The larger the values at locations in the truth density map/distribution, the higher the crowd density at those locations. In addition, since each normalized kernel integrates to 1, the people count of the crowd equals the sum of all kernel values in the truth density map/distribution.
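The ground-truth construction above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the patent's reference implementation: the body-center offset (half the per-meter scale below the head) and the NumPy-based helpers are assumptions for demonstration.

```python
import numpy as np

def fit_perspective_map(label_ys, label_heights_px, image_height, mean_height_m=1.75):
    """Fit the perspective map M(y) (pixels per meter) by linear regression
    from a few labeled adult pedestrians, as described above."""
    scales = np.asarray(label_heights_px) / mean_height_m    # samples of M(p) = H / 1.75
    a, b = np.polyfit(np.asarray(label_ys), scales, deg=1)   # linear along the y-axis
    return a * np.arange(image_height) + b                   # one scale value per row

def _gaussian2d(shape, cx, cy, sx, sy):
    """Normalized 2-D Gaussian evaluated on the image grid."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))
    return g / g.sum()

def make_gt_density(shape, head_positions, M_rows):
    """Ground-truth density map of equation (1): per labeled person, a head
    Gaussian N_h plus a body Gaussian N_b, jointly normalized so that each
    person contributes exactly 1 to the integral of the map."""
    D = np.zeros(shape)
    for hx, hy in head_positions:
        m = M_rows[int(hy)]                                  # pixels per meter at the head
        head = _gaussian2d(shape, hx, hy, 0.2 * m, 0.2 * m)  # sigma_h = 0.2 M(p)
        body_cy = hy + 0.5 * m                               # assumed body-center offset
        body = _gaussian2d(shape, hx, body_cy, 0.2 * m, 0.5 * m)
        kernel = head + body
        D += kernel / kernel.sum()                           # the ||Z|| normalization
    return D  # total count = D.sum(); a region's count is its integral over D
```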
2) CNN generation unit 20
The CNN generation unit 20 is configured to construct and initialize a first crowd Convolutional Neural Network (CNN). The generation unit 20 operates to acquire/sample crowd image patches from frames in the training set and to obtain the corresponding truth density maps and people counts of the sampled crowd image patches (as determined by unit 10). The generation unit 20 then inputs the crowd image patches sampled from the training set, together with their corresponding truth density maps/distributions and crowd counts as training targets, into the CNN to train it.
Fig. 4 is a schematic diagram illustrating a flow diagram of a process 4000 for generating and training a CNN according to one embodiment of the present application.
As shown, in step s401 the process 4000 samples one or more crowd image patches from frames in the training set and obtains the corresponding truth density maps and people counts of the sampled crowd image patches. The input consists of image patches cropped from the training images. To capture pedestrians at a similar scale, the size of each image patch at a given location is selected according to the perspective value of its central pixel. In an example, each image patch may be set to cover 3×3 square meters in the actual scene. The image patch is then warped to 72×72 pixels (for example) as input to the crowd CNN model generated in step s402.
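A minimal sketch of this perspective-aware cropping, assuming the row-wise perspective map from the earlier sketch and using OpenCV's resize as the warping step:

```python
import cv2
import numpy as np

def crop_crowd_patch(frame, M_rows, center, meters=3.0, out_size=72):
    """Crop a patch covering roughly `meters` x `meters` square meters around
    `center`, sized by the perspective value of the central pixel, and warp it
    to out_size x out_size pixels for the crowd CNN."""
    x, y = center
    half = int(M_rows[y] * meters / 2)       # pixels spanning meters/2 at this row
    y0, x0 = max(y - half, 0), max(x - half, 0)
    patch = frame[y0:y + half, x0:x + half]
    return cv2.resize(patch, (out_size, out_size))
```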
At step s402, the process 4000 randomly initializes the crowd convolutional neural network based on a gaussian random distribution. An overview of the population CNN model with switchable goals is shown in fig. 5.
As shown, the crowd CNN model 500 contains three convolutional layers (conv1 to conv3) and three fully-connected layers (fc4, fc5, and fc6 or fc7). conv1 has 32 filters of size 7×7×3, conv2 has 32 filters of size 7×7×32, and the last convolutional layer has 64 filters of size 5×5×32. After conv1 and conv2, max pooling layers with a 2×2 kernel size are used. A rectified linear unit (ReLU), not shown in fig. 5, is the activation function applied after each convolutional layer and fully-connected layer. It should be understood that the numbers of filters and layers are described herein as examples only for purposes of illustration; the application is not limited to these particular numbers, and other numbers would be acceptable.
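The described architecture can be sketched in PyTorch as below. The fully-connected hidden widths (512) and the 'same' convolution padding are assumptions not fixed by the text; fc6 and fc7 are the two heads that output the 18×18 density map and the scalar count discussed in the following paragraphs.

```python
import torch
import torch.nn as nn

class CrowdCNN(nn.Module):
    """Sketch of the crowd CNN model 500: conv1 (32 filters, 7x7x3),
    conv2 (32 filters, 7x7x32), conv3 (64 filters, 5x5x32), ReLU after each
    layer, 2x2 max pooling after conv1 and conv2, then fc4/fc5 and two heads."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),                          # 72x72 -> 36x36
            nn.Conv2d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),                          # 36x36 -> 18x18
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.fc4 = nn.Sequential(nn.Linear(64 * 18 * 18, 512), nn.ReLU())  # width assumed
        self.fc5 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())           # width assumed
        self.fc6 = nn.Linear(512, 18 * 18)            # density-map head
        self.fc7 = nn.Linear(512, 1)                  # count head

    def forward(self, x):                             # x: (N, 3, 72, 72)
        h = self.fc5(self.fc4(self.features(x).flatten(1)))
        density = self.fc6(h).view(-1, 18, 18)        # downsampled density map
        count = self.fc7(h).squeeze(1)                # global count
        return density, count
```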
At step s403, the process 4000 learns the mapping from crowd image patches to density maps/distributions, for example by using mini-batch gradient descent and backpropagation, until the estimated density map/distribution converges to the truth density map/distribution created by the truth density map creation unit 10. At step s404, the process 4000 switches the objective and learns the mapping from crowd image patches to counts until the learned counts converge to the counts estimated by the truth density map creation unit 10. At step s405, it is determined whether the estimated density map/distribution and count have converged to the truth values; if not, steps s403 to s405 are repeated. Hereinafter, steps s403 to s405 are discussed in detail.
In an embodiment of the present application, an iterative switching process is introduced in the crowd CNN model 500 to alternately optimize the density map/distribution estimation task and the count estimation task. The main task of the crowd CNN model 500 is to estimate the crowd density map/distribution of an input image patch. In the embodiment shown in fig. 5, since there are two pooling layers in the CNN model 500, the output density map/distribution is downsampled to 18×18. Thus, the truth density map/distribution is also downsampled to 18×18. Since the density map/distribution contains rich local detailed information, the CNN model 500 can benefit from learning to predict it and can obtain a better representation of the crowd image patches. Regressing the total count of the input image patch, calculated by integrating the density map patch, is treated as the second task. The two tasks alternately help each other and reach a better solution. Two loss functions are defined according to the following rules:
L_D(Θ) = (1/(2N)) Σ_{i=1}^{N} ‖F_d(X_i; Θ) − D_i‖²    (2)

L_Y(Θ) = (1/(2N)) Σ_{i=1}^{N} ‖F_y(X_i; Θ) − Y_i‖²    (3)

where Θ is the set of parameters of the CNN model and N is the number of training samples. L_D is the loss between the estimated density map F_d(X_i; Θ) (the output of fc6) and the truth density map D_i. Similarly, L_Y is the loss between the estimated people count F_y(X_i; Θ) (the output of fc7) and the truth count Y_i. Euclidean distance is used in both objective losses. Mini-batch gradient descent and backpropagation are used to minimize the losses.
The switchable training procedure is outlined in Algorithm 1. L_D is set as the first objective loss to be minimized, because density map/distribution estimation requires the model 500 to learn a general representation of the crowd: the density map/distribution introduces more spatial information into the CNN model. After the first objective converges, the model 500 switches to minimizing the objective of global count regression. Count regression is an easier task and is learned faster than density map/distribution regression. It should be noted that the two objective losses should be normalized to a similar or identical scale; otherwise, the objective with the larger scale would dominate the training process. In one embodiment of the present application, the scaling weight for the density loss may be set to 10 and the scaling weight for the count loss to 1. Training converges after approximately six switching iterations. The proposed switching learning method can achieve better performance than the widely used multi-task learning method.
[Algorithm 1: switchable training scheme, alternately minimizing the density loss L_D and the count loss L_Y until both objectives converge.]
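A hedged sketch of this switchable loop, continuing the PyTorch-style model above; the SGD optimizer, learning rate, and fixed inner step budget are assumptions (the text switches on convergence of each objective rather than after a fixed number of steps):

```python
import torch

def switchable_train(model, loader, switch_iters=6, inner_steps=1000,
                     w_density=10.0, w_count=1.0, lr=1e-4):
    """Alternately minimize the scaled density loss L_D and count loss L_Y
    (equations (2) and (3)), starting with L_D as described above.
    `loader` yields (patches, gt_density_18x18, gt_count) batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    mse = torch.nn.MSELoss()
    for it in range(switch_iters):
        use_density = (it % 2 == 0)                   # L_D first, then switch
        for _, (patches, gt_density, gt_count) in zip(range(inner_steps), loader):
            pred_density, pred_count = model(patches)
            if use_density:
                loss = w_density * mse(pred_density, gt_density)  # scaled L_D
            else:
                loss = w_count * mse(pred_count, gt_count)        # scaled L_Y
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The same loop, initialized from the pre-trained weights and fed with the retrieved similar patches described below, also serves as the fine-tuning step of the model fine-tuning unit 40.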
3) Similar data acquisition unit 30
The similar data acquisition unit 30 is configured to: receive sample frames from the target scene and samples from the training set, with the truth density maps/distributions and counts created by unit 10; and then acquire similar data from the training set for each target scene to overcome the scene gap.
The crowd CNN model 500 is pre-trained on all training scene data through the proposed switchable learning process. However, each queried crowd scene has its unique scene properties, such as different perspectives, scales, and density distributions. These properties significantly change the appearance of crowd image patches and affect the performance of the crowd CNN model 500. To bridge the distribution gap between the training scenes and the test scene, a non-parametric fine-tuning scheme is designed to adapt the pre-trained CNN model 500 to the unseen target scene.
Given a target video from an unseen scene, samples with similar properties are taken from the training frames and added to the training data to fine-tune the crowd CNN model 500. The acquisition task consists of two steps: candidate scene acquisition and local image patch acquisition.
Candidate scene acquisition (step 601). The perspective and scale of a scene are the main factors affecting the appearance of the crowd. The perspective map/distribution indicates both viewing angle and scale. To overcome the scale gap between different scenes, each input image patch is normalized to the same scale, covering 3×3 square meters (for example) in the actual scene according to the perspective map/distribution. Thus, the first step of the non-parametric fine-tuning method focuses on retrieving, from all training scenes, those whose perspective map/distribution is similar to that of the target scene. The retrieved scenes are referred to as candidate scenes. A perspective descriptor is designed to represent the perspective of each scene. Since the perspective map/distribution is fitted linearly along the y-axis, its vertical gradient ΔM_y = M(y) − M(y−1) can be used as the perspective descriptor. Based on this descriptor, the top (e.g., 20) scenes with similar perspective are retrieved from the entire training dataset for the unseen scene. Images in the retrieved scenes are treated as candidate scenes for local image patch acquisition.
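A minimal sketch of this retrieval step, assuming the row-wise perspective maps from the earlier sketch; the Euclidean distance between descriptors and the common descriptor length are assumptions, since the text does not fix the similarity metric:

```python
import numpy as np

def perspective_descriptor(M_rows):
    """Vertical gradient of the row-wise perspective map: dM_y = M(y) - M(y-1)."""
    return np.diff(np.asarray(M_rows))

def candidate_scenes(target_M_rows, train_M_rows_list, k=20):
    """Indices of the k training scenes whose perspective descriptors are
    closest to the target scene's descriptor."""
    d_target = perspective_descriptor(target_M_rows)
    dists = [np.linalg.norm(d_target - perspective_descriptor(m))
             for m in train_M_rows_list]
    return list(np.argsort(dists)[:k])
```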
Local image patch acquisition (step 602). The second step is to select, from the candidate scenes, image patches whose density distribution is similar to that of the test scene. Besides viewing angle and scale, the crowd density distribution also affects the appearance pattern of the crowd. In denser crowds occlusion is more severe, and only heads and shoulders can be observed; in contrast, pedestrians in sparse crowds appear with full body shapes. Thus, the similar data acquisition unit 30 is configured to predict the density distribution of the target scene and to acquire, from the candidate scenes, similar image patches that match the predicted target density distribution. For example, for a high-density crowd scene, denser image patches should be acquired to fine-tune the pre-trained model to fit the target scene.
Using the pre-trained CNN model 500 trained in unit 20, the density and total count of each image patch of the target image can be roughly predicted. Image patches with similar density maps/distributions are assumed to produce similar outputs through the pre-trained model 500. Based on the prediction results, a histogram of the density distribution of the target scene is calculated. Each bin is calculated according to the following rule:
[Equation (4): the bin-assignment rule mapping the estimated count y_i of sample i to one of six histogram bins; counts above 20 fall into the 6th bin.]
where y_i is the integrated count of the estimated density map/distribution of sample i.
Since there are rarely scenes where more than 20 pedestrians stand within 3×3 square meters, when y_i > 20 the image patch is allocated to the 6th bin (i.e., c_i = 6). The density distribution of the target scene can thus be obtained from equation (4). Image patches are then randomly selected from the retrieved training scenes, and the numbers of image patches of different densities are controlled to match the density distribution of the target scene. In this way, the proposed fine-tuning method obtains image patches with similar viewing angle, scale, and density distribution.
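A sketch of this histogram matching; the uniform bin width of 4 is an assumption for illustration (the text fixes only six bins, with counts above 20 falling in the last one):

```python
import numpy as np

def density_bin(y_count):
    """Map a patch's estimated count y_i to one of six bins; c_i = 6 when y_i > 20."""
    return 6 if y_count > 20 else int(np.searchsorted([4, 8, 12, 16, 20], y_count) + 1)

def target_density_histogram(pred_counts):
    """Normalized histogram of the target scene's predicted patch counts."""
    bins = np.array([density_bin(y) for y in pred_counts])
    hist = np.bincount(bins, minlength=7)[1:].astype(float)      # bins 1..6
    return hist / hist.sum()

def sample_finetune_patches(patches, counts, target_hist, n=1000, seed=0):
    """Randomly draw candidate-scene patches so that their bin proportions
    match the target scene's density histogram."""
    rng = np.random.default_rng(seed)
    bins = np.array([density_bin(y) for y in counts])
    picks = []
    for b in range(1, 7):
        pool = np.flatnonzero(bins == b)
        k = int(round(n * target_hist[b - 1]))
        if len(pool) and k:
            picks.extend(rng.choice(pool, size=k, replace=True).tolist())
    return [patches[i] for i in picks]
```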
4) Model fine-tuning unit 40
The model fine-tuning unit 40 is configured to receive the acquired similar data and to fine-tune the pre-trained CNN 500 using that data, so that the CNN 500 can predict the density map/distribution and pedestrian count in a region of interest of the video frame to be detected. The fine-tuned crowd CNN model achieves better performance on the target scene.
In one embodiment of the present application, the fine-tuning unit 40 samples the similar image patches obtained from unit 30 and inputs them to the pre-trained CNN for fine-tuning (e.g., by using mini-batch gradient descent and backpropagation until the estimated density map/distribution converges to the truth density map/distribution created by the truth density map creation unit 10). The fine-tuning unit 40 then switches the objective and learns the mapping from crowd image patches to counts until the learned counts converge to the counts estimated by the truth density map creation unit 10. Finally, it is determined whether the estimated density map/distribution and count have converged to the truth values; if not, the above steps are repeated.
The fine-tuned prediction model generated by the model fine-tuning unit 40 can receive a video frame to be detected and a region of interest, and then predict the estimated density map and pedestrian count in the region of interest.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects (which may all generally be referred to herein as a "unit," "circuit," "module," or "system"). Much of the inventive functionality and many of the inventive principles are best implemented with or in integrated circuits (ICs), such as digital signal processors, and with software or application-specific ICs. Notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, it is expected that one of ordinary skill, guided by the concepts and principles disclosed herein, will be readily capable of generating such ICs with minimal experimentation. Therefore, in the interest of brevity and to minimize any risk of obscuring the principles and concepts of the present invention, further discussion of such software and ICs, if any, will be limited to the essentials of the principles and concepts used by the preferred embodiments.
Additionally, the present invention may take the form of an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software aspects. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium. Fig. 7 illustrates a system 7000 for generating a prediction model to predict crowd density distribution and people count in an image frame. The system 7000 comprises: a memory 7001 storing executable components; and a processor 7002 electrically coupled to the memory 7001 to execute the executable components to perform the operations of the system 7000. These executable components may include: a truth density map creation component 701, a CNN training component 702, a similar data acquisition component 703, and a model fine-tuning component 704.
The truth density map creation section 701 is configured to: selecting image patches from one or more training image frames in a training set; and determining a true crowd distribution in the selected image patch and a true total pedestrian count in the selected image patch. The CNN training component 702 is configured to train the CNN by inputting one or more crowd image patches from frames in a training set, each crowd image patch having a predetermined truth density distribution and person count among the inputted crowd image patches.
The similarity data acquisition component 703 is configured to sample frames from the target scene image set and receive training images from the training set having the determined truth density distribution and counts/numbers; and acquiring similar image data from the received training frame for each sampled target image frame to overcome a scene gap between the target scene image set and the training image.
The model fine-tuning component 704 is configured to fine-tune the CNN by inputting the similar image data to the CNN 500 to determine a prediction model for predicting the crowd density map and people count in an image frame.
The functions of the components 701 to 704 are similar to those of the units 10 to 40, respectively, and detailed descriptions thereof are therefore omitted here.
While preferred examples of the present invention have been described, variations or modifications of those examples may occur to those skilled in the art upon learning the basic inventive concepts. It is intended that the appended claims be construed to include the preferred examples and all such variations or modifications as fall within the scope of the invention.

Claims (22)

1. A method for generating a prediction model to predict crowd density distribution and people count in an image frame, comprising:
training a convolutional neural network by inputting one or more crowd image patches of a training image frame in a training set, each of the crowd image patches having a predetermined true-value density distribution and a predetermined true-value total pedestrian count in the input crowd image patches;
sampling a target image frame from a target scene image set and receiving training image frames from the training set having the predetermined truth density distribution and the predetermined truth pedestrian total count;
acquiring similar image data from the received training image frames for each sampled target image frame to overcome a scene gap between the target scene image set and the training image frame; and
fine-tuning the trained convolutional neural network by inputting the similar image data to the trained convolutional neural network to determine a prediction model for predicting crowd density distribution and people count in image frames.
2. The method of claim 1, further comprising:
selecting a crowd image patch from one or more training image frames in the training set; and
determining a truth density distribution in the selected crowd image patch and a true total pedestrian count in the selected crowd image patch.
3. The method of claim 2, wherein the determining further comprises:
identifying each person having a tagged head position in each of the training image frames;
determining a body shape kernel for each identified person, the body shape kernel comprising: a normalized two-dimensional Gaussian kernel as the head part and a bivariate normal distribution as the body part; and
combining all of the determined body volume kernels to form a truth density distribution for each of the training image frames, wherein a true value total pedestrian count for the training image frames is equal to a sum of all values of the body volume kernels in the truth density distribution.
4. The method of claim 1, wherein the training further comprises:
randomly initializing the convolutional neural network based on a Gaussian random distribution;
sampling image patches from the training image frame;
estimating, by the convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the convolutional neural network until the estimated distribution converges to the predetermined truth density distribution; and
further updating parameters of the convolutional neural network until the estimated number converges to the predetermined true value total pedestrian count, thereby obtaining a pre-trained convolutional neural network.
5. The method of claim 4, wherein the acquiring similar image data further comprises:
acquiring candidate fine-tuning frame data with perspective distribution similar to that of the target image frame from the training image frame; and
selecting similar image blocks having a density distribution similar to that of the target image frame from the candidate scenes.
6. The method of claim 5, wherein the fine tuning further comprises:
sampling image blocks from the similar image blocks;
estimating, by the pre-trained convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the convolutional neural network until the estimated distribution converges to a predetermined truth density distribution; and
further updating parameters of the pre-trained convolutional neural network until the estimated number converges to a predetermined true value total pedestrian count to obtain a fine-tuned convolutional neural network.
7. The method of any of claims 1 to 6, wherein the person count in the image frame to which an image patch corresponds is calculated by integrating the determined crowd density distribution over the image patch.
8. The method of any of claims 1 to 6, wherein the crowd density distribution is created based on a spatial position of a pedestrian in each image frame, a body shape in each image frame, and a perspective deformation of the image.
9. An apparatus for generating a prediction model to predict crowd density distribution and people count in an image frame, comprising:
a convolutional neural network training unit (20) for training a convolutional neural network by inputting one or more crowd image patches from training image frames in a training set, each of the crowd image patches having a predetermined truth density distribution and a predetermined truth pedestrian total count;
a similar data acquisition unit (30) sampling target image frames from a target scene image set and receiving training image frames from the training set with a determined predetermined true-value density distribution and a predetermined true-value total count of pedestrians, and acquiring similar image data from the received training image frames for each sampled target image frame to overcome a scene gap between the target scene image set and the training image frames; and
a model fine-tuning unit (40) that fine-tunes the convolutional neural network by inputting the similar image data to the trained convolutional neural network to determine a prediction model for predicting a population density distribution and a person count in an image frame.
10. The apparatus of claim 9, further comprising:
a truth density map creation unit (10) selecting a crowd image patch from one or more training image frames in the training set; and determining a true density distribution in the selected crowd image patch and a true total pedestrian count in the selected crowd image patch.
11. The apparatus according to claim 10, wherein the truth density map creation unit (10) is configured to determine the truth density distribution in the selected crowd image patch and the true pedestrian total count in the selected crowd image patch by:
identifying each person having a tagged head position in each of the training image frames;
determining a body shape kernel for each identified person, the body shape kernel comprising: a normalized two-dimensional Gaussian kernel as the head part and a bivariate normal distribution as the body part; and
combining all of the determined body volume kernels to form a truth density distribution for each of the training image frames, wherein a true value total pedestrian count for the training image frames is equal to a sum of all values of the body volume kernels in the truth density distribution.
12. The apparatus of claim 9, wherein the convolutional neural network training unit (20) trains the convolutional neural network by:
randomly initializing the convolutional neural network based on a Gaussian random distribution;
sampling image patches from the training image frame;
estimating, by the convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the convolutional neural network until the estimated distribution converges to the predetermined truth density distribution; and
further updating parameters of the convolutional neural network until the estimated number converges to the predetermined true value total pedestrian count to obtain a pre-trained convolutional neural network.
13. The apparatus according to claim 12, wherein the similar data acquisition unit (30) is configured to:
acquiring candidate fine-tuning frame data with perspective distribution similar to that of the target image frame from the training image frame; and
selecting similar image blocks having a density distribution similar to that of the target image frame from the candidate scenes.
14. The apparatus of claim 13, wherein the fine tuning unit is further configured to:
sampling image blocks from the similar image blocks;
estimating, by the pre-trained convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the convolutional neural network until the estimated distribution converges to a predetermined true density distribution; and
further updating parameters of the pre-trained convolutional neural network until the estimated number converges to a predetermined true value total pedestrian count to obtain a fine-tuned convolutional neural network.
15. The apparatus of any of claims 9 to 14, wherein the person count in the image frame to which an image patch corresponds is calculated by integrating the determined crowd density distribution over the image patch.
16. The device of any of claims 9 to 14, wherein the crowd density distribution is created based on a spatial position of a pedestrian in each image frame, a body shape in each image frame, and a perspective deformation of the image.
17. A system for generating a prediction model to predict crowd density distribution and people count in an image frame, comprising:
a memory storing executable components; and
a processor electrically coupled to the memory to execute the executable components to perform operations of the system, wherein the executable components comprise:
a convolutional neural network training section training a convolutional neural network by inputting one or more crowd image patches of a training image frame in a training set, each of the crowd image patches having a predetermined true-value density distribution and a predetermined true-value total pedestrian count among the inputted crowd image patches;
a similarity data acquisition component that samples target image frames from a target scene image set, receives training image frames from the training set having the predetermined truth density distribution and the predetermined truth total pedestrian count, and acquires similarity image data from the received training image frames for each of the sampled target image frames to overcome a scene gap between the target scene image set and the training image frames; and
a model fine-tuning component that fine-tunes the convolutional neural network by inputting the similar image data to the trained convolutional neural network to determine a prediction model for predicting a crowd density map and a people count in an image frame.
18. The system of claim 17, further comprising:
a truth density map creation component that selects crowd image patches from one or more training image frames in the training set; and determining a true density distribution in the selected crowd image patch and a true total pedestrian count in the selected crowd image patch.
19. The system according to claim 18, wherein the truth density map creation component is configured to determine the truth density distribution in the selected crowd image patch and the truth total pedestrian count in the selected crowd image patch by:
identifying each person having a tagged head position in each of the training image frames;
determining a body shape kernel for each identified person, the body shape kernel comprising: a normalized two-dimensional Gaussian kernel as the head part and a bivariate normal distribution as the body part; and
combining all of the determined body volume kernels to form a truth density distribution for each of the training image frames, wherein a true value total pedestrian count for the training image frames is equal to a sum of all values of the body volume kernels in the truth density distribution.
20. The system of claim 17, wherein the convolutional neural network training component trains the convolutional neural network by:
randomly initializing the convolutional neural network based on a Gaussian random distribution;
sampling image patches from the training image frame;
estimating, by the convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the convolutional neural network until the estimated distribution converges to the predetermined truth density distribution; and
further updating parameters of the convolutional neural network until the estimated number converges to the predetermined true value total pedestrian count to obtain a pre-trained convolutional neural network.
21. The system of claim 20, wherein the similar data acquisition component is configured to:
acquiring candidate fine-tuning frame data with perspective distribution similar to that of the target image frame from the training image frame; and
selecting similar image blocks having a density distribution similar to that of the target image frame from the candidate scenes.
22. The system of claim 21, wherein the fine tuning component is further configured to:
sampling image blocks from the similar image blocks;
estimating, by the pre-trained convolutional neural network, a population density distribution in the sampled image patches and a people count in the sampled image patches;
updating parameters of the pre-trained convolutional neural network until the estimated distribution converges to a predetermined truth density distribution; and
further updating parameters of the convolutional neural network until the estimated number converges to a predetermined true value total pedestrian count to obtain a fine-tuned convolutional neural network.
CN201580080145.XA, filed 2015-05-18 (priority date 2015-05-18): Method and apparatus for generating a predictive model. Status: Active. Granted as CN107624189B.

Applications Claiming Priority (1)

PCT/CN2015/079178 (WO2016183766A1), priority date 2015-05-18, filing date 2015-05-18: Method and apparatus for generating predictive models

Publications (2)

CN107624189A (application): 2018-01-23
CN107624189B (grant): 2020-11-20

Family

Family ID: 57319199

Family Applications (1)

CN201580080145.XA, priority date 2015-05-18, filing date 2015-05-18: Method and apparatus for generating a predictive model

Country Status (2)

CN (1): CN107624189B
WO (1): WO2016183766A1

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10547971B2 (en) 2015-11-04 2020-01-28 xAd, Inc. Systems and methods for creating and using geo-blocks for location-based information service
US10455363B2 (en) * 2015-11-04 2019-10-22 xAd, Inc. Systems and methods for using geo-blocks and geo-fences to discover lookalike mobile devices
CN107566781B (en) * 2016-06-30 2019-06-21 北京旷视科技有限公司 Video monitoring method and video monitoring equipment
CN106997459B (en) * 2017-04-28 2020-06-26 成都艾联科创科技有限公司 People counting method and system based on neural network and image superposition segmentation
CN108875456B (en) * 2017-05-12 2022-02-18 北京旷视科技有限公司 Object detection method, object detection apparatus, and computer-readable storage medium
CN107330364B * 2017-05-27 2019-12-03 上海交通大学 People counting method and system based on a cGAN network
CN107563349A * 2017-09-21 2018-01-09 电子科技大学 Crowd size estimation method based on VGGNet
CN107657226B (en) * 2017-09-22 2020-12-29 电子科技大学 People number estimation method based on deep learning
CN107609597B (en) * 2017-09-26 2020-10-13 嘉世达电梯有限公司 Elevator car number detection system and detection method thereof
CN111295689B (en) * 2017-11-01 2023-10-03 诺基亚技术有限公司 Depth aware object counting
CN107977025A * 2017-11-07 2018-05-01 中国农业大学 Regulation and control system and method for dissolved oxygen in industrialized aquaculture
CN108154089B (en) * 2017-12-11 2021-07-30 中山大学 Size-adaptive-based crowd counting method for head detection and density map
CN108615027B (en) * 2018-05-11 2021-10-08 常州大学 Method for counting video crowd based on long-term and short-term memory-weighted neural network
CN109034355B (en) * 2018-07-02 2022-08-02 百度在线网络技术(北京)有限公司 Method, device and equipment for predicting number of people in dense crowd and storage medium
CN109117791A * 2018-08-14 2019-01-01 中国电子科技集团公司第三十八研究所 Crowd density map generation method based on dilated convolution
US11134359B2 (en) 2018-08-17 2021-09-28 xAd, Inc. Systems and methods for calibrated location prediction
US10349208B1 (en) 2018-08-17 2019-07-09 xAd, Inc. Systems and methods for real-time prediction of mobile device locations
US11172324B2 (en) 2018-08-17 2021-11-09 xAd, Inc. Systems and methods for predicting targeted location events
CN109635634B (en) * 2018-10-29 2023-03-31 西北大学 Pedestrian re-identification data enhancement method based on random linear interpolation
CN109447008B (en) * 2018-11-02 2022-02-15 中山大学 Crowd analysis method based on attention mechanism and deformable convolutional neural network
CN109409318B (en) * 2018-11-07 2021-03-02 四川大学 Statistical model training method, statistical device and storage medium
CN111191667B (en) * 2018-11-15 2023-08-18 天津大学青岛海洋技术研究院 Crowd counting method based on multiscale generation countermeasure network
CN111291587A (en) * 2018-12-06 2020-06-16 深圳光启空间技术有限公司 Pedestrian detection method based on dense crowd, storage medium and processor
CN109815936B (en) * 2019-02-21 2023-08-22 深圳市商汤科技有限公司 Target object analysis method and device, computer equipment and storage medium
CN110197502B (en) * 2019-06-06 2021-01-22 山东工商学院 Multi-target tracking method and system based on identity re-identification
CN110826496B (en) * 2019-11-07 2023-04-07 腾讯科技(深圳)有限公司 Crowd density estimation method, device, equipment and storage medium
US11106904B2 (en) * 2019-11-20 2021-08-31 Omron Corporation Methods and systems for forecasting crowd dynamics
CN110942015B (en) * 2019-11-22 2023-04-07 上海应用技术大学 Crowd density estimation method
CN111062275A (en) * 2019-12-02 2020-04-24 汇纳科技股份有限公司 Multi-level supervision crowd counting method, device, medium and electronic equipment
CN111178235A (en) * 2019-12-27 2020-05-19 卓尔智联(武汉)研究院有限公司 Target quantity determination method, device, equipment and storage medium
CN111274900B * 2020-01-15 2021-01-01 北京航空航天大学 Airborne crowd counting method based on low-level feature extraction
CN111340801A * 2020-03-24 2020-06-26 新希望六和股份有限公司 Livestock counting method, device, equipment and storage medium
CN111626141B (en) * 2020-04-30 2023-06-02 上海交通大学 Crowd counting model building method, counting method and system based on generated image
CN111652168B (en) * 2020-06-09 2023-09-08 腾讯科技(深圳)有限公司 Group detection method, device, equipment and storage medium based on artificial intelligence
CN112001274B (en) * 2020-08-06 2023-11-17 腾讯科技(深圳)有限公司 Crowd density determining method, device, storage medium and processor
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN112990530B (en) * 2020-12-23 2023-12-26 北京软通智慧科技有限公司 Regional population quantity prediction method, regional population quantity prediction device, electronic equipment and storage medium
CN113822111B (en) * 2021-01-19 2024-05-24 北京京东振世信息技术有限公司 Crowd detection model training method and device and crowd counting method and device
CN112801018B (en) * 2021-02-07 2023-07-07 广州大学 Cross-scene target automatic identification and tracking method and application
CN113033342A (en) * 2021-03-10 2021-06-25 西北工业大学 Crowd scene pedestrian target detection and counting method based on density estimation
CN113269224B (en) * 2021-03-24 2023-10-31 华南理工大学 Scene image classification method, system and storage medium
CN113920391B * 2021-09-17 2024-06-25 北京理工大学 Target counting method based on scale-adaptive ground-truth map generation
CN115293465B (en) * 2022-10-09 2023-02-14 枫树谷(成都)科技有限责任公司 Crowd density prediction method and system
CN118155142A (en) * 2024-05-09 2024-06-07 浙江大华技术股份有限公司 Object density recognition method and event recognition method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection
CN104268524A (en) * 2014-09-24 2015-01-07 朱毅 Convolutional neural network image recognition method based on dynamic adjustment of training targets

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7991193B2 (en) * 2007-07-30 2011-08-02 International Business Machines Corporation Automated learning for people counting systems
CN103971100A (en) * 2014-05-21 2014-08-06 国家电网公司 Video-based camouflage and peeping behavior detection method for automated teller machine
CN104077613A * 2014-07-16 2014-10-01 电子科技大学 Crowd density estimation method based on cascaded multilevel convolutional neural network
CN104573744A * 2015-01-19 2015-04-29 上海交通大学 Fine-grained classification recognition method and object part localization and feature extraction method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pedestrian detection with convolutional neural networks; M. Szarvas et al.; IEEE Proceedings. Intelligent Vehicles Symposium, 2005; 2005-06-08; full text *
多种人群密度场景下的人群计数 [Crowd counting in scenes of varying crowd density]; 覃勋辉; 《中国图象图形学报》 (Journal of Image and Graphics); 2013-04-30; Vol. 18, No. 4; full text *

Also Published As

Publication number Publication date
CN107624189A (en) 2018-01-23
WO2016183766A1 (en) 2016-11-24

Similar Documents

Publication Publication Date Title
CN107624189B (en) Method and apparatus for generating a predictive model
US10096122B1 (en) Segmentation of object image data from background image data
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
CN110765860B (en) Tumble judging method, tumble judging device, computer equipment and storage medium
CN109472191B (en) Pedestrian re-identification and tracking method based on space-time context
EP1975879A2 (en) Computer implemented method for tracking object in sequence of frames of video
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN107346414B (en) Pedestrian attribute identification method and device
Ma et al. Counting people crossing a line using integer programming and local features
CN108875456B (en) Object detection method, object detection apparatus, and computer-readable storage medium
JP5936561B2 (en) Object classification based on appearance and context in images
CN108198172B (en) Image significance detection method and device
WO2016179808A1 (en) An apparatus and a method for face parts and face detection
CN111598067B (en) Re-recognition training method, re-recognition method and storage device in video
dos Santos Rosa et al. Sparse-to-continuous: Enhancing monocular depth estimation using occupancy maps
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN109063549A Moving object detection method for high-resolution aerial video based on a deep neural network
CN114399644A (en) Target detection method and device based on small sample
Hambarde et al. Single image depth estimation using deep adversarial training
Li et al. RGBD relocalisation using pairwise geometry and concise key point sets
CN112686178A (en) Multi-view target track generation method and device and electronic equipment
CN113033468A (en) Specific person re-identification method based on multi-source image information
Gomes et al. Robust underwater object detection with autonomous underwater vehicle: A comprehensive study
CN112949539A (en) Pedestrian re-identification interactive retrieval method and system based on camera position
Ji et al. 3d reconstruction of dynamic textures in crowd sourced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant