CN111295689B - Depth aware object counting - Google Patents


Info

Publication number
CN111295689B
CN111295689B
Authority
CN
China
Prior art keywords
image
segment
filter
density map
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201780096479.5A
Other languages
Chinese (zh)
Other versions
CN111295689A (en)
Inventor
姜晓恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of CN111295689A publication Critical patent/CN111295689A/en
Application granted granted Critical
Publication of CN111295689B publication Critical patent/CN111295689B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/04Indexing scheme for image data processing or generation, in general involving 3D image data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30242Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Methods and apparatus, including computer program products, are provided for depth aware object counting. In some example embodiments, a method may be provided that includes processing a first segment of an image and a second segment of the image through a trained machine learning model, the first segment being processed using a first filter selected based on depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on depth information to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and providing, by the trained machine learning model, an output based on the density map. Related systems, methods, and articles of manufacture are also described.

Description

Depth aware object counting
Technical Field
The subject matter described herein relates to machine learning.
Background
Machine learning techniques enable computers to learn tasks. For example, machine learning may allow a computer to learn to perform a task during a training phase. Later, during an operational phase, the computer may perform the learned task. Machine learning may take the form of neural networks, such as deep learning neural networks and convolutional neural networks (CNNs), as well as support vector machines, Bayesian classifiers, and other types of machine learning models.
Disclosure of Invention
Methods and apparatus, including computer program products, are provided for depth aware object counting.
In some example embodiments, a method may be provided that includes processing a first segment of an image and a second segment of the image through a trained machine learning model, the first segment being processed using a first filter selected based on depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on depth information to enable formation of a second density map; combining the first density map and the second density map by a trained machine learning model to form a density map of the image; and providing an output based on the density map through the trained machine learning model, the output representing an estimate of the number of objects in the image.
In some variations, one or more of the features disclosed herein, including the following features, may optionally be included in any feasible combination. The trained machine learning model may receive an image comprising a plurality of objects, wherein the image is segmented into at least a first segment and a second segment based on depth information. The depth information may be received from another machine learning model trained to output depth information from the image. The trained machine learning model may include a multi-column convolutional neural network including a first convolutional neural network and a second convolutional neural network. The first convolutional neural network may include the first filter, and the second convolutional neural network may include the second filter. The first filter and the second filter may each include a convolutional layer. The depth information may indicate a position of the first segment and/or the second segment. The depth information may indicate an object size due to distance from the camera. The depth information may indicate a first filter size of the first filter and a second filter size of the second filter. The trained machine learning model may select the first filter size of the first filter and the second filter size of the second filter based on the depth information. Training may be based on reference images, such that the machine learning model learns to generate density maps. The plurality of objects may include a plurality of people, a plurality of vehicles, and/or a crowd of people. The first density map may estimate a density of objects in the first segment, the second density map may estimate a density of objects in the second segment, and the density map may estimate the density of objects in the image.
The above aspects and features may be implemented in systems, apparatus, methods, and/or articles of manufacture as desired. The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
Drawings
In the drawings,
FIG. 1 depicts an example of an image including a crowd and a corresponding density map, according to some example embodiments;
FIG. 2A depicts an example of a Convolutional Neural Network (CNN) according to some example embodiments;
FIG. 2B depicts another example of a CNN, according to some example embodiments;
FIG. 3A depicts an example of neurons for a neural network, according to some example embodiments;
FIG. 3B depicts an example of a neural network including at least one neuron, according to some example embodiments;
FIG. 4 depicts a multi-column convolutional neural network (MCCNN) according to some example embodiments;
FIGS. 5A-5D depict process flows for determining an object count, according to some example embodiments;
FIG. 6 depicts an example of an apparatus according to some example embodiments; and
FIG. 7 depicts another example of an apparatus, according to some example embodiments.
In the drawings, like reference numerals are used to refer to the same or like items.
Detailed Description
Machine learning may be used to perform one or more tasks, such as counting the number of objects within at least one image. For example, a machine learning model, such as a neural network, convolutional neural network (CNN), multi-column CNN (MCCNN), and/or other type of machine learning model, may be trained to learn how to process at least one image to determine an estimate of the number of objects, such as people or other types of objects, in the at least one image (which may be in the form of video frames). To illustrate by way of an example, public safety officers may want to know the crowd count for a given location, which may be useful for a variety of reasons, including crowd control, limiting the number of people at a location, minimizing the risk of trampling, and/or minimizing the risk of some other disturbance related to large crowds. To further illustrate by way of another example, a traffic safety officer may want to know the vehicle count on a road (or at a certain location), and this count may be useful for various reasons, including traffic congestion control and management. According to some example embodiments, a trained machine learning model may be used to count objects, such as people, vehicles, or other objects, in at least one image.
When counting objects in an image, the trained machine learning model may provide an actual count of the number of objects estimated to be in the image, or may provide a density map that estimates the number of objects per unit area, such as the number of objects per square meter. The density map may provide more information in the sense that it estimates not only the number of objects in the image but also the distribution, or density, of objects over the image.
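For illustration only, the following Python sketch shows how a count can be read off a density map by summing its values; the map values and region boundaries are invented for the example and are not taken from the patent.

```python
import numpy as np

# Hypothetical 4x6 density map: each cell holds the estimated number of
# objects (e.g., people) falling within that cell of the image grid.
density_map = np.array([
    [0.0, 0.1, 0.2, 0.2, 0.1, 0.0],
    [0.1, 0.3, 0.5, 0.4, 0.2, 0.1],
    [0.2, 0.6, 0.9, 0.8, 0.5, 0.2],
    [0.1, 0.4, 0.7, 0.6, 0.3, 0.1],
])

# The total count estimate is the sum over the whole map ...
total_count = density_map.sum()

# ... and a regional count is the sum over any sub-region of interest.
foreground_count = density_map[2:, :].sum()

print(f"estimated total count: {total_count:.1f}")
print(f"estimated count in lower (foreground) region: {foreground_count:.1f}")
```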
Although some examples described herein relate to counting people in an image, this is merely an example of the types of objects that may be counted, as other types of objects, such as vehicles, may also be counted.
FIG. 1 depicts an example of an image 100 including objects to be counted and a corresponding density map 105, according to some example embodiments. In the example of FIG. 1, the objects are people, although, as noted, they may be other types of objects.
The density map 105 may provide information about the objects, such as people, in the image 100, for example a density of people per square meter, a distribution of people in the image, and/or a count of the number of people in at least a portion of the image. In the crowd counting example, the scale of objects, such as people, in an image may change due to the perspective of the camera relative to the people. For example, a person in the foreground of the image 100, being closer to the camera, appears larger than a similarly sized person in the background who is farther from the camera. Such perspective-induced changes in size may affect the accuracy of the count of objects in the at least one image 100 and the accuracy of the corresponding density map 105.
In some example embodiments, a machine learning model, such as a neural network, CNN, MCCNN, or the like, may be used to determine an estimate of the number of objects, such as people, in an image. The estimate may be in the form of a density map of the image. In some example embodiments, the machine learning model may be implemented as an MCCNN, although other types of machine learning models may also be used. Crowd counting with an MCCNN is described in the paper by Y. Zhang et al., "Single-image crowd counting via multi-column convolutional neural network," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition.
In some example embodiments, the density map 105 of the image 100 may be determined by segmenting the image into at least two regions based on the relative distance of the objects, such as people, from the camera viewpoint, although the image may be segmented into other numbers of regions (e.g., 3, 4, or more). According to some example embodiments, for each segmented region, a machine learning model, such as an MCCNN configured with at least one filter selected to handle the object sizes (e.g., head or person sizes) in the corresponding region, may determine a density map. According to some example embodiments, the density maps for the segmented regions may then be combined to form a density map 105 of the entire image 100. Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect of one or more of the example embodiments disclosed herein may be increased processing speed, as segments of the image are processed rather than the entire image, and/or another technical effect may be a more accurate count, as each segment is processed with a filter dedicated to accounting for the perspective effects of that region and the size of the objects in it.
Fig. 2A depicts an example of a CNN 200 according to some example embodiments. The CNN may include at least one convolutional layer 210, 230, at least one pooling layer 220, 240, and a fully-connected layer 250.
The convolutional layer 210 may be referred to as a filter and may include a matrix that is convolved with at least a portion of the input image 100. As described above, the size of the filter, or matrix, may be varied to detect and filter objects. In this example, a 7×7 matrix is selected as the filter convolved with the image 100 at 210, so objects to be counted should occupy no more than 7×7 pixels in order to be captured correctly (objects larger than 7×7 pixels will be filtered out). The pooling layer 220 may be used to downsample the convolved image output by the convolutional layer 210. To downsample the convolved image into a smaller image, the pooling layer may be formed by sliding a window (or vector) over the convolved image output by the convolutional layer 210. The pooling layer may have a stride, which here corresponds to the width of the window in pixels. The fully connected layer 250 may generate the output 204.
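For illustration only, the following is a minimal PyTorch sketch of a CNN of the kind just described (a 7×7 convolution, pooling, a 5×5 convolution, pooling, and a fully connected output layer). The channel counts, input size, and pooling parameters are assumptions chosen to make the sketch runnable; they are not specified by the patent.

```python
import torch
import torch.nn as nn

class SimpleCountingCNN(nn.Module):
    """Rough analogue of CNN 200: conv -> pool -> conv -> pool -> fully connected."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7, padding=3),   # 7x7 filter (cf. layer 210)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # pooling (cf. layer 220)
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # 5x5 filter (cf. layer 230)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # pooling (cf. layer 240)
        )
        # Fully connected layer (cf. 250) produces the output (cf. 204); here a
        # single scalar count estimate for a fixed 64x64 grayscale input.
        self.fc = nn.Linear(32 * 16 * 16, 1)

    def forward(self, x):
        x = self.features(x)
        return self.fc(torch.flatten(x, start_dim=1))

model = SimpleCountingCNN()
dummy_image = torch.randn(1, 1, 64, 64)   # batch of one grayscale image
print(model(dummy_image).shape)           # torch.Size([1, 1])
```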
FIG. 2B depicts another example of a CNN, CNN 299, according to some example embodiments. The CNN 299 may be configured to determine how to segment the input image based on a depth map. The depth map provides information about the relative distance of an object, such as a person, head, etc., from the camera. For example, the CNN 299 may determine segments 298A-C of the input image 100 based on the depth map 277. The perspective-induced size effects within a given segment may be the same or similar, so a filter convolved with that segment may be better able to detect objects of interest such as heads, people, etc.
In the example of FIG. 2B, the CNN 299 may be trained to determine the depth map 277, according to some example embodiments. As described, the depth map 277 may provide an indication of the relative distance of an object (e.g., a person, a head, etc.) from the camera. In this way, the depth map may provide an indication of the differences in size in the image caused by the viewing angle. In the depth map 277, objects farther from the camera may have brighter pixels than objects closer to the camera. As such, the depth map 277 may be used to segment the image 100 into two or more segmented regions, such as 298A-C, based on these perspective-induced size differences. Although this example uses a depth map with brighter pixels for objects farther away, those pixels may instead be darker or have other values to represent depth.
To further illustrate, the first segmented region 298A may have objects that appear smaller in size (due to the perspective) than those in the second segmented region 298B. Likewise, the second segmented region 298B may have objects that appear smaller in size (due to the perspective) than those in the third segmented region 298C. Although this example segments the image 100 into three segments, other numbers of segments may be used.
In some example embodiments, the CNN 299 may be trained using reference images. These reference images may include objects, such as people in a crowd, and labels indicating segments that are determined a priori based on the relative size differences caused by the viewing angle. Furthermore, these segments of the reference images may correspond to objects of a particular size in each segment, and thus to a corresponding filter size. The CNN may then be trained until it learns to segment the reference images, which may also determine the filter size to be used for each segment. According to some example embodiments, once trained, the trained CNN 299 may be used to determine segments in other input images. In some example embodiments, training of the CNN 299 is described further below with respect to FIG. 5A.
In the example of FIG. 2B, the CNN 299 may include a 7×7 convolutional layer 210 (the initial filter layer), followed by a 3×3 pooling layer 220, a 5×5 convolutional layer 230, a 3×3 pooling layer 240, a 3×3 convolutional layer 265, a 3×3 convolutional layer 267, and a 3×3 pooling layer 268, which is then coupled to the fully connected layer 250 (also referred to as the activation layer). The fully connected layer may generate an output, in this example the depth map 277. Although the CNN 299 is depicted with a certain configuration of layers, other types and numbers of layers may be implemented to provide machine learning that generates the depth map 277 and the associated segments 298A-C. In some example embodiments, one or more thresholds may be used to form the segments 298A-C. For example, pixels brighter than a certain threshold may be assigned to segment 298A, while pixels darker than a certain threshold may be assigned to segment 298C. Moreover, as described, each segment 298A-C may have objects of a certain size, and thus maps to a filter of a given size at 410A, 410B, and 410C, as explained below with respect to FIG. 4.
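For illustration only, a rough PyTorch sketch of a depth-estimating CNN along the lines of CNN 299 is shown below, together with a thresholding step that forms three segments from the resulting depth map. The channel counts, thresholds, and the use of a 1×1 convolution in place of the fully connected layer 250 are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class DepthSegmentationCNN(nn.Module):
    """Sketch following the layer order described for CNN 299:
    7x7 conv -> 3x3 pool -> 5x5 conv -> 3x3 pool -> 3x3 conv -> 3x3 conv
    -> 3x3 pool -> output stage producing a depth map."""

    def __init__(self, out_size=(60, 80)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # 1x1 convolution standing in for the final fully connected stage,
        # so a per-pixel depth value is produced at the reduced resolution.
        self.head = nn.Conv2d(16, 1, kernel_size=1)
        self.out_size = out_size

    def forward(self, x):
        depth = self.head(self.backbone(x))
        # Upsample to a convenient map size and squash to [0, 1], with larger
        # (brighter) values meaning farther from the camera.
        depth = nn.functional.interpolate(depth, size=self.out_size,
                                          mode="bilinear", align_corners=False)
        return torch.sigmoid(depth)

def segment_by_depth(depth_map, far_thresh=0.66, near_thresh=0.33):
    """Threshold the depth map into three segment masks (cf. 298A-C)."""
    seg_far = depth_map >= far_thresh          # far segment (cf. 298A)
    seg_near = depth_map < near_thresh         # near segment (cf. 298C)
    seg_mid = ~(seg_far | seg_near)            # middle segment (cf. 298B)
    return seg_far, seg_mid, seg_near

model = DepthSegmentationCNN()
depth = model(torch.randn(1, 3, 240, 320))
masks = segment_by_depth(depth)
print([m.float().mean().item() for m in masks])  # fraction of pixels per segment
```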
FIG. 3A depicts an example of an artificial neuron A_j 350 that may be implemented in a neural network, such as a CNN, an MCCNN, or the like, according to some example embodiments. It should be appreciated that FIG. 3A represents a model of the artificial neuron 350, and that the neuron 350 may have other configurations, including a different number of inputs and/or outputs. For example, the neuron 350 may include a plurality of inputs to receive values associated with the pixels of an image.
Referring to FIG. 3A, the neuron 350 may generate an output A_j(t) 370 based on the activation values A_i(t-1) (corresponding to A_0 through A_7) 360A-H, the connection weights w_ij 365A-H (labeled w_0j through w_7j), and the input values 310A-H (labeled S_0 through S_7). At a given time t, each of the activation values 360A-H may be multiplied by the corresponding weight 365A-H. For example, the connection weight w_0j 365A is multiplied by the activation value A_0 360A, the connection weight w_1j 365B by the activation value A_1 360B, and so on. The products (i.e., of the connection weights and the activation values) are then summed, and the resulting sum is operated on by a basis function K to generate the output A_j(t) 370 of node A_j 350 at time t. The output 370 may be used as an activation value at a later time (e.g., at t+1) or provided to another node.
The neuron 350 may be implemented according to a neural model such as the following:
$$A_j(t) = K\left(\sum_{i=0}^{n} A_i(t-1)\, w_{ij}\right)$$
where K corresponds to a basis function (examples of which include sigmoid, wavelet, and any other basis function); A_j(t) corresponds to the output value provided by a given neuron (e.g., the j-th neuron) at a given time t; A_i(t-1) corresponds to the previous output value (or activation value) of connection i assigned to the j-th neuron at the previous time t-1; and w_ij represents the i-th connection value for the j-th neuron, where j varies according to the number of neurons, the value of i varies between 0 and n, and n corresponds to the number of connections to the neuron.
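For illustration only, the neuron model above can be sketched in a few lines of Python (here with a sigmoid as the basis function K; the activation and weight values are invented for the example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(activations_prev, weights, basis=sigmoid):
    """A_j(t) = K( sum_i A_i(t-1) * w_ij ), as in the neuron model above."""
    weighted_sum = np.dot(activations_prev, weights)
    return basis(weighted_sum)

# Eight previous activations A_0..A_7 and the connection weights w_0j..w_7j.
activations = np.array([0.2, 0.8, 0.1, 0.5, 0.9, 0.3, 0.7, 0.4])
weights = np.array([0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.05, -0.2])

print(neuron_output(activations, weights))  # output A_j(t) of the neuron
```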
FIG. 3B depicts interconnected neurons 350 forming a neural network 399, according to some example embodiments. The neural network 399 may be configured to provide a CNN, such as CNN 200, CNN 299, etc., an MCCNN, or portions or layers of such neural networks (e.g., the convolutional layer 210 may be implemented using a plurality of interconnected neurons 350). The neurons 350 comprising the neural network 399 may be implemented using code, circuitry, and/or a combination thereof. In some example embodiments, the neurons 350 and/or the neural network 399 (which includes the neurons 350) may be implemented using dedicated circuitry, including, for example, at least one graphics processing unit (GPU), which is better suited than a conventional central processing unit to parallel processing, matrix operations, and the like, or dedicated neural network circuitry.
In the example of FIG. 3B, the neural network 399 may include an input layer 360A, one or more hidden layers 360B, and an output layer 360C. Although not shown, other layers, such as a pooling layer, may also be implemented. It should be appreciated that the 3-2-3 node structure of the neural network is used to facilitate the description; the neural network 399 may also be constructed in other configurations, such as a 3×3 structure, a 5×5 structure, a 7×7 structure, and/or other structures, each with or without hidden layer(s).
During training of a neural network, such as the neural network 399, training data, such as reference images with labels (e.g., indicating segments, depth maps, crowd counts, etc.), may be fed as input to the input layer 360A neurons over time (e.g., t, t+1, etc.) until the neural network 399 learns to perform the task. For example, in the example of FIG. 3B, the neural network 399 may receive labeled training data, such as reference images labeled with the appropriate segments, so that the CNN 299 may be trained iteratively until it learns to form the depth map and/or the segments of the image. To further illustrate, the neurons of the network may learn by minimizing a mean squared error (e.g., between the labeled training data at the input layer 360A and the data generated at the output of the output layer 360C) using gradient descent or the like. When the neural network has been trained, the configuration of the neural network, such as weight values, activation values, basis functions, etc., may be saved to a storage device. The saved configuration represents the trained neural network.
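For illustration only, the following PyTorch sketch shows a generic training loop of the kind described: labeled reference data are fed to the model, a mean squared error is minimized by gradient descent, and the learned configuration is saved to storage. The optimizer settings, file name, and data format are assumptions.

```python
import torch
import torch.nn as nn

def train(model, labeled_pairs, epochs=10, lr=1e-3):
    """Generic training loop of the kind described: feed labeled reference
    data, minimize mean squared error with gradient descent, save weights."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        for image, target in labeled_pairs:      # e.g. (image, depth map) pairs
            optimizer.zero_grad()
            prediction = model(image)
            loss = criterion(prediction, target) # error against the label
            loss.backward()                      # gradient of the loss
            optimizer.step()                     # gradient-descent update
    # Persist the learned configuration (weights, etc.) to storage.
    torch.save(model.state_dict(), "trained_model.pt")
```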
Referring again to FIG. 2B, the CNN 299 may be used to segment the image 100 into the regions 298A-C. As described above, each of the segmented regions 298A-C may have objects of approximately the same size (e.g., head or person sizes), and thus maps to a filter of a given size at 410A, 410B, and 410C. Furthermore, according to some example embodiments, the segmented regions 298A-C (and/or the filter sizes for the regions) may be provided to another machine learning model, such as the MCCNN 400 shown in FIG. 4.
According to some example embodiments, the MCCNN 400 may include a CNN 405A-C for each segmented region of the image. In the example of FIG. 4, there are three segmented regions 298A-C, and thus there are three columns in the MCCNN, each column including a corresponding one of the CNNs 405A-C.
The first CNN 405A may include a first convolutional layer 410A that provides a filter of, for example, 3×3 pixels. The filter may be selected based on the size of the objects in the segmented region 298A. As described above, the segmented region 298A may have objects of approximately the same size (e.g., head or person sizes), so the segmented region 298A may map to a filter size of, for example, 3×3 pixels at 410A. In other words, the depth information defining the location of the segments in the image 100 may also enable the MCCNN to select an appropriate filter size for each segment 298A-C. The first convolutional layer 410A may be followed by convolutional layer 412A, pooling layer 414A, convolutional layer 416A, pooling layer 417A, convolutional layer 418A, and fully connected layer 420A. Although the first CNN 405A includes the intermediate layers 412A-418A in a certain configuration, other types and/or numbers of layers may be implemented.
The second CNN 405B may include a first convolutional layer 410B that provides a filter of, for example, 5×5 pixels. The 5×5 pixel filter may be selected based on the size of the objects in the segmented region 298B. As described above with respect to filter 410A, the segmented region 298B may have objects of approximately the same size (e.g., head or person sizes), so the segmented region 298B may map to a filter size of, for example, 5×5 pixels at 410B. The first convolutional layer 410B may be followed by convolutional layer 412B, pooling layer 414B, convolutional layer 416B, pooling layer 417B, convolutional layer 418B, and fully connected layer 420B. Although the second CNN 405B includes the intermediate layers 412B-418B in a certain configuration, other types and/or numbers of layers may be implemented.
The third CNN 405C may include a first convolutional layer 410C that provides a filter of, for example, 7×7 pixels. The filter may be selected based on the size of the objects in the segmented region 298C. The segmented region 298C may also have objects of approximately the same size (e.g., head or person sizes), so the segmented region 298C may map to a filter size of, for example, 7×7 pixels at 410C. In other words, the depth information defining the position of the segments in the image 100 may also enable selection of an appropriate filter size for each segment. The first convolutional layer 410C may be followed by convolutional layer 412C, pooling layer 414C, convolutional layer 416C, pooling layer 417C, convolutional layer 418C, and fully connected layer 420C. Although the third CNN 405C includes the intermediate layers 412C-418C in a certain configuration, other types and/or numbers of layers may be implemented.
According to some example embodiments, the MCCNN 400 (which includes three CNN columns in this example) may include the first CNN 405A, the second CNN 405B, and the third CNN 405C: the first CNN 405A may have a filter 410A that samples the first segmented region 298A and outputs a first density map 498A for the first region; the second CNN 405B may have a filter 410B that samples the second segmented region 298B of the image and outputs a second density map 498B for the second region; and the third CNN 405C may have a filter 410C that samples the third segmented region 298C of the image and outputs a third density map 498C for the third region. According to some example embodiments, to generate the density map 499 of the entire image 100, the first, second, and third density maps 498A, 498B, and 498C may be combined. As described, the density map 499 may provide an estimate of the number of objects per unit area, from which the number of objects in the image and the distribution of objects over the image may be determined. In this example, the objects are people, although other types of objects in the image may also be counted.
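For illustration only, the following PyTorch sketch outlines a three-column network along the lines of the MCCNN 400, with first-layer filters of 3×3, 5×5, and 7×7 pixels and a simple summation that combines the per-segment density maps into a map for the whole image. The intermediate layers, channel counts, and the use of a 1×1 convolution in place of the fully connected layers 420A-C are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

def make_column(first_kernel):
    """One column CNN (cf. 405A-C): first conv filter sized per segment,
    followed by a fixed stack of conv/pool layers, ending in a 1x1 conv
    that emits a one-channel density map."""
    pad = first_kernel // 2
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=first_kernel, padding=pad), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=1),   # per-segment density map (cf. 498A-C)
    )

class MultiColumnCountingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 3x3 for the far segment, 5x5 for the middle, 7x7 for the near segment.
        self.columns = nn.ModuleList([make_column(k) for k in (3, 5, 7)])

    def forward(self, segments):
        """segments: list of three tensors, one masked image per segment."""
        maps = [col(seg) for col, seg in zip(self.columns, segments)]
        # Combine the per-segment density maps into a map for the whole image.
        return torch.stack(maps, dim=0).sum(dim=0)

model = MultiColumnCountingCNN()
segs = [torch.randn(1, 1, 128, 128) for _ in range(3)]
full_density_map = model(segs)
print(full_density_map.shape, full_density_map.sum().item())  # map and count
```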
In some example embodiments, as described, the filters 410A-C in each of the columns CNN 405A-C may be selected based on the size of the objects in the corresponding region and, in particular, on the perspective-induced size differences in the image. For example, within a given segmented region 298A-C of the image, the people (or their heads) may have the same or similar perspective, and thus the same or similar size. As such, the filter 410A of the first CNN 405A may be a smaller filter, to account for the similar person/head sizes in the region 298A that is farther from the camera, than the filter 410B used for the region 298B that is closer to the camera (and thus requires a larger filter). Likewise, the filter 410B of the second CNN 405B, which processes region 298B, may be a smaller filter than the filter 410C of the third CNN 405C, which processes region 298C. In this way, the MCCNN 400 may select a filter at 410A-C based on depth information for each of the three regions 298A-C, and each region may be processed using the corresponding column CNN 405A, 405B, or 405C, which is specifically configured for the approximate size of the objects (e.g., heads or people) in that region. Accordingly, the MCCNN 400 may select the size of the corresponding initial filter 410A, 410B, or 410C based on the depth information indicating the segment and the object size in the segment, so that the objects in the region can pass the corresponding filter.
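For illustration only, the depth-based filter selection can be sketched as a simple mapping from a segment's average depth value to the kernel size of the column that processes it; the thresholds are invented for the example.

```python
def pick_filter_size(mean_segment_depth, thresholds=(0.33, 0.66)):
    """Map a segment's average depth (0 = nearest, 1 = farthest) to the
    kernel size of the column that should process it: distant objects look
    small and get the small filter, nearby objects get the large filter."""
    near, far = thresholds
    if mean_segment_depth >= far:
        return 3     # far segment (e.g. 298A) -> 3x3 filter
    if mean_segment_depth >= near:
        return 5     # middle segment (e.g. 298B) -> 5x5 filter
    return 7         # near segment (e.g. 298C) -> 7x7 filter

print([pick_filter_size(d) for d in (0.9, 0.5, 0.1)])  # [3, 5, 7]
```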
In some example embodiments, the MCCNN 400 may be trained using a set of reference images. These reference images may have been segmented and may have a known density map for each segment. The reference images may represent ground truth in the sense that the number of people in the image(s) (or segment(s)), or the corresponding density map, is known with a degree of certainty. The MCCNN 400 may then be trained until it learns to generate the density maps of the reference images. According to some example embodiments, once trained, the trained MCCNN may be used to determine density maps of other input images.
Referring again to FIG. 1, the image 100 (which is being processed to determine an object count) may represent a video stream captured by at least one camera, such as an omnidirectional or multi-view camera. One example of an omnidirectional multi-view camera is the Nokia OZO camera, which can generate 360-degree panoramic images in multiple planes. In the case of an omnidirectional multi-view camera, images from the camera may be input to the CNN 299 and/or the MCCNN 400 in order to generate a density map and a corresponding crowd count for each image. To further illustrate, the OZO camera may include a plurality of cameras, and the images from each of these cameras may be processed to enable segmentation and/or to determine a density map from which crowd counts may be determined. Referring to FIG. 4, the images from each of the OZO's cameras may be input into a separate CNN of the MCCNN, and the output density maps may then be combined to form an aggregate density map 499.
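For illustration only, the following sketch processes the image from each camera of a multi-view rig and merges the resulting density maps; it ignores overlap between views and assumes `segmenter` and `counter` are stand-ins for trained models such as CNN 299 and MCCNN 400.

```python
import torch

def aggregate_density_maps(camera_images, segmenter, counter):
    """Process the image from each camera of a multi-view rig (e.g. an
    omnidirectional camera) and merge the resulting density maps.
    `segmenter` and `counter` stand in for the trained CNN 299 and MCCNN 400."""
    maps = []
    for image in camera_images:
        segments = segmenter(image)           # depth-based segments of this view
        maps.append(counter(segments))        # density map for this view
    aggregate = torch.stack(maps, dim=0).sum(dim=0)
    return aggregate, aggregate.sum().item()  # combined map and total count
```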
FIG. 5A depicts a process flow for training a machine learning model, such as CNN 299, to learn how to generate depth information, such as a depth map, to enable image segmentation, according to some example embodiments. The description of FIG. 5A refers to FIGS. 1 and 2B.
According to some example embodiments, at 502, at least one reference image marked with depth information may be received. For example, CNN 299 may receive a reference image with a label indicating the depth of each image. To further illustrate, each reference image may have a corresponding depth map and/or a location of a segment within the image. Objects in the segments in the reference image(s) may be about the same distance from the camera and thus have about the same size to enable filtering with the same size filter.
According to some example embodiments, at 504, the machine learning model may be trained to learn based on the received reference images. For example, the CNN 299 may be trained based on the received images to learn how to generate depth information (such as a depth map), the locations of the segments of the received reference images, and/or the size of the objects (or the filter size) for each segment. The training may be iterative, using gradient descent or the like. According to some example embodiments, when the CNN has been trained, the configuration of the CNN (e.g., weight values, activation values, basis functions, etc.) may be saved to a storage device at 506. The saved configuration represents a trained CNN that may be used in the operational phase to determine depth information (such as a depth map and segments) for images other than the reference images, and/or the size of the objects (or the filter size) for each segment.
FIG. 5B depicts a process flow for training a machine learning model, such as the MCCNN 400, to provide object count information, according to some example embodiments. The description of FIG. 5B refers to FIGS. 1 and 4.
According to some example embodiments, at 512, at least one reference image labeled with density information may be received. For example, the MCCNN 400 may receive reference images with labels indicating the segments in each image, as well as the density of objects in each segment, such as persons/heads per square meter, an object count, etc. For example, the reference image 100 (FIG. 4) may be segmented a priori, and each segment may have a corresponding density map to enable training. Furthermore, each segment may have objects of approximately the same size (with respect to viewing angle), so a given filter may be used on the objects in the corresponding segment.
According to some example embodiments, at 514, the machine learning model may be trained to learn to determine a density map. For example, the MCCNN 400 may be trained to learn how to generate object density information, such as density maps, counts, etc., based on the received reference images. In some example embodiments, each column CNN 405A-C of the MCCNN may be trained using a first convolutional layer having a specially selected filter to account for the perspective effects caused by the size of objects in the region processed by that column. According to some example embodiments, when the MCCNN has been trained, at 516, the configuration of the MCCNN (e.g., weight values, activation values, basis functions, etc.) may be saved to a storage device. The saved configuration represents a trained MCCNN that can be used in the operational phase to determine density information (such as density maps) for images other than the reference images.
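For illustration only, a compact training sketch for the multi-column model is shown below, with reference samples given as (segment tensors, ground-truth density map) pairs and a mean squared error between predicted and reference density maps; the optimizer, learning rate, and file name are assumptions.

```python
import torch
import torch.nn as nn

def train_mccnn(mccnn, reference_data, epochs=20, lr=1e-4):
    """Training sketch for the multi-column model: each reference sample is a
    (list of segment tensors, ground-truth density map) pair, and the loss is
    the mean squared error between predicted and reference density maps."""
    optimizer = torch.optim.Adam(mccnn.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for segments, gt_density in reference_data:
            optimizer.zero_grad()
            pred_density = mccnn(segments)
            loss = criterion(pred_density, gt_density)
            loss.backward()
            optimizer.step()
    torch.save(mccnn.state_dict(), "trained_mccnn.pt")  # saved configuration (cf. 516)
```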
FIG. 5C depicts a process flow for a trained machine learning model in the operational phase, according to some example embodiments. The description of FIG. 5C refers to FIGS. 1 and 2B.
According to some example embodiments, at 522, at least one image may be received by a trained machine learning model. For example, the trained CNN 299 may receive at least one image 100 for which an estimate of the object count is required. According to some example embodiments, at 524, the trained CNN may process the at least one input image 100 to determine depth information, which may be in the form of a depth map, and/or an indication of where the at least one image should be segmented. The depth information may also indicate the size of the objects in the segment(s) and/or a corresponding filter size for the segment(s). According to some example embodiments, at 526, the trained machine learning model, such as the trained CNN 299, may output the depth information to another machine learning model, such as the MCCNN 400.
FIG. 5D depicts a process flow for a trained machine learning model in the operational phase, according to some example embodiments. The description of FIG. 5D refers to FIGS. 1 and 4.
According to some example embodiments, at 532, at least one image may be received by a trained machine learning model. For example, the trained MCCNN 400 may receive at least one image. Furthermore, the image may be received with depth information to enable segmentation of the image 100 into multiple portions. In the example of FIG. 4, the image 100 is segmented into three portions 298A-C, although other numbers of segments may be used. Furthermore, the depth information may enable the MCCNN to select filter sizes at 410A-C to handle the object sizes found in each segment 298A-C.
According to some example embodiments, at 534, each segmented region 298A-C may be processed by a CNN 405A-C of the MCCNN 400. In particular, the image may be segmented based on the depth information to account for perspective-induced size differences. This enables each of the CNNs 405A-C to have a filter whose size is better suited to the objects (such as heads, people, etc.) in the corresponding segment processed by that CNN. For example, CNN 405A processes objects in the background (which may look smaller due to the viewing angle), so the filter of its convolutional layer 410A is, for example, a 3×3 matrix to accommodate heads and/or people of relatively small size. As described above, the size of the filter (3×3 in this example) may be selected to pass the object of interest (a person in this example). In contrast, CNN 405C processes objects in the foreground (which look larger due to the perspective), so the filter of its convolutional layer 410C is, for example, a 7×7 matrix to accommodate heads and/or people of relatively large size.
According to some example embodiments, at 536, the trained machine learning model may generate a density map for each segmented region of the image. As shown in FIG. 4, each column of CNNs 405A-C generates a density map 498A-C.
According to some example embodiments, at 538, the trained machine learning model may combine the density maps of the regions to form a density map of the entire image received at the input. For example, the MCCNN 400 may combine the density maps 498A-C into the density map 499, which represents the density map of the entire image 100.
According to some example embodiments, at 540, the trained machine learning model may output an indication of the object count. For example, the MCCNN 400 may output the density map 499, or may further process the density map to provide a count for the entire image (such as a person count) or a count for a portion of the image.
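For illustration only, the operational flow of FIGS. 5C and 5D can be sketched end to end: estimate a depth map, split the image into depth-based segments, run each segment through its column, combine the density maps, and report a count. The thresholds, masking strategy, and helper names reuse the earlier sketches and are assumptions, not details of the patent.

```python
import torch
import torch.nn.functional as F

def count_objects(image, depth_cnn, mccnn, thresholds=(0.33, 0.66)):
    """End-to-end sketch of the operational phase: estimate a depth map,
    split the image into three depth-based segments, run each segment through
    its own column of the multi-column model, combine the per-segment density
    maps, and report an object count for the whole image."""
    near, far = thresholds
    depth = depth_cnn(image)                               # depth map (cf. 524)
    gray = image.mean(dim=1, keepdim=True)                 # single-channel input
    depth = F.interpolate(depth, size=gray.shape[-2:], mode="nearest")
    masks = [depth >= far,                                 # far segment (cf. 298A)
             (depth >= near) & (depth < far),              # middle segment (cf. 298B)
             depth < near]                                 # near segment (cf. 298C)
    segments = [gray * m.float() for m in masks]
    density_map = mccnn(segments)                          # combined map (cf. 538)
    return density_map, density_map.sum().item()           # count output (cf. 540)

# Example usage with the sketches above (names are illustrative):
# depth_cnn = DepthSegmentationCNN(); mccnn = MultiColumnCountingCNN()
# dm, count = count_objects(torch.randn(1, 3, 240, 320), depth_cnn, mccnn)
```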
FIG. 6 depicts a block diagram illustrating a computing system 600, according to some example embodiments. According to some example embodiments, the computing system 600 may be used to implement a machine learning model, such as CNN 200, CNN 299, MCCNN 400, etc., disclosed herein, including with respect to FIGS. 5A-5D, to perform counting of objects in an image. For example, according to some example embodiments, the system 600 may include or be included in a device such as a mobile phone, a smartphone, a camera (e.g., an OZO camera, a closed-circuit television camera, a webcam), an unmanned aerial vehicle, an autonomous vehicle, an automobile, and/or an Internet of Things (IoT) sensor (such as a traffic sensor, an industrial sensor, etc.) to enable counting of objects.
As shown in fig. 6, computing system 600 may include a processor 610, a memory 620, a storage device 630, an input/output device 640, and/or a camera 660 (which may be used to capture images including objects to be counted, according to some example embodiments). The processor 610, memory 620, storage 630, and input/output devices 640 may be interconnected via a system bus 650. The processor 610 may be capable of processing instructions for execution within the computing system 600. Such executed instructions may implement one or more aspects of a machine learning model, such as CNN 200, CNN 299, MCCNN 400, and the like. The processor 610 may be capable of processing instructions stored in the memory 620 and/or the storage device 630 to display graphical information for a user interface provided via the input/output device 640. Memory 620 may be a computer-readable medium, such as volatile or non-volatile media that stores information within computing system 600. Memory 620 may store instructions, such as computer program code. The storage device 630 may be capable of providing persistent storage for the computing system 600. Storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage mechanism. The input/output devices 640 provide input/output operations for the computing system 600. In some example embodiments, the input/output device 640 includes a keyboard and/or a pointing device. In various implementations, the input/output device 640 includes a display unit for displaying graphical user interfaces. Alternatively or additionally, the input/output device 640 may include wireless and/or wired interfaces to enable communication with other devices, such as other network nodes. For example, the input/output devices 640 may include an ethernet interface, a WiFi interface, a cellular interface, and/or other wired and/or wireless interfaces to allow communication with one or more wired and/or wireless networks and/or devices.
FIG. 7 illustrates a block diagram of an apparatus 10, according to some example embodiments. The apparatus 10 may represent a user device such as a wireless device, examples of which include a smartphone, a tablet, and the like. According to some example embodiments, the apparatus 10 may be used to implement a machine learning model, such as CNN 200, CNN 299, MCCNN 400, etc., disclosed herein, including with respect to FIGS. 5A-5D, to perform counting of objects in an image. Furthermore, the apparatus 10 may include a camera 799, and the processor 20 may include a GPU or other special-purpose processor to handle the processing of the machine learning model. Similar to the system of FIG. 6, according to some example embodiments, the apparatus 10 may include or be included in a device such as a mobile phone, a smartphone, a camera (e.g., an OZO camera, a closed-circuit television camera, a webcam), an unmanned aerial vehicle, an autonomous vehicle, an automobile, and/or an Internet of Things (IoT) sensor (such as a traffic sensor, an industrial sensor, etc.) to enable counting of objects.
The apparatus 10 may include at least one antenna 12 in communication with a transmitter 14 and a receiver 16. Alternatively, the transmit antenna and the receive antenna may be separate. The apparatus 10 may also include a processor 20, the processor 20 being configured to provide signals to and receive signals from the transmitter and receiver, respectively, and to control the functions of the apparatus. The processor 20 may be configured to control the functions of the transmitter and receiver by implementing control signaling via electrical leads to the transmitter and receiver. Likewise, the processor 20 may be configured to control other elements of the apparatus 10 by implementing control signaling via electrical leads connecting the processor 20 to the other elements, such as a display or memory. The processor 20 may be implemented in a variety of ways, including, for example, circuitry, at least one processing core, one or more microprocessors with accompanying digital signal processor(s), one or more processors without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits (e.g., Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), etc.), or some combination thereof. Thus, although shown as a single processor in FIG. 7, in some example embodiments, the processor 20 may include multiple processors or processing cores.
The apparatus 10 may be capable of operating with one or more air interface standards, communication protocols, modulation types, access types, and/or the like. The signals transmitted and received by the processor 20 may include signaling information in accordance with the air interface standard of the applicable cellular system and/or any number of different wired or wireless networking technologies, including, but not limited to, Wi-Fi, wireless local area network (WLAN) technologies such as Institute of Electrical and Electronics Engineers (IEEE) 802.11, 802.16, 802.3, ADSL, DOCSIS, and the like. In addition, these signals may include voice data, user-generated data, user-requested data, and the like.
For example, the apparatus 10 and/or a cellular modem therein may be capable of operating in accordance with various first generation (1G) communication protocols, second generation (2G or 2.5G) communication protocols, third generation (3G) communication protocols, fourth generation (4G) communication protocols, fifth generation (5G) communication protocols, Internet Protocol Multimedia Subsystem (IMS) communication protocols (e.g., Session Initiation Protocol (SIP)), and/or the like. For example, the apparatus 10 may be capable of operating in accordance with 2G wireless communication protocols such as IS-136 (Time Division Multiple Access (TDMA)), GSM (Global System for Mobile communications), IS-95 (Code Division Multiple Access (CDMA)), and the like. In addition, for example, the apparatus 10 may be capable of operating in accordance with 2.5G wireless communication protocols such as General Packet Radio Service (GPRS), Enhanced Data GSM Environment (EDGE), and the like. In addition, for example, the apparatus 10 may be capable of operating in accordance with a 3G wireless communication protocol, such as Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), and the like. Additionally, the apparatus 10 may be capable of operating in accordance with a 3.9G wireless communication protocol, such as Long Term Evolution (LTE), Evolved Universal Terrestrial Radio Access Network (E-UTRAN), and the like. Additionally, for example, the apparatus 10 may be capable of operating in accordance with 4G wireless communication protocols, such as LTE-Advanced, 5G, and the like, as well as similar wireless communication protocols that may be developed subsequently.
It should be appreciated that the processor 20 may include circuitry for implementing audio/video and logic functions of the apparatus 10. For example, the processor 20 may include a digital signal processor device, a microprocessor device, an analog-to-digital converter, a digital-to-analog converter, and the like. The control and signal processing functions of the apparatus 10 may be distributed among these devices according to their respective capabilities. The processor 20 may additionally include an internal Voice Coder (VC) 20a, an internal Data Modem (DM) 20b, and the like. Further, the processor 20 may include functionality for operating one or more software programs, which may be stored in memory. In general, the processor 20 and stored software instructions may be configured to cause the apparatus 10 to perform actions. For example, the processor 20 may be capable of operating a connectivity program, such as a web browser. The connectivity program may allow the apparatus 10 to transmit and receive network content, such as location-based content, according to a protocol, such as the wireless application protocol WAP, hypertext transfer protocol HTTP and/or the like.
The apparatus 10 may also include a user interface including, for example, a headset or speaker 24, a ringer 22, a microphone 26, a display 28, a user input interface, and/or the like, which may be operatively coupled to the processor 20. As described above, the display 28 may include a touch-sensitive display on which a user may touch and/or gesture to make selections, enter values, and so forth. The processor 20 may also include user interface circuitry configured to control at least some functions of one or more elements of the user interface, such as the speaker 24, ringer 22, microphone 26, display 28, and/or the like. The processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more elements of the user interface through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 20 (e.g., the volatile memory 40, the non-volatile memory 42, etc.). The apparatus 10 may include a battery for powering various circuits related to the mobile terminal (e.g., a circuit that provides mechanical vibration as a detectable output). The user input interface may include devices that allow the apparatus 10 to receive data, such as a keypad 30 (which may be a virtual keyboard presented on the display 28, or an externally coupled keyboard) and/or other input devices.
As shown in FIG. 7, the apparatus 10 may also include one or more mechanisms for sharing and/or acquiring data. For example, the apparatus 10 may include a short-range radio frequency (RF) transceiver and/or interrogator 64, so that data may be shared with and/or obtained from electronic devices in accordance with RF techniques. The apparatus 10 may include other short-range transceivers, such as an infrared (IR) transceiver 66, a Bluetooth™ (BT) transceiver 68 operating using Bluetooth™ wireless technology, a wireless universal serial bus (USB) transceiver 70, a Bluetooth™ Low Energy transceiver, a ZigBee transceiver, an ANT transceiver, a cellular device-to-device transceiver, a wireless local area link transceiver, and/or any other short-range radio technology. For example, the apparatus 10, and in particular the short-range transceiver, may be capable of transmitting data to and/or receiving data from electronic devices in the vicinity of the apparatus, such as within 10 meters. The apparatus 10, including a Wi-Fi or wireless local area network modem, may also be capable of transmitting and/or receiving data from electronic devices according to various wireless networking technologies, including 6LoWPAN, Wi-Fi, Wi-Fi Low Power, and WLAN technologies such as IEEE 802.11 technologies, IEEE 802.15 technologies, IEEE 802.16 technologies, and the like.
The apparatus 10 may include a memory that may store information elements related to a mobile subscriber, such as a Subscriber Identity Module (SIM) 38, a removable user identity module (R-UIM), an eUICC, a UICC, and so forth. In addition to the SIM, the apparatus 10 may include other removable and/or fixed memory. The apparatus 10 may include volatile memory 40 and/or non-volatile memory 42. For example, the volatile memory 40 may include Random Access Memory (RAM), including dynamic and/or static RAM, on-chip or off-chip cache memory, and the like. The non-volatile memory 42, which may be embedded and/or removable, may include, for example, read-only memory, flash memory, magnetic storage devices (e.g., hard disk, floppy disk drive, magnetic tape), optical disk drives and/or media, non-volatile random access memory (NVRAM), and the like. Like the volatile memory 40, the non-volatile memory 42 may include a cache area for temporary storage of data. At least a portion of the volatile and/or non-volatile memory may be embedded in the processor 20. The memory may store one or more software programs, instructions, information, data, etc., that may be used by the apparatus to perform the operations disclosed herein, including, for example: processing, by the trained machine learning model, a first segment of the image and a second segment of the image, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map of the image; providing, by the trained machine learning model, an output based on the density map, the output representing an estimate of the number of objects in the image; and/or other aspects disclosed herein with respect to the CNN, MCCNN 400, etc., for counting objects in the image.
The memories may include an identifier, such as an International Mobile Equipment Identification (IMEI) code, capable of uniquely identifying the apparatus 10. In an example embodiment, the processor 20 may be configured to control and/or provide one or more aspects disclosed herein (e.g., see processes 600, 700, and/or other operations disclosed herein) using computer code stored at the memory 40 and/or 42. For example, the processor 20 may be configured, using computer code stored at the memory 40 and/or 42, to perform at least, for example: processing, by the trained machine learning model, a first segment of the image and a second segment of the image, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map; combining, by the trained machine learning model, the first density map and the second density map to form a density map of the image; providing, by the trained machine learning model, an output based on the density map; and/or other aspects disclosed herein with respect to the CNN, MCCNN 400, etc., for counting objects in an image.
Some embodiments disclosed herein may be implemented in software, hardware, application logic, or a combination of software, hardware, and application logic. For example, the software, application logic, and/or hardware may reside on the memory 40, the control device 20, or electronic components. In some example embodiments, the application logic, software, or instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "computer-readable medium" may be any non-transitory medium that can contain, store, communicate, propagate, or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device (such as a computer or data processor circuitry), an example of which is depicted in FIG. 7. A computer-readable medium may comprise a non-transitory computer-readable storage medium, which may be any medium that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device (such as a computer).
The subject matter described herein may be implemented in systems, devices, methods, and/or articles of manufacture, depending on the desired configuration. For example, the base stations and user equipment (or one or more components thereof) and/or processes described herein may be implemented using one or more of the following: a processor executing program code, an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), an embedded processor, a Field Programmable Gate Array (FPGA), and/or a combination thereof. These various implementations may include implementations in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. These computer programs (also known as programs, software applications, components, program code or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term "computer-readable medium" refers to any computer program product, machine-readable medium, computer-readable storage medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions. Similarly, systems are also described herein that may include a processor and a memory coupled to the processor. The memory may include one or more programs that cause the processor to perform one or more of the operations described herein.
Although a few variations have been described in detail above, other modifications or additions are possible. In particular, additional features and/or variations may be provided in addition to those set forth herein. Furthermore, implementations described above may involve various combinations and sub-combinations of features disclosed and/or combinations and sub-combinations of several additional features disclosed above. Other embodiments may be within the scope of the following claims.
The different functions discussed herein may be performed in a different order and/or concurrently with each other, if desired. Furthermore, one or more of the above-described functions may be optional or may be combined, if desired. Although various aspects of some embodiments are set out in the independent claims, other aspects of some embodiments comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It should also be noted that while the above describes example embodiments, these descriptions should not be interpreted in a limiting sense. Rather, several variations and modifications may be made without departing from the scope of some embodiments as defined in the appended claims. The term "based on" includes "based at least on". Unless otherwise indicated, use of the phrase "such as" means "such as, for example".

Claims (23)

1. A method of processing an image, comprising:
receiving, by a trained machine learning model, an image comprising a plurality of objects, wherein the image is segmented into at least a first segment and a second segment based on depth information;
processing, by the trained machine learning model, the first segment of the image and the second segment of the image, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map;
combining the first density map and the second density map by the trained machine learning model to form a density map for the image; and
providing, by the trained machine learning model, an output based on the density map, the output representing an estimate of a number of objects in the image.
2. The method of claim 1, wherein the depth information is received from another machine learning model trained to output the depth information from the image.
3. The method of claim 1, wherein the trained machine learning model comprises a multi-column convolutional neural network comprising a first convolutional neural network and a second convolutional neural network.
4. The method of claim 3, wherein the first convolutional neural network comprises the first filter, wherein the second convolutional neural network comprises the second filter, and wherein the first filter and the second filter each comprise a convolutional layer.
5. The method of any of claims 1 to 4, wherein the depth information indicates a position of the first segment and/or the second segment.
6. The method of any of claims 1 to 4, wherein the depth information indicates an object size due to a distance from a camera, and/or wherein the depth information indicates a first filter size of the first filter and a second filter size of the second filter.
7. The method of claim 6, further comprising:
selecting, by the trained machine learning model and based on the depth information, the first filter size of the first filter and the second filter size of the second filter.
8. The method of any one of claims 1 to 4, further comprising:
training the machine learning model based on a reference image to learn generation of the density map.
9. The method of any one of claims 1 to 4, wherein the plurality of objects comprises a plurality of people, a plurality of vehicles, and/or a group of people.
10. The method of any of claims 1-4, wherein the first density map estimates a density of objects in the first segment, wherein the second density map estimates a density of objects in the second segment, and wherein the density map estimates a density of objects in the image.
11. An apparatus for processing an image, comprising:
at least one processor; and
at least one memory including program code that, when executed, causes the apparatus at least to:
receive, by a trained machine learning model, an image comprising a plurality of objects, wherein the image is segmented into at least a first segment and a second segment based on depth information;
process, by the trained machine learning model, the first segment of the image and the second segment of the image, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map;
combine, by the trained machine learning model, the first density map and the second density map to form a density map for the image; and
provide, by the trained machine learning model, an output based on the density map, the output representing an estimate of a number of objects in the image.
12. The apparatus of claim 11, wherein the depth information is received from another machine learning model trained to output the depth information from the image.
13. The apparatus of claim 11, wherein the trained machine learning model comprises a multi-column convolutional neural network comprising a first convolutional neural network and a second convolutional neural network.
14. The apparatus of claim 13, wherein the first convolutional neural network comprises the first filter, wherein the second convolutional neural network comprises the second filter, and wherein the first filter and the second filter each comprise a convolutional layer.
15. The apparatus of any of claims 11 to 14, wherein the depth information indicates a position of the first segment and/or the second segment.
16. The apparatus of any of claims 11 to 14, wherein the depth information indicates an object size due to a distance from a camera, and/or wherein the depth information indicates a first filter size of the first filter and a second filter size of the second filter.
17. The apparatus of claim 16, wherein the apparatus is further caused to at least:
select, by the trained machine learning model and based on the depth information, the first filter size of the first filter and the second filter size of the second filter.
18. The apparatus of any of claims 11 to 14, wherein the apparatus is further caused to at least:
train the machine learning model based on a reference image to learn generation of the density map.
19. The apparatus of any of claims 11 to 14, wherein the plurality of objects comprises a plurality of persons, a plurality of vehicles, and/or a group of persons.
20. The apparatus of any of claims 11 to 14, wherein the first density map estimates a density of objects in the first segment, wherein the second density map estimates a density of objects in the second segment, and wherein the density map estimates a density of objects in the image.
21. An apparatus for processing an image, comprising:
means for receiving an image comprising a plurality of objects through a trained machine learning model, wherein the image is segmented into at least a first segment and a second segment based on depth information;
means for processing the first segment of the image and the second segment of the image by the trained machine learning model, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map;
means for combining the first density map and the second density map by the trained machine learning model to form a density map for the image; and
means for providing an output based on the density map through the trained machine learning model, the output representing an estimate of a number of objects in the image.
22. The apparatus of claim 21, further comprising means for performing the method of any of claims 2 to 10.
23. A non-transitory computer-readable medium comprising program code that, when executed, causes operations comprising:
receiving, by a trained machine learning model, an image comprising a plurality of objects, wherein the image is segmented into at least a first segment and a second segment based on depth information;
processing, by the trained machine learning model, the first segment of the image and the second segment of the image, the first segment being processed using a first filter selected based on the depth information to enable formation of a first density map, and the second segment being processed using a second filter selected based on the depth information to enable formation of a second density map;
combining the first density map and the second density map by the trained machine learning model to form a density map for the image; and
providing, by the trained machine learning model, an output based on the density map, the output representing an estimate of a number of objects in the image.
CN201780096479.5A 2017-11-01 2017-11-01 Depth aware object counting Active CN111295689B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/108952 WO2019084854A1 (en) 2017-11-01 2017-11-01 Depth-aware object counting

Publications (2)

Publication Number Publication Date
CN111295689A CN111295689A (en) 2020-06-16
CN111295689B true CN111295689B (en) 2023-10-03

Family ID: 66331257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780096479.5A Active CN111295689B (en) 2017-11-01 2017-11-01 Depth aware object counting

Country Status (4)

Country Link
US (1) US11270441B2 (en)
EP (1) EP3704558A4 (en)
CN (1) CN111295689B (en)
WO (1) WO2019084854A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110235146A (en) * 2017-02-03 2019-09-13 西门子股份公司 Method and apparatus for the object of interest in detection image
US11048948B2 (en) * 2019-06-10 2021-06-29 City University Of Hong Kong System and method for counting objects
CN110866476B (en) * 2019-11-06 2023-09-01 南京信息职业技术学院 Dense stacking target detection method based on automatic labeling and transfer learning
US11393182B2 (en) 2020-05-29 2022-07-19 X Development Llc Data band selection using machine learning
CN111652168B (en) * 2020-06-09 2023-09-08 腾讯科技(深圳)有限公司 Group detection method, device, equipment and storage medium based on artificial intelligence
CN111815665B (en) * 2020-07-10 2023-02-17 电子科技大学 Single image crowd counting method based on depth information and scale perception information
US11606507B1 (en) 2020-08-28 2023-03-14 X Development Llc Automated lens adjustment for hyperspectral imaging
US11651602B1 (en) 2020-09-30 2023-05-16 X Development Llc Machine learning classification based on separate processing of multiple views
CN113240650A (en) * 2021-05-19 2021-08-10 中国农业大学 Fry counting system and method based on deep learning density map regression
US11809521B2 (en) * 2021-06-08 2023-11-07 Fanuc Corporation Network modularization to learn high dimensional robot tasks
US12017355B2 (en) * 2021-06-08 2024-06-25 Fanuc Corporation Grasp learning using modularized neural networks
US11995842B2 (en) 2021-07-22 2024-05-28 X Development Llc Segmentation to improve chemical analysis
US12033329B2 (en) 2021-07-22 2024-07-09 X Development Llc Sample segmentation
US20230196782A1 (en) * 2021-12-17 2023-06-22 At&T Intellectual Property I, L.P. Counting crowds by augmenting convolutional neural network estimates with fifth generation signal processing data
WO2023231021A1 (en) * 2022-06-02 2023-12-07 深圳市正浩创新科技股份有限公司 Target object collection method and device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183766A1 (en) * 2015-05-18 2016-11-24 Xiaogang Wang Method and apparatus for generating predictive models
CN106650913A (en) * 2016-12-31 2017-05-10 中国科学技术大学 Deep convolution neural network-based traffic flow density estimation method

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139409B2 (en) 2000-09-06 2006-11-21 Siemens Corporate Research, Inc. Real-time crowd density estimation from video
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection
US8184175B2 (en) * 2008-08-26 2012-05-22 Fpsi, Inc. System and method for detecting a camera
CN102521646B (en) 2011-11-11 2015-01-21 浙江捷尚视觉科技股份有限公司 Complex scene people counting algorithm based on depth information cluster
US9247211B2 (en) * 2012-01-17 2016-01-26 Avigilon Fortress Corporation System and method for video content analysis using depth sensing
GB2505501B (en) 2012-09-03 2020-09-09 Vision Semantics Ltd Crowd density estimation
US10009579B2 (en) 2012-11-21 2018-06-26 Pelco, Inc. Method and system for counting people using depth sensor
CN105654021B (en) 2014-11-12 2019-02-01 株式会社理光 Method and apparatus of the detection crowd to target position attention rate
JP6494253B2 (en) * 2014-11-17 2019-04-03 キヤノン株式会社 Object detection apparatus, object detection method, image recognition apparatus, and computer program
US9613255B2 (en) * 2015-03-30 2017-04-04 Applied Materials Israel Ltd. Systems, methods and computer program products for signature detection
KR101788269B1 (en) * 2016-04-22 2017-10-19 주식회사 에스원 Method and apparatus for sensing innormal situation
US10152630B2 (en) * 2016-08-09 2018-12-11 Qualcomm Incorporated Methods and systems of performing blob filtering in video analytics
US10055669B2 (en) * 2016-08-12 2018-08-21 Qualcomm Incorporated Methods and systems of determining a minimum blob size in video analytics
CN106778502B (en) * 2016-11-21 2020-09-22 华南理工大学 Crowd counting method based on deep residual error network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016183766A1 (en) * 2015-05-18 2016-11-24 Xiaogang Wang Method and apparatus for generating predictive models
CN106650913A (en) * 2016-12-31 2017-05-10 中国科学技术大学 Deep convolution neural network-based traffic flow density estimation method

Also Published As

Publication number Publication date
US20200242777A1 (en) 2020-07-30
WO2019084854A1 (en) 2019-05-09
US11270441B2 (en) 2022-03-08
CN111295689A (en) 2020-06-16
EP3704558A1 (en) 2020-09-09
EP3704558A4 (en) 2021-07-07

Similar Documents

Publication Publication Date Title
CN111295689B (en) Depth aware object counting
KR102529574B1 (en) Semantic Segmentation with Soft Cross-Entropy Loss
WO2021238366A1 (en) Neural network construction method and apparatus
US11288507B2 (en) Object detection in image based on stochastic optimization
JP6488380B2 (en) Object detection by neural network
CN111652114B (en) Object detection method and device, electronic equipment and storage medium
US11301754B2 (en) Sharing of compressed training data for neural network training
US10872275B2 (en) Semantic segmentation based on a hierarchy of neural networks
JP2017538999A5 (en)
JP6789876B2 (en) Devices, programs and methods for tracking objects using pixel change processed images
CN111161307A (en) Image segmentation method and device, electronic equipment and storage medium
CN110188627B (en) Face image filtering method and device
CN112668532B (en) Crowd counting method based on multi-stage mixed attention network
CN111553947A (en) Target object positioning method and device
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN112926461B (en) Neural network training and driving control method and device
WO2018152741A1 (en) Collaborative activation for deep learning field
US12062243B2 (en) Distracted driving detection using a multi-task training process
CN112101114A (en) Video target detection method, device, equipment and storage medium
CN112840347B (en) Method, apparatus and computer readable medium for object detection
US11417125B2 (en) Recognition of license plate numbers from Bayer-domain image data
CN108846420B (en) Network structure and client
EP4224438A1 (en) Apparatus, method, and system for a visual object tracker
Trung Estimation of Crowd Density Using Image Processing Techniques with Background Pixel Model and Visual Geometry Group
CN111126177A (en) People counting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant