WO2023176562A1 - Information processing apparatus, information processing method, and information processing program - Google Patents

Information processing apparatus, information processing method, and information processing program Download PDF

Info

Publication number
WO2023176562A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
information processing
extraction
tracking target
statistical information
Prior art date
Application number
PCT/JP2023/008411
Other languages
French (fr)
Japanese (ja)
Inventor
雄二 永松
Original Assignee
ソニーグループ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニーグループ株式会社 (Sony Group Corporation)
Publication of WO2023176562A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules

Definitions

  • The present disclosure relates to an information processing device, an information processing method, and an information processing program.
  • A technique for recognizing surrounding people and objects is disclosed in Patent Document 1, for example.
  • In Patent Document 1, a rectangular filter is used when extracting a person or object from an acquired surrounding image.
  • When a rectangular filter is used, however, a large amount of background image that becomes noise is included, and, for example, the accuracy of identifying the same tracking target between frames (tracking accuracy) may be reduced. It is therefore desirable to provide an information processing device, an information processing method, and an information processing program that can keep noise low.
  • An information processing device includes an object detection section and an object image extraction section.
  • The object detection unit generates statistical information of a pixel region of a tracking target included in an input image.
  • The object image extraction unit determines an extraction region of the tracking target based on the statistical information, and extracts an object image from the input image as an image of the extraction region.
  • An information processing method includes the following two steps: (A1) generating statistical information of a pixel region of a tracking target included in an input image; and (A2) determining an extraction region of the tracking target based on the statistical information, and extracting an object image from the input image as an image of the extraction region.
  • An information processing program causes a computer to execute the following two steps:
  • (B1) generating statistical information of a pixel region of a tracking target included in an input image; and
  • (B2) determining an extraction region of the tracking target based on the statistical information, and extracting an object image from the input image as an image of the extraction region.
  • In the information processing device, information processing method, and information processing program, an extraction region of a tracking target is determined based on statistical information of a pixel region of the tracking target included in an input image, and an object image is extracted from the input image as an image of the extraction region.
  • By extracting the image of the tracking target based on its statistical information, the proportion of the background image included in the object image can be kept lower than when the image of the tracking target is extracted using a rectangular filter.
  • FIG. 1 is a diagram illustrating an example of functional blocks of an information processing system according to a first embodiment of the present disclosure.
  • FIG. 2 is a diagram illustrating a method for determining the same object using two object images.
  • FIG. 3 is a diagram illustrating a method of acquiring an object image from an input image.
  • FIG. 4 is a diagram illustrating an example of information processing in the information processing system.
  • FIG. 5 is a diagram showing an example in which an offset is provided in the extraction area.
  • FIG. 6 is a diagram showing an example of the t distribution.
  • FIG. 7 is a diagram illustrating a method for determining identical objects using a histogram.
  • FIG. 8 is a diagram illustrating a modified example of the functional blocks of the information processing system in FIG. 1.
  • FIG. 9 is a diagram illustrating an example of functional blocks of an information processing system according to the second embodiment of the present disclosure.
  • In such a method, a pixel region corresponding to a tracking target is extracted for each frame, and the same tracking target is identified between frames using the degree of matching of the pixel regions as an index.
  • When extracting the pixel region of the tracking target, a method is known in which a rectangular or specifically shaped filter is swept over the image.
  • Such a filter is sometimes called a kernel.
  • FIG. 1 shows an example of functional blocks of an information processing system 1.
  • The information processing system 1 includes a sensor device section 10, an object detection section 20, a storage section 30, an object image extraction section 40, and an object tracking section 50.
  • The object detection section 20, the object image extraction section 40, and the object tracking section 50 include, for example, a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
  • The object detection section 20, the object image extraction section 40, and the object tracking section 50 execute a series of procedures described in a program 33 (described later) by loading various programs stored in the storage section 30 into the CPU.
  • Alternatively, the object detection section 20, the object image extraction section 40, and the object tracking section 50 may be configured with an MPU (Micro-Processing Unit) that executes the respective functions.
  • The sensor device section 10 includes, for example, a sensor element that recognizes the external environment and acquires environmental data corresponding to the recognized external environment.
  • The sensor element outputs the acquired environmental data to the object detection section 20.
  • The sensor element is, for example, an RGB camera, an RGB-D camera, a depth sensor, an infrared sensor, or an event camera.
  • The RGB camera is, for example, a monocular visible-light image sensor that outputs an RGB image obtained by receiving visible light and converting it into an electrical signal.
  • The RGB-D camera is, for example, a binocular visible-light image sensor, and outputs an RGB-D image (an RGB image and a distance image obtained from parallax).
  • The depth sensor is, for example, a ToF (Time of Flight) sensor or a LiDAR (Laser Imaging Detection and Ranging) sensor, and outputs a distance image obtained by measuring scattered light in response to pulsed laser irradiation.
  • The infrared sensor outputs an infrared image obtained by, for example, receiving infrared light and converting it into an electrical signal.
  • The event camera is, for example, a monocular visible-light image sensor, and outputs the difference between RGB images of successive frames (a difference image).
  • The sensor device section 10 outputs various images obtained from the external environment (for example, an RGB image, an RGB-D image, a distance image, an infrared image, or a difference image) as an input image Iin (see FIG. 2(A)).
  • The object detection unit 20 generates statistical information of the pixel region of the tracking target included in the input image Iin obtained by the sensor device unit 10, together with type information of the tracking target.
  • The object detection unit 20 stores the generated statistical information and type information, and the input image Iin, in the storage unit 30.
  • The statistical information generated by the object detection unit 20 includes the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy) of the pixel region of the tracking target (see FIG. 2(B)).
  • The statistical information generated by the object detection section 20 is stored as the statistical information 31 in the storage section 30.
  • The average position (μx, μy) is, for example, the two-dimensional coordinate corresponding to the center position in the X-axis direction and the center position in the Y-axis direction of the pixel region of the tracking target included in the input image Iin.
  • The variance-covariance values (ρxx, ρxy, ρyy) are the variance-covariance values of the pixel region of the tracking target included in the input image Iin.
  • ρxx is the (1, 1) element of the variance-covariance matrix.
  • ρyy is the (2, 2) element of the variance-covariance matrix.
  • ρxy is the (2, 1) and (1, 2) element of the variance-covariance matrix.
  • The type information generated by the object detection unit 20 includes, for example, the name of the tracked object, such as a person or a car.
  • The name of the tracked object roughly indicates characteristics such as the shape and size of the tracked object.
  • The type information generated by the object detection section 20 is stored as the type information 32 in the storage section 30.
  • The object detection unit 20 has a machine learning model such as a neural network.
  • This neural network is, for example, a learning model trained using, as teaching data, a learning image (an image containing a tracking target), the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) of the pixel region of the tracking target included in the learning image, and the type information of the tracking target included in the learning image.
  • When the input image Iin is input, this neural network outputs the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) of the pixel region of the tracking target included in the input image Iin and the type information of that tracking target.
  • The object detection unit 20 uses the neural network to output, from the input image Iin, the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) and the type information of the tracking target.
  • The storage unit 30 is, for example, a recording medium such as a semiconductor memory or a hard disk.
  • The storage unit 30 stores various programs (for example, the program 33).
  • The program 33 describes a series of procedures for realizing the respective functions of the object detection section 20, the object image extraction section 40, and the object tracking section 50.
  • The storage unit 30 also stores various data generated by executing the program 33 (for example, the input image Iin, the object image Iob, the statistical information 31, and the type information 32).
  • The object image extraction unit 40 determines the extraction area PA of the tracking target based on the statistical information generated by the object detection unit 20 and the type information of the tracking target (see FIG. 2(C)). Specifically, the object image extraction unit 40 determines the extraction area PA based on the average position (μx, μy), the variance-covariance values (ρxx, ρxy, ρyy), and the tracking target type information.
  • The extraction area PA has an elliptical shape.
  • The object image extraction unit 40 derives the radii of the ellipse based on the variance-covariance values (ρxx, ρxy, ρyy), and corrects the derived radii based on the tracking target type information.
  • The value derived in this way is a Mahalanobis distance.
  • The ellipse whose radius is the Mahalanobis distance has a spread corresponding to the pixel distribution of the tracking target (see FIG. 2(C)). Therefore, by using such an elliptical filter, it is possible to obtain an image (object image Iob) in which the background image that becomes noise is suppressed (see FIG. 2(D)).
  • For example, when the tracking target is a car, the car has an overall rounded, egg-like shape, so the car can be surrounded by an ellipse regardless of the direction from which it is viewed. Therefore, by determining an ellipse with a Mahalanobis distance of 3 as the extraction area PA, approximately 99.7% of the pixel region of the tracking target can be covered by the extraction area PA while suppressing the mixing-in of noise.
  • On the other hand, when the tracking target is a person, if an ellipse with a Mahalanobis distance of 3 is determined as the extraction area PA, the extraction area PA also covers the ends of the person's limbs, so a large amount of noise may be mixed in and hinder target tracking. By changing the Mahalanobis-distance setting according to the type of the tracked object, optimal tracking can be performed for each type of target.
  • For example, when the tracking target type is a person, the object image extraction unit 40 corrects the ellipse radii derived from the variance-covariance values (ρxx, ρxy, ρyy) to be smaller.
  • When the tracking target type is a passenger car, the object image extraction unit 40 corrects the ellipse radii derived from the variance-covariance values (ρxx, ρxy, ρyy) to be larger.
  • The object image extraction unit 40 outputs the values derived in this way as the ellipse radii (rx, ry) of the extraction area PA.
  • The derived ellipse radii (rx, ry) are stored, together with the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy), as the statistical information 31 in the storage unit 30 in association with the identifier of the tracked object.
  • The object image extraction unit 40 further extracts an object image Iob from the input image Iin as the image of the extraction area PA (see FIG. 2(D)). That is, the object image extraction unit 40 extracts the object image Iob from the input image Iin based on the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy).
  • The object tracking unit 50 tracks the tracking target included in the input image Iin. To perform such tracking, the object tracking unit 50 identifies the same tracking target between frames. Specifically, the object tracking unit 50 determines whether the same tracking target is included in a plurality of input images Iin obtained at different times.
  • The object tracking unit 50 determines whether the two input images Iin(t) and Iin(t+1) include the same tracking target based on the degree of matching. If the same tracking target is included in the two input images Iin(t) and Iin(t+1), the object tracking unit 50 outputs, for example, the object image Iob(t+1) containing the tracking target and the average position (μx(t+1), μy(t+1)) of the pixel region of the tracking target to the outside.
  • FIG. 3 shows an example of information processing in the information processing system 1.
  • First, the sensor device section 10 acquires the input image Iin (step S101).
  • Next, the object detection unit 20 generates statistical information of the pixel region of the tracking target included in the input image Iin obtained by the sensor device unit 10, together with type information of the tracking target (step S102).
  • Next, the object image extraction unit 40 determines the extraction area PA of the pixels of the tracking target based on the statistical information generated by the object detection unit 20 and the type information of the tracking target (step S103).
  • Next, the object image extraction unit 40 extracts the object image Iob from the input image Iin as the image of the extraction area PA (step S104).
  • Next, the object tracking unit 50 identifies the same tracking target between frames. Specifically, the object tracking unit 50 determines whether the same tracking target is included in a plurality of input images Iin obtained at different times (step S105). If the same tracking target is included in the plurality of input images Iin (step S105; Y), the object tracking unit 50 tracks the tracking target (step S106); for example, it outputs the object image Iob including the tracking target and the average position (μx, μy) of the pixel region of the tracking target to the outside. On the other hand, if the same tracking target is not included in the plurality of input images Iin (step S105; N), the object tracking unit 50 ends tracking of the tracking target (step S107); for example, it outputs an error code to the outside.
  • In the present embodiment, the extraction area PA of the tracking target is determined based on the statistical information of the pixel region of the tracking target included in the input image Iin, and the object image Iob is extracted from the input image Iin as the image of the extraction area PA.
  • By extracting the image of the tracking target based on its statistical information, the proportion of the background image included in the object image Iob is kept lower than when the region of the tracking target is extracted using a rectangular filter. As a result, noise can be kept low.
  • The radii of the extraction area PA (the ellipse radii (rx, ry)) are determined based on the variance-covariance values (ρxx, ρxy, ρyy), and the extraction area PA is determined based on the average position (μx, μy) and the ellipse radii (rx, ry).
  • The extraction area PA is also determined based on the statistical information of the pixel region of the tracking target and the type information of the tracking target.
  • As a result, the ellipse radii (rx, ry) can be adjusted depending on the type of the tracking target, and the proportion of the background image included in the object image Iob can be effectively suppressed. As a result, noise can be kept low.
  • Statistical information and tracking target type information are generated from the input image Iin using a neural network.
  • As a result, the amount of calculation can be kept low compared to the case where a filter is swept over the image.
  • In the above embodiment, the object image extraction unit 40 may determine not only the radii of the extraction area PA but also an offset α of the average position (μx, μy).
  • When the tracking target is a person, for example, facial information can be important information for tracking.
  • In that case, the entire face can be included in the object image Iob by shifting the extraction area PA upward from the average position (μx, μy) by +α.
  • Since the position of the face can be predicted from the size of the ellipse radii (rx, ry), the value of α can be determined based on the ellipse radii (rx, ry).
  • In other words, the object image extraction unit 40 may determine the offset α of the average position (μx, μy) based on the size of the ellipse radii (rx, ry). This makes it possible to perform tracking using a face image.
  • In the above embodiment, the object detection unit 20 may derive correction values (ρxx', ρxy', ρyy') by correcting the generated variance-covariance values (ρxx, ρxy, ρyy) using the t-distribution (a sketch of one such correction appears after this list).
  • The Mahalanobis distance represents the spread of the data when the pixel distribution of the tracking target is assumed to be a normal distribution.
  • For a tracking target with a small pixel region, however, the ellipse radii (rx, ry) may be derived using a distribution with a larger variance than the normal distribution. Specifically, for a tracking target with a small pixel region, the variance-covariance values (ρxx, ρxy, ρyy) obtained by assuming that the pixel distribution of the tracking target is a normal distribution are corrected using the t-distribution,
  • and the ellipse radii (rx, ry) may be derived using the correction values (ρxx', ρxy', ρyy') obtained in this way.
  • The t-distribution is, for example, a probability density function as shown in FIG. 6.
  • This probability density function is a conditional probability density function in which the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy), which are the parameters of a normal distribution, are known.
  • This probability density function also has the property that it approaches a normal distribution as the degrees of freedom ν approach infinity.
  • When the degrees of freedom ν are small, that is, when the number of pixels included in the pixel region of the tracking target is small, the pixel distribution of the tracking target is modeled as a flatter, more spread-out distribution than the normal distribution.
  • In the above embodiment, the object tracking unit 50 may generate a histogram Hg for each object image Iob and, based on the generated histograms Hg, determine whether the same tracking target is included in a plurality of input images Iin obtained at different times, for example as shown in FIG. 7.
  • The object tracking unit 50 compares the average position (μx(t), μy(t)) and ellipse radii (rx(t), ry(t)) at time t with the average position (μx(t+1), μy(t+1)) and ellipse radii (rx(t+1), ry(t+1)) at time t+1.
  • In addition, the object tracking unit 50 compares the histogram Hg(t) at time t with the histogram Hg(t+1) at time t+1, for example as shown in FIG. 7.
  • The object tracking unit 50 then determines whether the two input images Iin(t) and Iin(t+1) include the same tracking target based on the degree of matching. If the same tracking target is included in the two input images Iin(t) and Iin(t+1), the object tracking unit 50 outputs, for example, the object image Iob(t+1) containing the tracking target and the average position (μx(t+1), μy(t+1)) of the pixel region of the tracking target to the outside. By using the histograms Hg for the determination in this way, the determination can be performed with higher accuracy.
  • The information processing system 1 may further include a frame interpolation unit 60, for example as shown in FIG. 8.
  • The frame interpolation unit 60 includes, for example, a CPU and a GPU.
  • The frame interpolation unit 60 executes a series of procedures described in the program 33 by loading various programs (for example, the program 33) stored in the storage unit 30 into the CPU.
  • Alternatively, the frame interpolation unit 60 may be configured with an MPU that executes its functions.
  • The object detection unit 20 generates statistical information at a predetermined operating frequency.
  • The frame interpolation unit 60 performs frame interpolation of the statistical information generated by the object detection unit 20 using a Kalman filter (a sketch of such an interpolation appears after this list).
  • When the statistical information generated by the object detection section 20 is input, the Kalman filter generates estimated values of the statistical information at a frequency higher than the operating frequency of the object detection section 20, based on the input statistical information.
  • The frame interpolation unit 60 outputs the estimated values of the statistical information generated using the Kalman filter to the object image extraction unit 40.
  • The object image extraction unit 40 determines the extraction area PA of the tracking target based on the estimated values of the statistical information generated by the frame interpolation unit 60 and the tracking target type information generated by the object detection unit 20.
  • In this way, the operating frequency of the neural network can be reduced to 1/N.
  • As a result, the amount of calculation of the neural network can be reduced.
  • FIG. 9 shows an example of functional blocks of the information processing system 2.
  • The information processing system 2 includes an image DB (database) generation unit 70 and an image DB 80 in place of the object tracking unit 50 of the information processing system 1 according to the above embodiment and its modifications A to D.
  • The image DB generation unit 70 stores the generated object image Iob in the image DB 80 every time an object image Iob is generated by the object image extraction unit 40.
  • The image DB 80 therefore stores a large number of object images Iob.
  • The image DB 80 can be used, for example, as a source of learning images for a neural network (images including a tracking target).
  • The object detection section 20, the storage section 30, the object image extraction section 40, and the object tracking section 50 may all be installed in a common information processing device.
  • The sensor device section 10, the object detection section 20, the storage section 30, the object image extraction section 40, and the object tracking section 50 may all be installed in a common information processing device.
  • The object detection section 20, the storage section 30, the object image extraction section 40, the image DB generation section 70, and the image DB 80 may all be installed in a common information processing device.
  • The sensor device section 10, the object detection section 20, the storage section 30, the object image extraction section 40, the image DB generation section 70, and the image DB 80 may all be installed in a common information processing device.
  • Note that the present disclosure can also take the following configurations.
  • (1) An information processing device including: an object detection unit that generates statistical information of a pixel region of a tracking target included in an input image; and an object image extraction unit that determines an extraction region of the tracking target based on the statistical information and extracts an object image from the input image as an image of the extraction region.
  • (2) The information processing device according to (1), in which the statistical information includes an average position and a variance-covariance value of the pixel region of the tracking target, the extraction region has an elliptical shape, and the object image extraction unit determines a radius of the extraction region based on the variance-covariance value and determines the extraction region based on the average position and the radius.
  • (3) The information processing device according to (2), in which the object detection unit derives a correction value by correcting the generated variance-covariance value using a t-distribution, and the object image extraction unit determines the radius of the extraction region based on the correction value.
  • (4) The information processing device according to (1), in which the object detection unit generates the statistical information and type information of the tracking target from the input image, and the object image extraction unit determines the extraction region based on the statistical information and the type information.
  • (5) The information processing device according to (4), in which the statistical information includes an average position and a variance-covariance value of the pixel distribution of the tracking target, the extraction region has an elliptical shape, and the object image extraction unit determines a radius of the extraction region and an offset of the average position based on the variance-covariance value and the type information, and determines the extraction region based on the average position, the radius, and the offset.
  • (6) The information processing device according to (5), in which the object detection unit derives a correction value by correcting the generated variance-covariance value using a t-distribution, and the object image extraction unit determines the radius of the extraction region and the offset of the average position based on the correction value and the type information.
  • (7) The information processing device according to any one of (1) to (3), in which the object detection unit outputs the statistical information from the input image using a neural network.
  • (8) The information processing device according to any one of (4) to (6), in which the object detection unit outputs the statistical information and the type information from the input image using a neural network.
  • (9) The information processing device according to any one of (1) to (8), in which the object detection unit generates the statistical information at a predetermined operating frequency, the information processing device further including a frame interpolation unit that performs frame interpolation of the statistical information generated by the object detection unit using a Kalman filter.
  • (10) The information processing device according to any one of (1) to (9), further including an object tracking unit that determines whether the same tracking target is included in a plurality of input images obtained at different times.
  • An information processing method including: generating statistical information of a pixel region of a tracking target included in an input image; and determining an extraction region of the tracking target based on the statistical information and extracting an object image from the input image as an image of the extraction region.
  • An information processing program that causes a computer to execute: generating statistical information of a pixel region of a tracking target included in an input image; and determining an extraction region of the tracking target based on the statistical information and extracting an object image from the input image as an image of the extraction region.
  • In the information processing device, the information processing method, and the information processing program according to an embodiment of the present disclosure, an extraction region of a tracking target is determined based on statistical information of a pixel region of the tracking target included in an input image, and an object image is extracted from the input image as an image of the extraction region.
  • By extracting the image of the tracking target based on its statistical information, the proportion of the background image included in the object image can be kept lower than when the image of the tracking target is extracted using a rectangular filter. As a result, noise can be kept low.
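The t-distribution correction mentioned above (modification B) can be sketched as follows. This is one possible reading, not the patent's formula: the variance-covariance values are scaled by ν/(ν−2), which inflates them for small pixel regions and tends to 1 as the degrees of freedom grow, matching the stated property that the t-distribution approaches the normal distribution.

```python
import numpy as np

def t_corrected_covariance(cov: np.ndarray, n_pixels: int) -> np.ndarray:
    """Inflate a 2x2 variance-covariance matrix for a small pixel region.

    cov: variance-covariance matrix [[rho_xx, rho_xy], [rho_xy, rho_yy]].
    n_pixels: number of pixels in the tracking-target region.
    """
    nu = max(n_pixels - 1, 3)           # degrees of freedom, kept above 2
    # A t-distribution with nu degrees of freedom has nu / (nu - 2) times the
    # variance of the underlying normal; the factor tends to 1 as nu grows.
    return cov * (nu / (nu - 2.0))
```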
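The Kalman-filter frame interpolation mentioned above (modification D) can likewise be sketched. The patent states only that a Kalman filter produces estimates of the statistical information at a higher rate than the object detection unit 20 operates; the constant-velocity model, the restriction to the average position, and the noise parameters below are assumptions for illustration.

```python
import numpy as np

class MeanPositionKalman:
    """Constant-velocity Kalman filter over the average position (mu_x, mu_y).

    Call update() with each detector output first, then predict() for the
    interpolated frames between detector outputs.
    """

    def __init__(self, process_noise: float = 1e-2, measurement_noise: float = 1.0):
        self.x = None                                   # state [x, y, vx, vy]
        self.P = np.eye(4)                              # state covariance
        self.Q = process_noise * np.eye(4)              # process noise
        self.R = measurement_noise * np.eye(2)          # measurement noise
        self.H = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]])       # only the position is observed

    def update(self, mu):
        """Fold in one detector output (mu_x, mu_y) at the detector's rate."""
        z = np.asarray(mu, dtype=float)
        if self.x is None:
            self.x = np.array([z[0], z[1], 0.0, 0.0])
            return
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

    def predict(self, dt: float = 1.0):
        """Advance the state by dt frames and return the interpolated position."""
        F = np.array([[1.0, 0.0, dt, 0.0],
                      [0.0, 1.0, 0.0, dt],
                      [0.0, 0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0]])
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        return self.x[:2]
```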

Abstract

An information processing apparatus according to an embodiment of the present disclosure comprises an object detecting unit and an object image extracting unit. The object detecting unit generates statistical information relating to a pixel region to be tracked, included in an input image. The object image extracting unit determines an extraction region to be tracked, on the basis of the statistical information, and extracts an object image, as an image of the extraction region, from the input image.

Description

Information processing device, information processing method, and information processing program
The present disclosure relates to an information processing device, an information processing method, and an information processing program.
In recent years, there has been an increasing need for mobile bodies, such as robots, that recognize the external environment and move autonomously according to the recognized environment. A technique for recognizing surrounding people and objects is disclosed in Patent Document 1, for example.
Patent Document 1: Japanese Patent Application Publication No. 2010-220122
In Patent Document 1, a rectangular filter is used when extracting a person or an object from an acquired surrounding image. However, when a rectangular filter is used, a large amount of background image that becomes noise is included, and, for example, the accuracy of identifying the same tracking target between frames (tracking accuracy) may be reduced. It is therefore desirable to provide an information processing device, an information processing method, and an information processing program that can keep noise low.
An information processing device according to an embodiment of the present disclosure includes an object detection section and an object image extraction section. The object detection section generates statistical information of a pixel region of a tracking target included in an input image. The object image extraction section determines an extraction region of the tracking target based on the statistical information, and extracts an object image from the input image as an image of the extraction region.
An information processing method according to an embodiment of the present disclosure includes the following two steps:
(A1) generating statistical information of a pixel region of a tracking target included in an input image; and
(A2) determining an extraction region of the tracking target based on the statistical information, and extracting an object image from the input image as an image of the extraction region.
An information processing program according to an embodiment of the present disclosure causes a computer to execute the following two steps:
(B1) generating statistical information of a pixel region of a tracking target included in an input image; and
(B2) determining an extraction region of the tracking target based on the statistical information, and extracting an object image from the input image as an image of the extraction region.
In the information processing device, information processing method, and information processing program according to an embodiment of the present disclosure, an extraction region of a tracking target is determined based on statistical information of a pixel region of the tracking target included in an input image, and an object image is extracted from the input image as an image of the extraction region. By extracting the image of the tracking target based on its statistical information, the proportion of the background image included in the object image can be kept lower than when the image of the tracking target is extracted using a rectangular filter.
FIG. 1 is a diagram illustrating an example of the functional blocks of an information processing system according to a first embodiment of the present disclosure.
FIG. 2 is a diagram illustrating a method for determining the same object using two object images.
FIG. 3 is a diagram illustrating a method of acquiring an object image from an input image.
FIG. 4 is a diagram illustrating an example of information processing in the information processing system.
FIG. 5 is a diagram showing an example in which an offset is provided in the extraction area.
FIG. 6 is a diagram showing an example of the t-distribution.
FIG. 7 is a diagram illustrating a method for determining identical objects using a histogram.
FIG. 8 is a diagram illustrating a modified example of the functional blocks of the information processing system of FIG. 1.
FIG. 9 is a diagram illustrating an example of the functional blocks of an information processing system according to a second embodiment of the present disclosure.
Hereinafter, embodiments for carrying out the present disclosure will be described in detail with reference to the drawings. The description is given in the following order.
1. Background
2. First embodiment (FIGS. 1 to 4)
3. Modifications (FIGS. 5 to 8)
4. Second embodiment (FIG. 9)
<1. Background>
There is a growing need for autonomous mobile robots that travel through offices, towns, and the like. By tracking the people and objects around such a robot, actions such as collision avoidance based on behavior prediction and following of a tracked target become possible. Since the robot is battery-driven, a tracking method that is highly accurate yet consumes little power is required.
Conventionally, methods of acquiring surrounding images and tracking surrounding people and objects have been known. In such a method, a pixel region corresponding to a tracking target is extracted for each frame, and the same tracking target is identified between frames using the degree of matching of the pixel regions as an index. When the pixel region of the tracking target is extracted, a method is known in which a rectangular or specifically shaped filter is swept over the image. Such a filter is sometimes called a kernel.
However, this method has various problems. First, since the size of the tracking target in the image is unknown, filters of various sizes must be swept over the image, which increases the amount of calculation. Furthermore, when partial regions such as hands and feet are also to be detected, filters corresponding to them must be swept as well. In addition, since the number of people or objects in the image is unknown, extracting only the best-matching pixel region makes it difficult to detect multiple people, while attempting to detect multiple people by setting a matching threshold may cause the same person to be detected multiple times.
It is therefore conceivable to extract the pixel region of the tracking target using a neural network instead of a sweep (see, for example, Patent Document 1). In this case, no sweep is needed, so the amount of calculation can be reduced. Even when the number of people or objects in the image is unknown, the tracking target can be identified with high accuracy; partial regions such as hands and feet can also be detected, and occlusion can be handled. However, when a rectangular filter surrounding the tracking target is used in neural-network-based extraction of the tracking target, a large amount of background image that becomes noise is included, which may reduce pattern-matching accuracy.
In neural-network-based extraction of the tracking target, if a filter with a specific shape matched to, for example, the shape of a person is used, almost no background image that becomes noise is included, and pattern-matching accuracy may improve. However, since a neural network performs inference on a pixel-by-pixel basis, using such a specifically shaped filter requires a network with a large amount of calculation. Thus, with conventional methods it is difficult to keep both the amount of calculation and the amount of noise low. A scheme that can keep both the amount of calculation and the amount of noise low is therefore described below.
<2. First embodiment>
[Configuration]
An information processing system 1 according to the first embodiment of the present disclosure will be described. FIG. 1 shows an example of the functional blocks of the information processing system 1. As shown in FIG. 1, the information processing system 1 includes, for example, a sensor device section 10, an object detection section 20, a storage section 30, an object image extraction section 40, and an object tracking section 50.
The object detection section 20, the object image extraction section 40, and the object tracking section 50 include, for example, a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). They execute a series of procedures described in a program 33 (described later) by loading various programs stored in the storage section 30 into the CPU. Note that the object detection section 20, the object image extraction section 40, and the object tracking section 50 may instead be configured with an MPU (Micro-Processing Unit) that executes the respective functions.
The sensor device section 10 includes, for example, a sensor element that recognizes the external environment and acquires environmental data corresponding to the recognized external environment. The sensor element outputs the acquired environmental data to the object detection section 20. The sensor element is, for example, an RGB camera, an RGB-D camera, a depth sensor, an infrared sensor, or an event camera.
The RGB camera is, for example, a monocular visible-light image sensor that outputs an RGB image obtained by receiving visible light and converting it into an electrical signal. The RGB-D camera is, for example, a binocular visible-light image sensor, and outputs an RGB-D image (an RGB image and a distance image obtained from parallax). The depth sensor is, for example, a ToF (Time of Flight) sensor or a LiDAR (Laser Imaging Detection and Ranging) sensor, and outputs a distance image obtained by measuring scattered light in response to pulsed laser irradiation. The infrared sensor outputs an infrared image obtained by, for example, receiving infrared light and converting it into an electrical signal. The event camera is, for example, a monocular visible-light image sensor, and outputs the difference between RGB images of successive frames (a difference image). The sensor device section 10 outputs various images obtained from the external environment (for example, an RGB image, an RGB-D image, a distance image, an infrared image, or a difference image) as an input image Iin (see FIG. 2(A)).
The object detection unit 20 generates statistical information of the pixel region of the tracking target included in the input image Iin obtained by the sensor device section 10, together with type information of the tracking target. The object detection unit 20 stores the generated statistical information and type information, and the input image Iin, in the storage unit 30. The statistical information generated by the object detection unit 20 includes the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy) of the pixel region of the tracking target (see FIG. 2(B)). The statistical information generated by the object detection unit 20 is stored as the statistical information 31 in the storage unit 30.
The average position (μx, μy) is, for example, the two-dimensional coordinate corresponding to the center position in the X-axis direction and the center position in the Y-axis direction of the pixel region of the tracking target included in the input image Iin. The variance-covariance values (ρxx, ρxy, ρyy) are the variance-covariance values of that pixel region: ρxx is the (1, 1) element, ρyy the (2, 2) element, and ρxy the (2, 1) and (1, 2) elements of the variance-covariance matrix.
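Purely for illustration (the patent does not specify how these statistics are computed, and in the embodiment they are produced by the neural network described next), the quantities above could be obtained from a hypothetical binary mask of the tracking-target pixels as follows:

```python
import numpy as np

def region_statistics(mask: np.ndarray):
    """mask: 2D boolean array, True where a pixel belongs to the tracking target."""
    ys, xs = np.nonzero(mask)                     # coordinates of the target pixels
    mu_x, mu_y = xs.mean(), ys.mean()             # average position (mu_x, mu_y)
    cov = np.cov(np.stack([xs, ys]))              # 2x2 variance-covariance matrix
    rho_xx, rho_xy, rho_yy = cov[0, 0], cov[0, 1], cov[1, 1]
    return (mu_x, mu_y), (rho_xx, rho_xy, rho_yy)
```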
The type information generated by the object detection unit 20 includes, for example, the name of the tracked object, such as a person or a car. The name of the tracked object roughly indicates characteristics such as the shape and size of the tracked object. The type information generated by the object detection unit 20 is stored as the type information 32 in the storage unit 30.
The object detection unit 20 has a machine learning model such as a neural network. This neural network is, for example, a learning model trained using, as teaching data, learning images (images containing a tracking target), the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) of the pixel region of the tracking target included in each learning image, and the type information of the tracking target included in each learning image. When the input image Iin is input, this neural network outputs the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) of the pixel region of the tracking target included in the input image Iin and the type information of that tracking target. The object detection unit 20 uses the neural network to output, from the input image Iin, the statistical information (average position (μx, μy) and variance-covariance values (ρxx, ρxy, ρyy)) and the type information of the tracking target.
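The patent does not disclose a network architecture. Purely as a sketch of the interface, a model could regress the five statistical values and score the target type per object slot; the PyTorch layers, the slot-based output, and all sizes below are assumptions and are not the patent's network.

```python
import torch
import torch.nn as nn

class StatDetector(nn.Module):
    """Toy head that outputs (mu_x, mu_y, rho_xx, rho_xy, rho_yy) and type scores."""

    def __init__(self, feat_ch: int = 64, num_types: int = 10, num_slots: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.stats_head = nn.Linear(feat_ch, num_slots * 5)           # 5 statistics per slot
        self.type_head = nn.Linear(feat_ch, num_slots * num_types)    # type scores per slot
        self.num_slots = num_slots
        self.num_types = num_types

    def forward(self, x: torch.Tensor):
        f = self.backbone(x).flatten(1)
        stats = self.stats_head(f).view(-1, self.num_slots, 5)
        types = self.type_head(f).view(-1, self.num_slots, self.num_types)
        return stats, types
```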
The storage unit 30 is, for example, a recording medium such as a semiconductor memory or a hard disk. The storage unit 30 stores various programs (for example, the program 33). The program 33 describes a series of procedures for realizing the respective functions of the object detection section 20, the object image extraction section 40, and the object tracking section 50. The storage unit 30 also stores various data generated by executing the program 33 (for example, the input image Iin, the object image Iob, the statistical information 31, and the type information 32).
The object image extraction unit 40 determines the extraction area PA of the tracking target based on the statistical information generated by the object detection unit 20 and the type information of the tracking target (see FIG. 2(C)). Specifically, the object image extraction unit 40 determines the extraction area PA based on the average position (μx, μy), the variance-covariance values (ρxx, ρxy, ρyy), and the tracking target type information. The extraction area PA has an elliptical shape. The object image extraction unit 40 derives the radii of the ellipse based on the variance-covariance values (ρxx, ρxy, ρyy), and corrects the derived radii based on the tracking target type information. The values derived in this way (the radii (rx, ry)) correspond to a Mahalanobis distance. The ellipse whose radius is the Mahalanobis distance has a spread corresponding to the pixel distribution of the tracking target (see FIG. 2(C)). Therefore, by using such an elliptical filter, it is possible to obtain an image (object image Iob) in which the background image that becomes noise is suppressed (see FIG. 2(D)).
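A minimal sketch of such an elliptical filter, assuming the statistics above and a boundary at a fixed Mahalanobis distance k (the pixel-grid implementation is an assumption; the patent does not give one):

```python
import numpy as np

def elliptical_mask(shape, mu, cov, k=3.0):
    """True inside the ellipse of Mahalanobis distance k around the average position.

    shape: (height, width) of the input image Iin.
    mu:    average position (mu_x, mu_y).
    cov:   2x2 variance-covariance matrix [[rho_xx, rho_xy], [rho_xy, rho_yy]].
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)                    # offsets from the mean
    m2 = np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d)      # squared Mahalanobis distance
    return m2 <= k * k
```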
For example, when the tracking target is a car, the car has an overall rounded, egg-like shape, so the car to be tracked can be surrounded by an ellipse regardless of the direction from which it is viewed. Therefore, by determining an ellipse with a Mahalanobis distance of 3 as the extraction area PA, approximately 99.7% of the pixel region of the tracking target can be covered by the extraction area PA while suppressing the mixing-in of noise. On the other hand, when the tracking target is a person, if an ellipse with a Mahalanobis distance of 3 is determined as the extraction area PA, the extraction area PA also covers the ends of the person's limbs, so a large amount of noise may be mixed in and hinder target tracking. In this way, by changing the Mahalanobis-distance setting according to the type of the tracked object, optimal tracking can be performed for each type of target.
For example, when the tracking target type information is a person, the object image extraction unit 40 corrects the ellipse radii derived from the variance-covariance values (ρxx, ρxy, ρyy) to be smaller. When the tracking target type information is a passenger car, the object image extraction unit 40 corrects the ellipse radii derived from the variance-covariance values (ρxx, ρxy, ρyy) to be larger. The object image extraction unit 40 outputs the values derived in this way as the ellipse radii (rx, ry) of the extraction area PA. The derived ellipse radii (rx, ry) are stored, together with the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy), as the statistical information 31 in the storage unit 30 in association with the identifier of the tracked object.
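The concrete per-type settings are not given in the patent; the mapping below is a hypothetical placeholder that only illustrates the described direction of the correction (a smaller distance for a person, a larger one for a car):

```python
# Hypothetical per-type Mahalanobis-distance settings (illustrative values only).
TYPE_TO_MAHALANOBIS = {"person": 2.0, "car": 3.0}

def mahalanobis_k(type_name: str, default: float = 3.0) -> float:
    """Return the Mahalanobis-distance setting for a tracking-target type."""
    return TYPE_TO_MAHALANOBIS.get(type_name, default)
```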
The object image extraction unit 40 further extracts an object image Iob from the input image Iin as the image of the extraction area PA (see FIG. 2(D)). That is, the object image extraction unit 40 extracts the object image Iob from the input image Iin based on the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy).
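Extraction of the object image Iob as the pixels inside the extraction area PA could then look like the following sketch (background pixels are simply zeroed; the patent does not specify how pixels outside PA are treated):

```python
import numpy as np

def extract_object_image(iin: np.ndarray, pa_mask: np.ndarray) -> np.ndarray:
    """Keep only the pixels of Iin that fall inside the extraction area PA."""
    iob = iin.copy()
    iob[~pa_mask] = 0
    return iob
```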
The object tracking unit 50 tracks the tracking target included in the input image Iin. To perform such tracking, the object tracking unit 50 identifies the same tracking target between frames. Specifically, the object tracking unit 50 determines whether the same tracking target is included in a plurality of input images Iin obtained at different times.
For example, as shown in FIG. 3, the object tracking unit 50 compares the average position (μx(t), μy(t)) and the ellipse radii (rx(t), ry(t)) obtained from the input image Iin(t) at time t with the average position (μx(t+1), μy(t+1)) and the ellipse radii (rx(t+1), ry(t+1)) obtained from the input image Iin(t+1) at time t+1. The object tracking unit 50 then determines whether the two input images Iin(t) and Iin(t+1) include the same tracking target based on the degree of matching. If the same tracking target is included in the two input images Iin(t) and Iin(t+1), the object tracking unit 50 outputs, for example, the object image Iob(t+1) containing the tracking target and the average position (μx(t+1), μy(t+1)) of the pixel region of the tracking target to the outside.
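The patent states only that the degree of matching of these quantities is used; the thresholded comparison below is one possible sketch, and both tolerance values are assumptions:

```python
def same_target(mu_t, r_t, mu_t1, r_t1, pos_tol=20.0, radius_tol=0.3):
    """Crude agreement check between frames t and t+1 for one tracked object."""
    d_pos = ((mu_t1[0] - mu_t[0]) ** 2 + (mu_t1[1] - mu_t[1]) ** 2) ** 0.5
    d_rad = max(abs(r_t1[0] - r_t[0]) / max(r_t[0], 1e-6),
                abs(r_t1[1] - r_t[1]) / max(r_t[1], 1e-6))
    return d_pos <= pos_tol and d_rad <= radius_tol
```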
[Operation]
Next, information processing in the information processing system 1 will be described. FIG. 3 shows an example of the information processing in the information processing system 1.
First, the sensor device section 10 acquires the input image Iin (step S101). Next, the object detection unit 20 generates statistical information of the pixel region of the tracking target included in the input image Iin obtained by the sensor device section 10, together with type information of the tracking target (step S102). Next, the object image extraction unit 40 determines the extraction area PA of the pixels of the tracking target based on the statistical information generated by the object detection unit 20 and the type information of the tracking target (step S103). Next, the object image extraction unit 40 extracts the object image Iob from the input image Iin as the image of the extraction area PA (step S104).
Next, the object tracking unit 50 identifies the same tracking target between frames. Specifically, the object tracking unit 50 determines whether the same tracking target is included in a plurality of input images Iin obtained at different times (step S105). If the same tracking target is included in the plurality of input images Iin (step S105; Y), the object tracking unit 50 tracks the tracking target (step S106); for example, it outputs the object image Iob including the tracking target and the average position (μx, μy) of the pixel region of the tracking target to the outside. On the other hand, if the same tracking target is not included in the plurality of input images Iin (step S105; N), the object tracking unit 50 ends tracking of the tracking target (step S107); for example, it outputs an error code to the outside.
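Putting steps S101 to S107 together, a single-target sketch of the loop might look as follows. It assumes the helper functions from the earlier sketches (`mahalanobis_k`, `elliptical_mask`, `extract_object_image`, `same_target`) and approximates the ellipse radii from the diagonal of the covariance, which is a simplification rather than the patent's procedure.

```python
def tracking_loop(sensor, detector, max_frames=1000):
    """sensor() returns Iin; detector(Iin) returns ((mu, cov), type_name)."""
    prev = None
    for _ in range(max_frames):
        iin = sensor()                                         # S101: acquire input image
        (mu, cov), type_name = detector(iin)                   # S102: statistics and type
        k = mahalanobis_k(type_name)                           # S103: extraction area PA
        pa = elliptical_mask(iin.shape[:2], mu, cov, k)
        iob = extract_object_image(iin, pa)                    # S104: object image Iob
        r = (k * cov[0][0] ** 0.5, k * cov[1][1] ** 0.5)       # rough ellipse radii (rx, ry)
        if prev is None or same_target(prev["mu"], prev["r"], mu, r):
            prev = {"mu": mu, "r": r}                          # S105; Y -> S106: keep tracking
            yield iob, mu
        else:
            break                                              # S105; N -> S107: end tracking
```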
[Effects]
Next, the effects of the information processing system 1 will be described.
In the present embodiment, the extraction region PA of the tracking target is determined based on the statistical information of the pixel region of the tracking target included in the input image Iin, and the object image Iob is extracted from the input image Iin as an image of the extraction region PA. By extracting the image of the tracking target based on its statistical information in this way, the proportion of background image included in the object image Iob is reduced compared with the case where the region of the tracking target is extracted using a rectangular filter. As a result, noise can be kept low.
Furthermore, in the present embodiment, the radius of the extraction region PA (the ellipse radius (rx, ry)) is determined based on the variance-covariance values (ρxx, ρxy, ρyy), and the extraction region PA is determined based on the average position (μx, μy) and the ellipse radius (rx, ry). By determining the extraction region PA based on the variance-covariance values of the pixel region of the tracking target in this way, the proportion of background image included in the object image Iob is reduced compared with the case where a rectangular filter is used. As a result, noise can be kept low.
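As a purely illustrative sketch, an elliptical extraction region derived from the average position and variance-covariance values could be realized as a pixel mask as follows; the scale factor k that turns the statistics into ellipse radii is an assumption and is not fixed by the present disclosure.

```python
import numpy as np

def elliptical_mask(shape, mu, cov, k=2.0):
    """Boolean mask of an elliptical extraction region PA.

    mu  = (μx, μy): average position of the tracking-target pixels
    cov = [[ρxx, ρxy], [ρxy, ρyy]]: variance-covariance matrix
    k   : assumed scale factor; pixels within Mahalanobis distance k are kept.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - mu[0], ys - mu[1]], axis=-1)
    inv_cov = np.linalg.inv(np.asarray(cov, dtype=float))
    # Squared Mahalanobis distance of every pixel from the average position.
    m2 = np.einsum('...i,ij,...j->...', d, inv_cov, d)
    return m2 <= k ** 2

# Object image Iob: input pixels inside the mask, background suppressed.
# iob = np.where(elliptical_mask(iin.shape[:2], (mu_x, mu_y), cov)[..., None], iin, 0)
```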
Furthermore, in the present embodiment, the extraction region PA is determined based on the statistical information of the pixel region of the tracking target and the type information of the tracking target. This makes it possible, for example, to adjust the ellipse radius (rx, ry) according to the type of the tracking target, so that the proportion of background image included in the object image Iob can be effectively suppressed. As a result, noise can be kept low.
Furthermore, in the present embodiment, the statistical information and the type information of the tracking target are generated from the input image Iin using a neural network. This keeps the amount of computation low compared with the case where a filter is swept over the image.
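The present disclosure does not prescribe any particular network architecture; only as an assumed illustration, a detection head could regress the five statistics and per-class type scores from a backbone feature vector, for example (the use of PyTorch here is itself an illustrative choice):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Assumed example head: maps a backbone feature vector to the
    statistics (μx, μy, ρxx, ρxy, ρyy) and per-class type scores."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.stats = nn.Linear(in_features, 5)           # μx, μy, ρxx, ρxy, ρyy
        self.kind = nn.Linear(in_features, num_classes)  # type information

    def forward(self, feat: torch.Tensor):
        return self.stats(feat), self.kind(feat).softmax(dim=-1)
```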
Furthermore, in the present embodiment, it is determined whether the same tracking target is included in a plurality of input images Iin obtained at different times. Accordingly, when the same tracking target is identified between frames, the tracking target included in the input image Iin can be tracked.
<2. Modifications>
Modifications of the information processing system 1 according to the above embodiment will be described below. In the following modifications, components common to the above embodiment are denoted by the same reference numerals.
[Modification A]
In the above embodiment, the object image extraction unit 40 may determine not only the radius of the extraction region PA but also an offset α of the average position (μx, μy), based on the variance-covariance values (ρxx, ρxy, ρyy) and the type information of the tracking target.
For example, when the tracking target is a person, facial information can be important for tracking. For example, when a pedestrian in a street is the tracking target, the entire face can be included in the object image Iob by moving the extraction region PA to a position +α above the average position (μx, μy). In addition, since the position of the face can be predicted from the size of the ellipse radius (rx, ry), the value of α can be determined based on the ellipse radius (rx, ry). Therefore, when the type information of the tracking target indicates a person, the object image extraction unit 40 may determine the offset α of the average position (μx, μy) based on the size of the ellipse radius (rx, ry). This makes it possible to perform tracking or the like using the face image.
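A minimal sketch of such a type-dependent offset is shown below, purely as an assumption; the proportionality constant between the vertical ellipse radius and α is not given in the present disclosure and is chosen here only for illustration.

```python
def offset_for_type(obj_type: str, rx: float, ry: float) -> tuple[float, float]:
    """Assumed offset α of the average position, depending on the target type."""
    if obj_type == "person":
        # Shift the extraction region upward by a fraction of the vertical
        # radius so that the whole face falls inside the object image Iob.
        return (0.0, -0.5 * ry)   # image y-axis grows downward, so "up" is negative
    return (0.0, 0.0)

mu_x, mu_y = 120.0, 200.0
dx, dy = offset_for_type("person", rx=30.0, ry=80.0)
center = (mu_x + dx, mu_y + dy)   # center of the shifted extraction region PA
```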
[Modification B]
In the above embodiment and Modification A, the object detection unit 20 may derive corrected values (ρxx', ρxy', ρyy') by correcting the generated variance-covariance values (ρxx, ρxy, ρyy) using a t-distribution.
The Mahalanobis distance describes the distribution of the data on the assumption that the pixel distribution of the tracking target is a normal distribution. In a neural network, the inference accuracy tends to improve as the size of the pixel region of the tracking target increases, and conversely the accuracy decreases as the size decreases. Therefore, for a tracking target whose pixel region is small, the ellipse radius (rx, ry) may be derived using a distribution whose variance is larger than that of the normal distribution. Specifically, for a tracking target whose pixel region is small, the variance-covariance values (ρxx, ρxy, ρyy) obtained on the assumption that the pixel distribution of the tracking target is a normal distribution may be corrected using a t-distribution, and the ellipse radius (rx, ry) may be derived using the corrected values (ρxx', ρxy', ρyy') obtained thereby.
Here, the t-distribution is, for example, a probability density function as shown in FIG. 6. This probability density function is a conditional probability density function in which the average position (μx, μy) and the variance-covariance values (ρxx, ρxy, ρyy), which are the parameters of the normal distribution, are assumed to be known. This probability density function also has the property of approaching the normal distribution as the degree of freedom ν approaches infinity. When the degree of freedom ν is small, that is, when the number of pixels included in the pixel region of the tracking target is small, the pixel distribution of the tracking target becomes a gentle distribution, like a flattened normal distribution.
In this modification, such a characteristic of the t-distribution is used to correct the variance-covariance values (ρxx, ρxy, ρyy), obtained under the normal-distribution assumption, when the number of pixels included in the pixel region of the tracking target is small. As a result, a decrease in tracking accuracy can be suppressed even when the pixel region of the tracking target contains only a small number of pixels.
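The exact correction formula is not specified in the present disclosure; the following sketch merely assumes the standard relation that a t-distribution with ν degrees of freedom (ν > 2) has covariance ν/(ν−2) times its scale matrix, and inflates the variance-covariance values accordingly when the pixel count is small.

```python
import numpy as np

def corrected_covariance(cov, num_pixels, min_dof=3):
    """Assumed t-distribution based inflation of (ρxx, ρxy, ρyy).

    For ν > 2, a t-distribution has covariance ν / (ν - 2) times its scale
    matrix, so regions with few pixels get a correspondingly wider ellipse.
    """
    nu = max(num_pixels - 1, min_dof)            # assumed choice of ν
    factor = nu / (nu - 2.0)
    return np.asarray(cov, dtype=float) * factor

cov = [[25.0, 3.0], [3.0, 64.0]]                 # (ρxx, ρxy, ρyy) arranged as a matrix
print(corrected_covariance(cov, num_pixels=10))      # noticeably inflated
print(corrected_covariance(cov, num_pixels=10_000))  # almost unchanged
```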
[Modification C]
In the above embodiment and Modifications A and B, the object tracking unit 50 may generate a histogram Hg for each object image Iob, for example as shown in FIG. 7, and may determine, based on the generated histograms Hg, whether the same tracking target is included in a plurality of input images Iin obtained at different times.
For example, as shown in FIG. 7, the object tracking unit 50 compares the average position (μx(t), μy(t)) and the ellipse radius (rx(t), ry(t)) at time t with the average position (μx(t+1), μy(t+1)) and the ellipse radius (rx(t+1), ry(t+1)) at time t+1. Furthermore, as shown in FIG. 7, the object tracking unit 50 compares the histogram Hg(t) at time t with the histogram Hg(t+1) at time t+1. Based on these degrees of agreement, the object tracking unit 50 determines whether the two input images Iin(t) and Iin(t+1) contain the same tracking target. If the two input images Iin(t) and Iin(t+1) contain the same tracking target, the object tracking unit 50 outputs, for example, the object image Iob(t+1) containing that tracking target and the average position (μx(t+1), μy(t+1)) of the pixel region of the tracking target to the outside. Using the histograms Hg in this way makes it possible to perform the above determination with higher accuracy.
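As a hedged illustration, a color histogram per object image and a simple intersection score could be computed as follows; the bin count and the use of histogram intersection are assumptions, since the disclosure does not specify how Hg is built or compared.

```python
import numpy as np

def object_histogram(iob, bins=16):
    """Assumed histogram Hg of an object image Iob (per-channel counts, normalized)."""
    channels = [np.histogram(iob[..., c], bins=bins, range=(0, 256))[0]
                for c in range(iob.shape[-1])]
    hg = np.concatenate(channels).astype(float)
    return hg / max(hg.sum(), 1.0)

def histogram_similarity(hg_t, hg_t1):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return float(np.minimum(hg_t, hg_t1).sum())

# The similarity can be combined with the position/radius agreement when
# deciding whether Iin(t) and Iin(t+1) contain the same tracking target.
```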
[Modification D]
In the above embodiment and Modifications A to C, the information processing system 1 may further include a frame interpolation unit 60, for example as shown in FIG. 8. The frame interpolation unit 60 includes, for example, a CPU and a GPU. The frame interpolation unit 60 executes a series of procedures described in the program 33 by, for example, loading the various programs (for example, the program 33) stored in the storage unit 30 into the CPU. Note that the frame interpolation unit 60 may instead be configured with an MPU that executes its functions.
Here, it is assumed that the object detection unit 20 generates the statistical information at a predetermined operating frequency. In this case, the frame interpolation unit 60 performs frame interpolation of the statistical information generated by the object detection unit 20 using a Kalman filter. When the statistical information generated by the object detection unit 20 is input, the Kalman filter generates estimated values of the statistical information, based on the input statistical information, at a frequency higher than the operating frequency of the object detection unit 20. The estimated values of the statistical information generated using the Kalman filter are output to the object image extraction unit 40. The object image extraction unit 40 determines the extraction region PA of the tracking target based on the estimated values of the statistical information generated by the frame interpolation unit 60 and the type information of the tracking target generated by the object detection unit 20.
Accordingly, by operating the frame interpolation unit 60 at N times the operating frequency of the object detection unit 20, the operating frequency of the neural network can be reduced to 1/N. As a result, the amount of computation of the neural network can be reduced.
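Only as an assumed sketch (the disclosure does not specify the state model), a constant-velocity Kalman filter over the average position could interpolate detections at N times the detector rate as follows.

```python
import numpy as np

class PositionKalman:
    """Assumed constant-velocity Kalman filter over (μx, μy) for frame interpolation."""
    def __init__(self, dt, q=1e-2, r=1.0):
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q, self.R = q * np.eye(4), r * np.eye(2)
        self.x, self.P = np.zeros(4), np.eye(4)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                       # interpolated (μx, μy)

    def update(self, mu):                       # called when new statistics arrive
        y = np.asarray(mu) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

# With N = 4, predict() runs every frame while update() runs only on every
# fourth frame, when the object detection unit 20 produces new statistics.
```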
<3. Second embodiment>
Next, an information processing system 2 according to a second embodiment of the present disclosure will be described. FIG. 9 shows an example of functional blocks of the information processing system 2. As shown in FIG. 9, the information processing system 2 corresponds, for example, to the information processing system 1 according to the above embodiment and its Modifications A to D in which an image DB (DataBase) generation unit 70 and an image DB 80 are provided in place of the object tracking unit 50.
Each time the object image extraction unit 40 generates an object image Iob, the image DB generation unit 70 stores the generated object image Iob in the image DB 80. The image DB 80 thus stores a large number of object images Iob. The image DB 80 can be used, for example, as a set of training images for a neural network (images including the tracking target).
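A minimal sketch of this accumulation step is given below; storing each object image as a file in a directory is an assumption made purely for illustration, not the disclosed implementation.

```python
from pathlib import Path
import numpy as np

class ImageDBGenerator:
    """Assumed image DB generation unit: stores every extracted Iob for later training."""
    def __init__(self, db_dir="image_db"):
        self.db_dir = Path(db_dir)
        self.db_dir.mkdir(parents=True, exist_ok=True)
        self.count = 0

    def store(self, iob: np.ndarray) -> Path:
        path = self.db_dir / f"iob_{self.count:08d}.npy"
        np.save(path, iob)                      # one entry per extracted object image
        self.count += 1
        return path
```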
Although the present disclosure has been described above with reference to a plurality of embodiments and modifications thereof, the present disclosure is not limited to the above embodiments and the like, and various modifications are possible.
For example, in the above embodiments and the like, the object detection unit 20, the storage unit 30, the object image extraction unit 40, and the object tracking unit 50 may all be mounted in a common information processing device. Furthermore, for example, in the above embodiments and the like, the sensor device unit 10, the object detection unit 20, the storage unit 30, the object image extraction unit 40, and the object tracking unit 50 may all be mounted in a common information processing device.
Also, for example, in the above embodiments and the like, the object detection unit 20, the storage unit 30, the object image extraction unit 40, the image DB generation unit 70, and the image DB 80 may all be mounted in a common information processing device. Furthermore, for example, in the above embodiments and the like, the sensor device unit 10, the object detection unit 20, the storage unit 30, the object image extraction unit 40, the image DB generation unit 70, and the image DB 80 may all be mounted in a common information processing device.
Note that the effects described in this specification are merely examples. The effects of the present disclosure are not limited to the effects described herein, and the present disclosure may have effects other than those described herein.
Also, for example, the present disclosure may have the following configurations.
(1)
 An information processing device including:
 an object detection unit that generates statistical information of a pixel region of a tracking target included in an input image; and
 an object image extraction unit that determines an extraction region of the tracking target based on the statistical information and extracts, from the input image, an object image as an image of the extraction region.
(2)
 The information processing device according to (1), wherein
 the statistical information includes an average position and variance-covariance values of the pixel region of the tracking target,
 the extraction region has an elliptical shape, and
 the object image extraction unit determines a radius of the extraction region based on the variance-covariance values, and determines the extraction region based on the average position and the radius.
(3)
 The information processing device according to (2), wherein
 the object detection unit derives corrected values by correcting the generated variance-covariance values using a t-distribution, and
 the object image extraction unit determines the radius of the extraction region based on the corrected values.
(4)
 The information processing device according to (1), wherein
 the object detection unit generates, from the input image, the statistical information and type information of the tracking target, and
 the object image extraction unit determines the extraction region based on the statistical information and the type information.
(5)
 The information processing device according to (4), wherein
 the statistical information includes an average position and variance-covariance values of the pixel distribution of the tracking target,
 the extraction region has an elliptical shape, and
 the object image extraction unit determines a radius of the extraction region and an offset of the average position based on the variance-covariance values and the type information, and determines the extraction region based on the average position, the radius, and the offset.
(6)
 The information processing device according to (5), wherein
 the object detection unit derives corrected values by correcting the generated variance-covariance values using a t-distribution, and
 the object image extraction unit determines the radius of the extraction region and the offset of the average position based on the corrected values and the type information.
(7)
 The information processing device according to any one of (1) to (3), wherein the object detection unit outputs the statistical information from the input image using a neural network.
(8)
 The information processing device according to any one of (4) to (6), wherein the object detection unit outputs the statistical information and the type information from the input image using a neural network.
(9)
 The information processing device according to any one of (1) to (8), wherein
 the object detection unit generates the statistical information at a predetermined operating frequency, and
 the information processing device further includes a frame interpolation unit that performs frame interpolation of the statistical information generated by the object detection unit using a Kalman filter.
(10)
 The information processing device according to any one of (1) to (9), further including an object tracking unit that determines whether the same tracking target is included in a plurality of the input images obtained at different times.
(11)
 The information processing device according to (10), wherein the object tracking unit generates a histogram for each of the object images and performs the determination based on the generated histograms.
(12)
 An information processing method including:
 generating statistical information of a pixel region of a tracking target included in an input image; and
 determining an extraction region of the tracking target based on the statistical information, and extracting, from the input image, an object image as an image of the extraction region.
(13)
 An information processing program that causes a computer to execute:
 generating statistical information of a pixel region of a tracking target included in an input image; and
 determining an extraction region of the tracking target based on the statistical information, and extracting, from the input image, an object image as an image of the extraction region.
In an information processing device, an information processing method, and an information processing program according to an embodiment of the present disclosure, an extraction region of a tracking target is determined based on statistical information of a pixel region of the tracking target included in an input image, and an object image is extracted from the input image as an image of the extraction region. By extracting the image of the tracking target based on its statistical information in this way, the proportion of background image included in the object image can be reduced compared with the case where the image of the tracking target is extracted using a rectangular filter. As a result, noise can be kept low.
This application claims priority based on Japanese Patent Application No. 2022-042856 filed with the Japan Patent Office on March 17, 2022, the entire contents of which are incorporated herein by reference.
It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and alterations may occur depending on design requirements and other factors insofar as they are within the scope of the appended claims or the equivalents thereof.

Claims (13)

1. An information processing apparatus comprising:
    an object detection unit that generates statistical information of a pixel region of a tracking target included in an input image; and
    an object image extraction unit that determines an extraction region of the tracking target based on the statistical information and extracts, from the input image, an object image as an image of the extraction region.
2. The information processing apparatus according to claim 1, wherein
    the statistical information includes an average position and variance-covariance values of the pixel region of the tracking target,
    the extraction region has an elliptical shape, and
    the object image extraction unit determines a radius of the extraction region based on the variance-covariance values, and determines the extraction region based on the average position and the radius.
3. The information processing apparatus according to claim 2, wherein
    the object detection unit derives corrected values by correcting the generated variance-covariance values using a t-distribution, and
    the object image extraction unit determines the radius of the extraction region based on the corrected values.
4. The information processing apparatus according to claim 1, wherein
    the object detection unit generates, from the input image, the statistical information and type information of the tracking target, and
    the object image extraction unit determines the extraction region based on the statistical information and the type information.
5. The information processing apparatus according to claim 4, wherein
    the statistical information includes an average position and variance-covariance values of the pixel distribution of the tracking target,
    the extraction region has an elliptical shape, and
    the object image extraction unit determines a radius of the extraction region and an offset of the average position based on the variance-covariance values and the type information, and determines the extraction region based on the average position, the radius, and the offset.
6. The information processing apparatus according to claim 5, wherein
    the object detection unit derives corrected values by correcting the generated variance-covariance values using a t-distribution, and
    the object image extraction unit determines the radius of the extraction region and the offset of the average position based on the corrected values and the type information.
7. The information processing apparatus according to claim 1, wherein the object detection unit outputs the statistical information from the input image using a neural network.
8. The information processing apparatus according to claim 4, wherein the object detection unit outputs the statistical information and the type information from the input image using a neural network.
9. The information processing apparatus according to claim 1, wherein
    the object detection unit generates the statistical information at a predetermined operating frequency, and
    the information processing apparatus further comprises a frame interpolation unit that performs frame interpolation of the statistical information generated by the object detection unit using a Kalman filter.
10. The information processing apparatus according to claim 1, further comprising an object tracking unit that determines whether the same tracking target is included in a plurality of the input images obtained at different times.
11. The information processing apparatus according to claim 10, wherein the object tracking unit generates a histogram for each of the object images and performs the determination based on the generated histograms.
12. An information processing method comprising:
    generating statistical information of a pixel region of a tracking target included in an input image; and
    determining an extraction region of the tracking target based on the statistical information, and extracting, from the input image, an object image as an image of the extraction region.
13. An information processing program that causes a computer to execute:
    generating statistical information of a pixel region of a tracking target included in an input image; and
    determining an extraction region of the tracking target based on the statistical information, and extracting, from the input image, an object image as an image of the extraction region.
PCT/JP2023/008411 2022-03-17 2023-03-06 Information processing apparatus, information processing method, and information processing program WO2023176562A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022-042856 2022-03-17
JP2022042856 2022-03-17

Publications (1)

Publication Number Publication Date
WO2023176562A1 true WO2023176562A1 (en) 2023-09-21

Family

ID=88023031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/008411 WO2023176562A1 (en) 2022-03-17 2023-03-06 Information processing apparatus, information processing method, and information processing program

Country Status (1)

Country Link
WO (1) WO2023176562A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005122617A (en) * 2003-10-20 2005-05-12 Advanced Telecommunication Research Institute International Real-time object detection and recognition system, and computer-executable program
US20060187305A1 (en) * 2002-07-01 2006-08-24 Trivedi Mohan M Digital processing of video images
JP2018084802A (en) * 2016-11-11 2018-05-31 株式会社東芝 Imaging device, imaging system, and distance information acquisition method
JP2018101165A (en) * 2015-04-27 2018-06-28 国立大学法人 奈良先端科学技術大学院大学 Color image processing method, color image processing program, object recognition method and apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060187305A1 (en) * 2002-07-01 2006-08-24 Trivedi Mohan M Digital processing of video images
JP2005122617A (en) * 2003-10-20 2005-05-12 Advanced Telecommunication Research Institute International Real-time object detection and recognition system, and computer-executable program
JP2018101165A (en) * 2015-04-27 2018-06-28 国立大学法人 奈良先端科学技術大学院大学 Color image processing method, color image processing program, object recognition method and apparatus
JP2018084802A (en) * 2016-11-11 2018-05-31 株式会社東芝 Imaging device, imaging system, and distance information acquisition method

Similar Documents

Publication Publication Date Title
US10216979B2 (en) Image processing apparatus, image processing method, and storage medium to detect parts of an object
US7912253B2 (en) Object recognition method and apparatus therefor
JP4295799B2 (en) Human posture estimation with data-driven probability propagation
JP4459137B2 (en) Image processing apparatus and method
US9378422B2 (en) Image processing apparatus, image processing method, and storage medium
US9294665B2 (en) Feature extraction apparatus, feature extraction program, and image processing apparatus
EP2339507B1 (en) Head detection and localisation method
Tavakkoli et al. Non-parametric statistical background modeling for efficient foreground region detection
US9082000B2 (en) Image processing device and image processing method
JP7272024B2 (en) Object tracking device, monitoring system and object tracking method
US9443137B2 (en) Apparatus and method for detecting body parts
US9305359B2 (en) Image processing method, image processing apparatus, and computer program product
CN110097050B (en) Pedestrian detection method, device, computer equipment and storage medium
KR20170056860A (en) Method of generating image and apparatus thereof
CN114022830A (en) Target determination method and target determination device
EP3772037A1 (en) Image processing apparatus, method of tracking a target object, and program
JP7086878B2 (en) Learning device, learning method, program and recognition device
KR20210133880A (en) Image depth determining method and living body identification method, circuit, device, and medium
JP2019536164A (en) Image processing apparatus, image processing method, and image processing program
CN104063709A (en) Line-of-sight Detection Apparatus, Method, Image Capturing Apparatus And Control Method
US11462052B2 (en) Image processing device, image processing method, and recording medium
JP2014021602A (en) Image processor and image processing method
WO2023176562A1 (en) Information processing apparatus, information processing method, and information processing program
JP6717049B2 (en) Image analysis apparatus, image analysis method and program
Osman et al. Improved skin detection based on dynamic threshold using multi-colour space

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23770518

Country of ref document: EP

Kind code of ref document: A1