WO2024024055A1

WO2024024055A1 - Information processing method, device, and program

Info

Publication number: WO2024024055A1
Application number: PCT/JP2022/029178
Authority: WO
Inventors: 帆楊; 成幸小田嶋
Original assignee: 富士通株式会社
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2024-02-01

Abstract

This information processing device estimates three-dimensional posture information on an object on the basis of a camera parameter and two-dimensional posture information on the object within each 2D-BBOX (420, 421) detected using a detector from images (400, 401) included in a multi-view image, calculates a refined 2D-BBOX range from two-dimensional posture information on the object obtained by reprojecting the estimated three-dimensional position information on each image (400, 401, 402), and generates each calculated 2D-BBOX range as a pseudo-label (440, 441, 442).

Description

Information processing method, device, and program

The disclosed technology relates to an information processing method, an information processing device, and an information processing program.

Conventionally, a target object has been detected from multi-view images obtained by photographing the target object from a plurality of different viewpoints. For example, an image monitoring device has been proposed that photographs a person from different directions, against different backgrounds, using a ceiling camera and a wall camera. This device projects the pixels of the changed area in the wall camera image onto the ceiling camera image to obtain an epipolar line, extracts the area on the epipolar line that has the same characteristics as the pixels of the changed area, and generates a projection area based on the extracted area. Further, this device combines the projection area and the changed area in the ceiling camera image to obtain a combined changed area, and detects a person in the ceiling camera image based on the combined changed area.

Machine learning models such as neural networks are also used to detect objects from images. Performing machine learning on such a model requires a large amount of data with correct labels indicating the position information of an object in an image. However, preparing a large amount of data with correct labels requires a huge amount of work. Therefore, semi-supervised learning methods have also been proposed in which the position information of the object detected by the machine learning model is used as a pseudo label, and machine learning of the model is performed using the pseudo-labeled data in addition to the data with correct labels prepared in advance.

Japanese Patent Application Publication No. 2010-045501

As mentioned above, in semi-supervised learning, the position information of an object detected by a machine learning model is used as a pseudo label. If the accuracy of the detected position information is low, the accuracy of the machine learning model trained with that pseudo label also decreases. In particular, when the target object can take various postures, as a gymnast does, it is difficult to accurately detect the position information of the target object from the image.

As one aspect, the disclosed technology aims to accurately calculate position information of an object in an image.

As one aspect, the disclosed technology acquires a plurality of images taken by each of a plurality of cameras that photograph a target object from a plurality of different viewpoints. The disclosed technology then estimates three-dimensional position information of the object based on two-dimensional position information of the object detected from each of the plurality of images and camera parameters of each of the plurality of cameras. The disclosed technology further projects the three-dimensional position information of the object onto at least one of the plurality of images based on camera parameters of the camera that took that image, and calculates two-dimensional position information of the object in that image.

As one aspect, the disclosed technology has the effect that position information of the object in the image can be calculated with high accuracy.

FIG. 1 is a schematic diagram showing a connection between an information processing device and a camera according to the present embodiment.
FIG. 2 is a diagram for explaining machine learning of a detector that detects 2D-BBOX and detection of 2D-BBOX.
FIG. 3 is a diagram for explaining machine learning of a detector using semi-supervised learning.
FIG. 4 is a functional block diagram of an information processing device according to the present embodiment.
FIG. 5 is a diagram for explaining a 2D-BBOX.
FIG. 6 is a diagram for explaining two-dimensional posture information of a target object.
FIG. 7 is a diagram for explaining projection of three-dimensional posture information of a target object onto an image and calculation of two-dimensional posture information of the target object.
FIG. 8 is a diagram for explaining the effect of calculating two-dimensional posture information by projecting three-dimensional posture information.
FIG. 9 is a diagram for explaining selection of pseudo labels based on spatial restrictions.
FIG. 10 is a diagram for explaining selection of pseudo labels based on time restrictions.
FIG. 11 is a diagram for explaining selection based on evaluation of pseudo labels.
FIG. 12 is a diagram showing a schematic configuration of a computer functioning as an information processing device according to the present embodiment.
FIG. 13 is a flowchart illustrating an example of information processing according to the present embodiment.
FIG. 14 is a diagram illustrating an example of a pseudo label generation result by the information processing device according to the present embodiment.
FIG. 15 is a diagram for explaining application of the information processing device according to the present embodiment to a scoring system for gymnastics competitions.

Hereinafter, an example of an embodiment according to the disclosed technology will be described with reference to the drawings.

As shown in FIG. 1, the information processing device 10 according to the present embodiment is connected to a plurality of cameras 30n, each of which photographs an object 90 (a person in the example of FIG. 1) from a different viewpoint n. In the example of FIG. 1, n = 0, 1, and 2, and a camera 300 that photographs from viewpoint 0, a camera 301 that photographs from viewpoint 1, and a camera 302 that photographs from viewpoint 2 are connected to the information processing device 10. Note that the number of cameras 30n connected to the information processing device 10 is not limited to the example in FIG. 1, and may be two, or four or more.

The camera 30n is installed at an angle and position where the object 90 falls within the photographing range. Images captured by the camera 30n are sequentially input to the information processing device 10. Note that a synchronization signal is sent to each camera 30n, and the images taken by each camera 30n are synchronized.

The information processing device 10 calculates refined two-dimensional position information of the object 90 based on two-dimensional position information of the object 90 detected from each of a plurality of images taken from a plurality of different viewpoints (hereinafter referred to as "multi-view images").

Here, in order to detect the target object 90 from each image, a detector that is a machine learning model such as a neural network is used. For example, as shown in the upper diagram of FIG. 2, this detector is generated by machine learning using a large number of images with correct labels indicating two-dimensional position information of the target object 90. In the example of FIG. 2, the coordinates [x₁, y₁] of the upper-left point and the coordinates [x₂, y₂] of the lower-right point of the 2D-BBOX (two-dimensional bounding box) indicating the area of the object 90 are used as the correct label. As shown in the lower diagram of FIG. 2, when an unlabeled image is input to a detector that has undergone machine learning using images with correct labels, a 2D-BBOX indicating the position of the object 90 is detected from that image.

As mentioned above, preparing a large number of images with correct labels for machine learning of the detector requires a huge amount of work. Therefore, it is conceivable to perform machine learning of the detector using semi-supervised learning as shown in FIG. 3. Specifically, in semi-supervised learning, an unlabeled image is input to a detector that has undergone machine learning ("machine learning 1" in FIG. 3) on images with correct labels, and the detection result is generated as a pseudo label. Machine learning of the detector ("machine learning 2" in FIG. 3) is then performed on the pseudo-labeled image in which the generated pseudo label is added to the original image. As a result, a large number of pseudo-labeled images can be used for machine learning, and machine learning of the detector can be performed even when there are few images with correct labels.

However, when the detection result from a single image is used as a pseudo label, it is difficult to reduce the number of pseudo labels that are false positives or false negatives. Further, even pseudo labels that are neither false positives nor false negatives may suffer from positional bias. Positional bias refers to a deviation of the area indicated by the pseudo label from the actual area of the object 90 in the image, such as a shift in position or a difference in size (the area being larger or smaller than the actual area). Furthermore, especially when the object 90 can take various postures, as a gymnast does, it is difficult to generate balanced pseudo labels for all postures. For example, pseudo labels generated from detections of a gymnast who is simply standing may greatly outnumber pseudo labels generated from detections of the gymnast's postures during a performance. In this case, it cannot be said that balanced pseudo labels are generated according to the variety of postures of gymnasts.

It is also conceivable to use multi-view images to select pseudo labels with high reliability from among the generated pseudo labels, based on the correspondence between images included in the multi-view images. However, in this case, since pseudo labels are selected from the detected 2D-BBOXes, false-positive pseudo labels can be reduced, but false-negative pseudo labels cannot. Furthermore, the problem of positional bias described above and the problem of not being able to generate balanced pseudo labels according to the diversity of postures of the object 90 remain.

Therefore, in this embodiment, the two-dimensional position information of the target object 90 in the image is calculated with high accuracy so that false-positive and false-negative pseudo labels can be reduced. Furthermore, in this embodiment, a pseudo label is generated in which the positional bias with respect to the actual area of the object 90 is corrected. Furthermore, in this embodiment, well-balanced pseudo labels corresponding to the diversity of postures of the object 90 are generated. The information processing device 10 according to this embodiment will be described in detail below.

As shown in FIG. 4, the information processing device 10 functionally includes an acquisition section 11, an estimation section 12, a generation section 13, a selection section 14, and a machine learning section 15. Further, a detector 22 and a camera parameter database (DB) 24 are stored in a predetermined storage area of the information processing device 10. The detector 22 is a machine learning model for detecting a 2D-BBOX indicating the area of the target object 90 from an image, which is generated by machine learning using images with correct answers as training data. The camera parameter DB 24 stores internal parameters and external parameters of each camera 30n. Note that the generation unit 13 is an example of a “calculation unit” of the disclosed technology.

The acquisition unit 11 acquires time-series multi-view images captured by a plurality of cameras 30n.

The estimation unit 12 estimates three-dimensional position information of the object 90 based on the 2D-BBOX indicating the area of the object 90 detected from each image included in the multi-view image and the camera parameters of each camera that captured each image.

Specifically, as shown in FIG. 5, the estimation unit 12 uses the detector 22 to detect a 2D-BBOX 42n indicating the area of the target object 90 from the image 40n taken by the camera 30n. Then, the estimation unit 12 uses a recognition model (not shown) generated in advance by machine learning to recognize one or more parts of the person, which is the object 90, from the detected 2D-BBOX 42n, and estimates the two-dimensional position information of each part. For example, as shown in FIG. 6, when the recognition model recognizes the position of each joint or other part of a person (object 90) (black circles in FIG. 6), the estimation unit 12 estimates the coordinate values of each such position as two-dimensional position information of the object 90. Hereinafter, a group of two-dimensional position information of each part of the object 90, such as joints, will be referred to as two-dimensional posture information.

Furthermore, the estimation unit 12 uses the camera parameters of the camera 30n stored in the camera parameter DB 24 and the estimated two-dimensional posture information of the object 90 to estimate three-dimensional position information of each part of the object 90 by triangulation. Hereinafter, a group of three-dimensional position information of each part of the object 90, such as joints, will be referred to as three-dimensional posture information. When the recognition model recognizes n parts such as joints for one person, which is the object 90, the three-dimensional posture information is expressed as {[P_X¹, P_Y¹, P_Z¹], [P_X², P_Y², P_Z²], ..., [P_Xⁿ, P_Yⁿ, P_Zⁿ]}.
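The text does not specify which triangulation method is used. As a minimal sketch, assuming each camera is described by a 3×4 projection matrix and that a linear least-squares formulation is acceptable, one part (for example, one joint) could be triangulated from two or more views as follows. The function names `solve3` and `triangulate` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # assumed 3x4 projection matrix (3 rows of 4 floats)


def solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def triangulate(projections: Sequence[Matrix],
                points_2d: Sequence[Tuple[float, float]]) -> List[float]:
    """Least-squares 3D position of one part observed in two or more views."""
    rows, rhs = [], []
    for h, (u, v) in zip(projections, points_2d):
        for coord, k in ((u, 0), (v, 1)):
            # From coord * (h[2] . X) - (h[k] . X) = 0 with X = [X, Y, Z, 1]
            rows.append([coord * h[2][j] - h[k][j] for j in range(3)])
            rhs.append(h[k][3] - coord * h[2][3])
    # Normal equations: (A^T A) x = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)
```

Applying `triangulate` to each recognized part in turn would yield the group of three-dimensional positions that the text calls three-dimensional posture information.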

As shown in A of FIG. 7, the generation unit 13 projects the three-dimensional posture information of the object 90 onto the image 40n based on the camera parameters of the camera 30n that captured the image 40n included in the multi-view image, and calculates refined two-dimensional posture information of the object 90 in the image 40n. Specifically, let the two-dimensional posture information corresponding to the three-dimensional posture information {[P_X¹, P_Y¹, P_Z¹], [P_X², P_Y², P_Z²], ..., [P_Xⁿ, P_Yⁿ, P_Zⁿ]} be {[p_x¹, p_y¹], [p_x², p_y²], ..., [p_xⁿ, p_yⁿ]}. In this case, the generation unit 13 calculates the two-dimensional posture information using equation (1) below. Note that in equation (1), H is a three-dimensional to two-dimensional projection matrix determined from the camera parameters of the camera 30n.
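Equation (1) is not reproduced in this text. A form consistent with the description, assuming that H is a 3×4 matrix acting on homogeneous coordinates and that s^i is a per-point scale factor (both are assumptions, not taken from the source), would be:

```latex
s^{i}
\begin{bmatrix} p_x^{i} \\ p_y^{i} \\ 1 \end{bmatrix}
= H
\begin{bmatrix} P_X^{i} \\ P_Y^{i} \\ P_Z^{i} \\ 1 \end{bmatrix},
\qquad i = 1, \dots, n
```

Dividing the first two components of the right-hand side by the third gives the image coordinates [p_x^i, p_y^i] of the i-th part.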

As shown in B of FIG. 7, the generation unit 13 generates a pseudo label 44n indicating the area of the target object 90 based on the calculated two-dimensional posture information. Specifically, as shown in equation (2) below, the generation unit 13 uses the maximum and minimum values of the two-dimensional coordinates of the points included in the two-dimensional posture information to calculate the coordinates [x₁, y₁] of the upper-left point and the coordinates [x₂, y₂] of the lower-right point of the pseudo label 44n.
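Equation (2) is likewise not reproduced in this text. A form consistent with the surrounding description, assuming image coordinates whose origin is at the upper left so that the upper-left point has the minimum coordinates (an assumption), would be:

```latex
\begin{aligned}
w &= \max_i p_x^{i} - \min_i p_x^{i}, &\quad h &= \max_i p_y^{i} - \min_i p_y^{i},\\
x_1 &= \min_i p_x^{i} - \alpha w, &\quad y_1 &= \min_i p_y^{i} - \alpha h,\\
x_2 &= \max_i p_x^{i} + \alpha w, &\quad y_2 &= \max_i p_y^{i} + \alpha h.
\end{aligned}
```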

Note that w and h in equation (2) are the width and height of the circumscribed rectangle of the object 90 indicated by the calculated two-dimensional posture information. In calculating the coordinates [x₁, y₁] and [x₂, y₂], the value obtained by multiplying the width w or the height h by a constant α (for example, α = 0.05) is subtracted or added, so that the area obtained by adding a predetermined margin to the w×h circumscribed rectangle is calculated as the range of the pseudo label 44n. Note that the margin is not limited to the value obtained by multiplying the width w or the height h by the constant α. The range of the pseudo label 44n may instead be the w×h range extended vertically and horizontally by a predetermined number of pixels (for example, 5 pixels).
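The reprojection and the margin calculation described above can be sketched as follows. This is only an illustration: the 3×4 matrix layout, the homogeneous-coordinate convention, and the names `project` and `pseudo_label` are assumptions that do not appear in the source.

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def project(h: Sequence[Sequence[float]], p: Sequence[float]) -> Point2D:
    """Project one 3D point with an assumed 3x4 projection matrix h."""
    x, y, z = p
    u, v, s = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in h)
    return (u / s, v / s)


def pseudo_label(points: Sequence[Point2D], alpha: float = 0.05) -> Box:
    """Circumscribed rectangle of the 2D points, enlarged by alpha*w and alpha*h."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(xs) - min(xs)  # width of the circumscribed rectangle
    h = max(ys) - min(ys)  # height of the circumscribed rectangle
    return (min(xs) - alpha * w, min(ys) - alpha * h,
            max(xs) + alpha * w, max(ys) + alpha * h)
```

For example, `pseudo_label([project(H, P) for P in joints_3d])` would give the pseudo label range for one image, where `H` and `joints_3d` are hypothetical inputs holding that camera's projection matrix and the estimated three-dimensional posture information.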

In this way, three-dimensional posture information is estimated from the two-dimensional posture information of each image 40n, and the three-dimensional posture information is reprojected onto each image 40n to calculate refined two-dimensional posture information. This makes it possible to improve the generation accuracy of pseudo labels. For example, as shown in FIG. 8, assume that the multi-view image includes images 400, 401, and 402, and that the estimation unit 12 detects 2D-BBOXes 420 and 421 from images 400 and 401 but does not detect a 2D-BBOX 422 from image 402. Even in this case, the generation unit 13 can generate the pseudo label 442 for the image 402 by reprojecting the three-dimensional posture information onto the image 402. That is, false-negative pseudo labels can be reduced.

In addition, by reprojecting the three-dimensional posture information onto the images 400 and 401 and generating pseudo labels 440 and 441, the generation unit 13 can correct the positional bias occurring in the 2D-BBOXes 420 and 421.

The selection unit 14 selects a pseudo label to be used for machine learning of the detector 22 from the pseudo labels 44n generated by the generation unit 13 based on spatial and temporal restrictions.

Specifically, the selection unit 14 selects a pseudo label 44n when the position of the object 90 in three-dimensional space (hereinafter referred to as the "three-dimensional position"), indicated by the three-dimensional posture information that was the projection source when the pseudo label 44n was generated, is included in a predetermined range. For example, if the target object 90 is a gymnast, the predetermined range may be a competition area that depends on the event. More specifically, in the case of an event that uses equipment, a predetermined range including the equipment may be defined as the competition area, and in the case of the floor exercise, a predetermined range including the prescribed performance range may be defined as the competition area.

For example, as shown in FIG. 9, it is assumed that the three-dimensional position 46A from which the pseudo label 440A generated from the image 400 and the pseudo label 441A generated from the image 401 are projected is within the competition area. In this case, the selection unit 14 selects the pseudo labels 440A and 441A as the pseudo labels 44n used for machine learning. On the other hand, it is assumed that the three-dimensional position 46B from which the pseudo label 440B generated from the image 400 and the pseudo label 441B generated from the image 401 are projected is outside the competition area. In this case, the selection unit 14 excludes the pseudo labels 440B and 441B from the pseudo labels 44n used for machine learning. As a result, in the case where an assistant other than a player, a referee, etc. is erroneously detected, it is possible to exclude the pseudo labels 44n generated for those persons.
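A minimal sketch of this spatial check is shown below. The source does not say how the three-dimensional position is derived from the three-dimensional posture information or how the competition area is represented; the centroid of the parts and an axis-aligned box given by two corners are assumptions made here for illustration, as are the function names.

```python
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]
Area3D = Tuple[Point3D, Point3D]  # hypothetical format: (min corner, max corner)


def centroid(joints: Sequence[Point3D]) -> Point3D:
    """Mean of the 3D part positions, used here as the object's 3D position."""
    n = len(joints)
    return (sum(j[0] for j in joints) / n,
            sum(j[1] for j in joints) / n,
            sum(j[2] for j in joints) / n)


def in_competition_area(joints: Sequence[Point3D], area: Area3D) -> bool:
    """True if the assumed 3D position lies inside the axis-aligned area."""
    c = centroid(joints)
    lo, hi = area
    return all(lo[k] <= c[k] <= hi[k] for k in range(3))
```

Pseudo labels whose projection source fails this check, such as those for positions like 46B, would be excluded from machine learning.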

Furthermore, when the photographing time of the image for which the pseudo label 44n has been generated is included in a predetermined time range, the selection unit 14 selects the pseudo label 44n as the pseudo label 44n to be used for machine learning. For example, if the object 90 is a gymnast, the predetermined time range may be a time range corresponding to the time from the start to the end of the performance.

More specifically, as shown in FIG. 10, the selection unit 14 identifies, from the frames of a series of time-series multi-view images, a start frame corresponding to the start of the performance and an end frame corresponding to the end of the performance. In the case of an event that uses equipment, for example, the selection unit 14 specifies, as the start frame, a frame that is a predetermined number of frames before the moment when the athlete has entered the competition area and the athlete's feet first leave the floor. Furthermore, the selection unit 14 specifies, as the end frame, a frame that is a predetermined number of frames before the athlete leaves the competition area. Then, with the period from the start frame to the end frame as the target time, the selection unit 14 selects the pseudo labels 44n generated from frames (images 40n) included in the target time. On the other hand, the selection unit 14 excludes pseudo labels 44n generated from non-target frames outside the target time. As a result, pseudo labels 44n based on the posture of an athlete who is simply standing before the start of a performance can be excluded, and well-balanced pseudo labels 44n that correspond to the variety of postures of the athlete can be selected.
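The time-based selection can be sketched as a simple filter over frame indices. The `PseudoLabel` record and the function `select_by_time` are hypothetical; how the start and end frames are identified is outside the scope of this sketch and they are passed in as given values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PseudoLabel:
    frame: int                              # index of the frame the label came from
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def select_by_time(labels: List[PseudoLabel],
                   start_frame: int, end_frame: int) -> List[PseudoLabel]:
    """Keep only labels whose frame lies within [start_frame, end_frame]."""
    return [lab for lab in labels if start_frame <= lab.frame <= end_frame]
```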

Furthermore, the selection unit 14 evaluates the quality of the generated pseudo label 44n, and selects it as a pseudo label 44n to be used for machine learning of the detector 22 if the evaluation result satisfies the criteria. Specifically, the selection unit 14 calculates the degree of overlap between the 2D-BBOX 42n detected by the estimation unit 12 using the detector 22 and the pseudo label 44n generated by the generation unit 13 based on that 2D-BBOX 42n. The degree of overlap may be, for example, the area of the overlapping portion divided by the area of the pseudo label 44n. As shown in FIG. 11, the selection unit 14 selects pseudo labels 44n whose degree of overlap is greater than or equal to a predetermined threshold, and excludes pseudo labels 44n whose degree of overlap is less than the threshold.

Further, the selection unit 14 may present the pseudo labels 44n whose degree of overlap is less than the threshold to the user, accept the user's decision to adopt or reject each of them, and select the pseudo labels 44n adopted by the user as pseudo labels 44n to be used for machine learning of the detector 22. As a result, compared to the case where the user decides whether to adopt every generated pseudo label 44n, the user only needs to decide on the pseudo labels 44n that do not meet the criteria, so the burden on the user can be reduced.
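The degree of overlap given as an example above (overlapping area divided by the area of the pseudo label) can be computed as in the following sketch. Boxes are assumed to be (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2, and the function name and the threshold value in the usage note are illustrative only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def overlap_degree(bbox: Box, label: Box) -> float:
    """Area of the intersection of bbox and label, divided by the area of label."""
    inter_w = max(0.0, min(bbox[2], label[2]) - max(bbox[0], label[0]))
    inter_h = max(0.0, min(bbox[3], label[3]) - max(bbox[1], label[1]))
    label_area = (label[2] - label[0]) * (label[3] - label[1])
    if label_area <= 0:
        return 0.0  # degenerate label: treat as no overlap
    return (inter_w * inter_h) / label_area
```

A pseudo label would then be kept when `overlap_degree(bbox, label) >= threshold`, with `threshold` chosen in advance (the source gives no specific value).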

The machine learning unit 15 executes machine learning of the detector 22 using the pseudo-labeled image obtained by adding the pseudo label 44n selected by the selection unit 14 to the image 40n and the correct-answered image as training data. The machine learning unit 15 causes the acquisition unit 11, the estimation unit 12, the generation unit 13, and the selection unit 14 to repeatedly execute the processing, and repeatedly executes machine learning of the detector 22 using the obtained pseudo label 44n. By repeating the process, the number of images with pseudo labels increases, so the detection accuracy of the 2D-BBOX 42n by the detector 22 improves, and the generation accuracy of the pseudo labels 44n also improves. Further, in the repeated processing, the selection unit 14 uses only the pseudo labels 44n whose quality evaluation results meet the standards, thereby further improving the detection accuracy of the 2D-BBOX 42n by the detector 22.

The information processing device 10 may be realized, for example, by a computer 50 shown in FIG. 12. The computer 50 includes a CPU (Central Processing Unit) 51, a memory 52 as a temporary storage area, and a nonvolatile storage device 53. The computer 50 also includes an input/output device 54 such as an input device and a display device, and an R/W (Read/Write) device 55 that controls reading and writing of data to and from a storage medium 59. The computer 50 also includes a communication I/F (Interface) 56 connected to a network such as the Internet. The CPU 51, memory 52, storage device 53, input/output device 54, R/W device 55, and communication I/F 56 are connected to each other via a bus 57.

The storage device 53 is, for example, an HDD (Hard Disk Drive), an SSD (Solid State Drive), a flash memory, or the like. An information processing program 60 for causing the computer 50 to function as the information processing device 10 is stored in the storage device 53 as a storage medium. The information processing program 60 includes an acquisition process control instruction 61 , an estimation process control instruction 62 , a generation process control instruction 63 , a selection process control instruction 64 , and a machine learning process control instruction 65 . Furthermore, the storage device 53 has an information storage area 70 in which information constituting the detector 22 and camera parameter DB 24 is stored.

The CPU 51 reads the information processing program 60 from the storage device 53, expands it onto the memory 52, and sequentially executes control commands included in the information processing program 60. The CPU 51 operates as the acquisition unit 11 shown in FIG. 4 by executing the acquisition process control instruction 61. Further, the CPU 51 operates as the estimation unit 12 shown in FIG. 4 by executing the estimation process control instruction 62. Further, the CPU 51 operates as the generation unit 13 shown in FIG. 4 by executing the generation process control instruction 63. Further, the CPU 51 operates as the selection unit 14 shown in FIG. 4 by executing the selection process control instruction 64. Further, the CPU 51 operates as the machine learning section 15 shown in FIG. 4 by executing the machine learning process control instruction 65. Further, the CPU 51 reads information from the information storage area 70 and develops the detector 22 and camera parameter DB 24 in the memory 52. Thereby, the computer 50 that has executed the information processing program 60 functions as the information processing device 10. Note that the CPU 51 that executes the program is hardware.

Note that the functions realized by the information processing program 60 may be realized by, for example, a semiconductor integrated circuit, more specifically, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or the like.

Next, the operation of the information processing device 10 according to this embodiment will be explained. When time-series multi-view images are input to the information processing device 10 and the detector 22 is instructed to perform machine learning, the information processing device 10 executes the information processing shown in FIG. 13. Note that the information processing is an example of an information processing method of the disclosed technology.

In step S11, the acquisition unit 11 acquires a plurality of time-series multi-view images. Next, in step S12, the estimation unit 12 uses the detector 22 to detect the 2D-BBOX 42n indicating the area of the target object 90 from each image 40n included in the multi-view image. Then, the estimation unit 12 estimates two-dimensional posture information of the object 90 from the detected 2D-BBOX 42n using the recognition model. Next, in step S13, the estimation unit 12 uses the camera parameters of the camera 30n stored in the camera parameter DB 24 and the estimated two-dimensional posture information of the object 90 to estimate three-dimensional posture information of the object 90 by triangulation.

Next, in step S14, the generation unit 13 projects the three-dimensional posture information of the object 90 onto each image 40n based on the camera parameters of the camera 30n that captured each image 40n, and calculates two-dimensional posture information of the object 90. Then, the generation unit 13 generates a pseudo label 44n based on the calculated two-dimensional posture information.

Next, in step S15, the selection unit 14 selects pseudo labels to be used for machine learning of the detector 22 from the pseudo labels 44n generated in step S14, based on spatiotemporal restrictions. Specifically, the selection unit 14 selects a pseudo label 44n when the three-dimensional position of the object 90, indicated by the three-dimensional posture information that was the projection source when the pseudo label 44n was generated, is included in a predetermined range. Further, the selection unit 14 selects a pseudo label 44n when the photographing time of the image for which the pseudo label 44n was generated is included in a predetermined time range.

Next, in step S16, the selection unit 14 evaluates the quality of the pseudo labels 44n selected in step S15 and, if the evaluation result satisfies the criteria, selects them as pseudo labels 44n to be used for machine learning of the detector 22. Next, in step S17, the machine learning unit 15 executes machine learning of the detector 22 using, as training data, the pseudo-labeled images obtained by adding the pseudo labels 44n selected in step S16 to the images 40n, together with the images with correct labels.

Next, in step S18, the machine learning unit 15 determines whether the end condition of the machine learning of the detector 22 is satisfied. For example, the machine learning unit 15 determines that the end condition is satisfied when the number of repetitions reaches a predetermined number, when the detection accuracy of the detector 22 reaches a predetermined value, or when the detection accuracy of the detector 22 converges. If the end condition is not satisfied, the process returns to step S11, and if the end condition is satisfied, the information processing ends.
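The control flow of steps S11 to S18 can be summarized by the following sketch. It is a skeleton only: the three callables stand in for the processing of the acquisition, estimation, generation, selection, and machine learning units, and their names, the `max_rounds` and `target_accuracy` parameters, and their default values are assumptions rather than details given in the source. Only two of the example end conditions (repetition count and accuracy threshold) are modeled.

```python
from typing import Any, Callable, List


def run_pseudo_label_training(
    generate_labels: Callable[[Any], List[Any]],  # stands in for steps S11 to S16
    train: Callable[[Any, List[Any]], Any],       # stands in for step S17
    evaluate: Callable[[Any], float],             # detection accuracy used in step S18
    detector: Any,
    max_rounds: int = 10,
    target_accuracy: float = 0.95,
) -> Any:
    """Repeat pseudo-label generation and training until an end condition holds."""
    for _ in range(max_rounds):                    # end condition: repetition count
        labels = generate_labels(detector)         # detect, triangulate, reproject, select
        detector = train(detector, labels)         # retrain with pseudo-labeled images
        if evaluate(detector) >= target_accuracy:  # end condition: accuracy reached
            break
    return detector
```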

As described above, the information processing apparatus according to the present embodiment determines the three-dimensional position of the object based on the two-dimensional position information of the object detected from each image included in the multi-view image and the camera parameters. Estimate information. Then, the information processing device projects the three-dimensional position information of the object onto each image based on the camera parameters, and calculates refined two-dimensional position information of the object. Thereby, the positional information of the object in the image can be calculated with high accuracy. Furthermore, by generating a pseudo label based on this two-dimensional position information, it is possible to reduce false negatives of the pseudo label and correct the bias in the position of the pseudo label.

Furthermore, the information processing device according to the present embodiment selects a pseudo label to be used for machine learning of a detector from generated pseudo labels based on spatio-temporal restrictions, thereby reducing the diversity of poses of the target object. It is possible to generate well-balanced pseudo-labels according to the

Here, FIG. 14 shows an example of a pseudo label generation result by the information processing device according to the present embodiment. The left three diagrams in FIG. 14 schematically show an example of the detection results obtained by the method of detecting 2D-BBOX using the detector before applying semi-supervised learning in this embodiment (hereinafter referred to as the "comparison method"). FIG. In addition, the three diagrams on the right in FIG. 14 schematically show an example of detection results obtained by a method of detecting 2D-BBOX using a detector applying semi-supervised learning in this embodiment (hereinafter referred to as "this method"). FIG.

As shown in the upper diagrams of FIG. 14, the 2D-BBOX that was inaccurate in the comparison method is improved by this method. As shown in the middle diagrams of FIG. 14, the 2D-BBOX that was missing in the comparison method is detected by this method. In addition, as shown in the lower diagrams of FIG. 14, the comparison method erroneously detected a 2D-BBOX indicating a person other than the player, who is the intended target, whereas this method eliminates the erroneous detection.

Furthermore, the information processing device according to the above embodiment can be applied to, for example, a scoring system for gymnastics competitions. Here, with reference to FIG. 15, an overview of the processing of the gymnastics scoring system will be described.

When a multi-view image is input, the scoring system detects a region of a person from each image included in the multi-view image. Next, the scoring system determines whether the person indicated by the detected area is a player or a non-player based on whether the position where the person is present is in the competition area, etc., and identifies the area indicating the player. The scoring system tracks players by associating regions representing the same player in time-series multi-view images. The scoring system recognizes the player's two-dimensional skeletal information from each of the series of tracked images using a recognition model or the like. The scoring system estimates three-dimensional skeletal information from two-dimensional skeletal information using camera parameters. Then, the scoring system performs post-processing such as smoothing on the time-series three-dimensional skeletal information, estimates the phases (breaks) of the performance, and then recognizes the techniques.

In the scoring system described above, a detector trained by machine learning using the pseudo labels generated by the information processing device according to the above embodiment can be applied to the process of detecting a region of a person.

Note that in the above embodiment, a case has been described in which the three-dimensional posture information, which is the estimated three-dimensional position information, is projected onto all images included in the multi-view image, but the disclosed technology is not limited to this. The three-dimensional posture information may be projected onto at least one of the images included in the multi-view image, for example by targeting an image in which no 2D-BBOX was detected by the detector.

Furthermore, the disclosed technology is not limited to cases where the object is a gymnast, but can be applied to various people such as athletes of other sports and ordinary pedestrians. Furthermore, it is also possible to apply the present invention to objects other than people, such as animals and vehicles.

Furthermore, in the above embodiment, the information processing program is stored (installed) in the storage device in advance, but the disclosed technology is not limited to this. The program according to the disclosed technology may be provided in a form stored in a storage medium such as a CD-ROM, DVD-ROM, or USB memory.

10 Information processing device
11 Acquisition unit
12 Estimation unit
13 Generation unit
14 Selection unit
15 Machine learning unit
22 Detector
24 Camera parameter DB
30n Camera
40n Image
42n 2D-BBOX
44n Pseudo label
50 Computer
51 CPU
52 Memory
53 Storage device
54 Input/output device
55 R/W device
56 Communication I/F
57 Bus
59 Storage medium
60 Information processing program
61 Acquisition process control instruction
62 Estimation process control instruction
63 Generation process control instruction
64 Selection process control instruction
65 Machine learning process control instruction
70 Information storage area
90 Object

Claims

1. An information processing method in which a computer executes processing comprising:
acquiring a plurality of images captured by respective ones of a plurality of cameras that photograph a target object from a plurality of different viewpoints;
estimating three-dimensional position information of the target object based on two-dimensional position information of the target object detected from each of the plurality of images and camera parameters of each of the plurality of cameras; and
projecting the three-dimensional position information of the target object onto at least one image of the plurality of images based on the camera parameters of the camera that captured the at least one image, and calculating two-dimensional position information of the target object in the at least one image.
2. The information processing method according to claim 1, wherein the process of calculating the two-dimensional position information of the target object includes calculating information indicating a region of the target object.
3. The information processing method according to claim 1, wherein, when the target object is a person, the process of estimating the three-dimensional position information of the target object includes detecting two-dimensional posture information of the target object from each of the plurality of images as the two-dimensional position information of the target object, and estimating three-dimensional posture information of the target object based on the two-dimensional posture information of the target object.
4. The information processing method according to claim 1 or 2, wherein the computer further executes processing comprising executing machine learning of a machine learning model for detecting the two-dimensional position information of the target object from an image, using, as training data, an image to which the calculated two-dimensional position information of the target object is attached as a pseudo label.
5. The information processing method according to claim 4, wherein, among the pseudo labels, an image to which a pseudo label is attached for which the three-dimensional position information of the target object serving as the projection source is included in a predetermined range in a three-dimensional space is used as the training data.
6. The information processing method according to claim 5, wherein, when the target object is a gymnast, the predetermined range is a competition area according to a competition event.
7. The information processing method according to claim 4, wherein, among the pseudo labels, an image to which a pseudo label is attached and whose capture time is included in a predetermined time range is used as the training data.
8. The information processing method according to claim 7, wherein, when the target object is a gymnast, the predetermined time range is a time range corresponding to the period from the start to the end of a performance.
9. The information processing method according to claim 4, wherein the generation of the pseudo label and the machine learning of the machine learning model using the image to which the pseudo label is attached as training data are repeatedly executed.
10. The information processing method according to claim 9, wherein an image to which a pseudo label is attached is excluded from the training data when the degree of overlap between the region indicated by the detected two-dimensional position information of the target object and the region indicated by the generated pseudo label is less than a predetermined threshold.
11. An information processing device comprising:
an acquisition unit that acquires a plurality of images captured by respective ones of a plurality of cameras that photograph a target object from a plurality of different viewpoints;
an estimation unit that estimates three-dimensional position information of the target object based on two-dimensional position information of the target object detected from each of the plurality of images and camera parameters of each of the plurality of cameras; and
a calculation unit that projects the three-dimensional position information of the target object onto at least one image of the plurality of images based on the camera parameters of the camera that captured the at least one image, and calculates two-dimensional position information of the target object in the at least one image.
12. The information processing device according to claim 11, wherein the calculation unit calculates information indicating a region of the target object as the two-dimensional position information of the target object.
13. The information processing device according to claim 11 or 12, wherein, when the target object is a person, in estimating the three-dimensional position information of the target object, three-dimensional posture information of the target object is estimated based on two-dimensional posture information of the target object detected from each of the plurality of images as the two-dimensional position information of the target object.
14. The information processing device according to claim 11 or 12, further comprising a machine learning unit that executes machine learning of a machine learning model for detecting the two-dimensional position information of the target object from an image, using, as training data, an image to which the calculated two-dimensional position information of the target object is attached as a pseudo label.
15. The information processing device according to claim 14, wherein the machine learning unit uses, as the training data, among the pseudo labels, an image to which a pseudo label is attached for which the three-dimensional position information of the target object serving as the projection source is included in a predetermined range in a three-dimensional space.
16. The information processing device according to claim 15, wherein, when the target object is a gymnast, the predetermined range is a competition area according to a competition event.
17. The information processing device according to claim 14, wherein the machine learning unit uses, as the training data, among the pseudo labels, an image to which a pseudo label is attached and whose capture time is included in a predetermined time range.
18. The information processing device according to claim 17, wherein, when the target object is a gymnast, the predetermined time range is a time range corresponding to the period from the start to the end of a performance.
19. The information processing device according to claim 14, wherein the generation of the pseudo label and the machine learning of the machine learning model using the image to which the pseudo label is attached as training data are repeatedly executed.
20. An information processing program that causes a computer to execute processing comprising:
acquiring a plurality of images captured by respective ones of a plurality of cameras that photograph a target object from a plurality of different viewpoints;
estimating three-dimensional position information of the target object based on two-dimensional position information of the target object detected from each of the plurality of images and camera parameters of each of the plurality of cameras; and
projecting the three-dimensional position information of the target object onto at least one image of the plurality of images based on the camera parameters of the camera that captured the at least one image, and calculating two-dimensional position information of the target object in the at least one image.