WO2023175727A1 - Information processing system, information processing method, and program


Info

Publication number
WO2023175727A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
output
training data
model
reliability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/011645
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
祥悟 佐藤
徹悟 稲田
博之 勢川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc
Priority to JP2024507259A (JP7724361B2)
Priority to PCT/JP2022/011645 (WO2023175727A1)
Priority to US18/841,737 (US20250165862A1)
Publication of WO2023175727A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Definitions

  • the present invention relates to an information processing system, an information processing method, and a program.
  • a large amount of training data is required to train a machine learning model, but preparing that data takes a lot of effort. On the other hand, if the amount of training data is reduced, the accuracy of the machine learning model may not be guaranteed.
  • the present invention has been made in view of the above circumstances, and its purpose is to provide a technology that improves the accuracy of machine learning models while suppressing the effort required to prepare training data.
  • an information processing system according to the present invention includes:
  • a reliability output means for outputting the reliability of the output of a machine learning model, trained using training data, with respect to input data, based on the output of the machine learning model when the input data is input;
  • a generation means for generating new training data based on the input data when the reliability satisfies a predetermined condition; and
  • a learning control means for training the machine learning model using the new training data.
  • the information processing system may further include a trained estimation model that outputs an estimation result based on the output of the machine learning model, and the reliability output means may output, based on the output of the estimation model, the reliability of the output of the machine learning model with respect to the input data.
  • the input data may include an image of the target object, the estimation model may output, based on the output of the machine learning model, an image indicating key points for estimating the pose of the target object, and the reliability output means may output the reliability based on that image.
  • the estimation model may output an image in which each point indicates a positional relationship with a key point, and the reliability output means may output the reliability based on the variation among a plurality of key point position candidates, each generated from mutually different points included in the image output by the estimation model.
  • the reliability output means may output the reliability based on information indicating the difference between the output of the estimation model when an input image in which the target object is photographed is input to the estimation model and the output of the estimation model when a processed image, obtained by applying a predetermined processing to the input image, is input to the estimation model.
  • the machine learning model may output information indicating whether the input data includes the target object.
  • the estimation model may be trained using estimation training data, the generation means may generate new estimation training data based on the input data when the reliability satisfies the predetermined condition, and the learning control means may train the machine learning model using the new training data.
  • the input data may include an image in which the target object is photographed, the machine learning model may output an image indicating key points for pose estimation of the target object based on the input data, and the reliability output means may output the reliability based on that image.
  • the training data may include a plurality of learning images rendered from a three-dimensional shape model and correct images each of which is correct data for the learning image.
  • the generation means may generate new training data including a first additional image, obtained by processing the input data with a first processing, and a second additional image, obtained by processing the input data with a second processing different from the first processing, and the learning control means may train the machine learning model based on the difference between the output when the first additional image is input to the machine learning model and the output when the second additional image is input to the machine learning model.
  • the information processing method outputs the reliability of the output with respect to the input data based on the output of the machine learning model when the input data is input to the machine learning model learned by training data. If the reliability satisfies a predetermined condition, the method includes the steps of: generating new training data based on the input data; and learning a machine learning model using the new training data.
  • the program according to the present invention causes a computer to execute a process of: outputting the reliability of the output of a machine learning model, trained using training data, with respect to input data, based on the output of the machine learning model when the input data is input; generating new training data based on the input data when the reliability satisfies a predetermined condition; and training the machine learning model using the new training data.
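  • The three steps of the method (output a reliability, generate new training data when a condition holds, retrain) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the callables passed in, and the choice of "reliability below a threshold" as the predetermined condition are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def collect_new_training_data(
    inputs: Sequence[object],
    reliability_of: Callable[[object], float],
    generate: Callable[[object], List[object]],
    threshold: float = 0.5,
) -> List[object]:
    """Generate new training samples from inputs whose reliability is low."""
    new_data: List[object] = []
    for item in inputs:
        # Assumed condition: a reliability below the threshold triggers generation.
        if reliability_of(item) < threshold:
            new_data.extend(generate(item))
    return new_data


def retrain_if_needed(
    train: Callable[[List[object]], None],
    training_data: List[object],
    inputs: Sequence[object],
    reliability_of: Callable[[object], float],
    generate: Callable[[object], List[object]],
    threshold: float = 0.5,
) -> List[object]:
    """Retrain only when new training data was generated; return the data in use."""
    new_data = collect_new_training_data(inputs, reliability_of, generate, threshold)
    if new_data:
        training_data = list(training_data) + new_data
        train(training_data)
    return training_data
```

  • Passing the model-specific parts in as callables keeps the control flow independent of whether the identification model or the estimation model is being retrained.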
  • FIG. 1 is a diagram showing an example of the configuration of an information processing system according to an embodiment of the present invention.
  • FIG. 2 is a functional block diagram showing an example of functions implemented in an information processing system according to an embodiment of the present invention.
  • FIG. 3 is a diagram showing an example of an input image.
  • FIG. 4 is a diagram illustrating an example of key points of a target object.
  • FIG. 5 is a diagram schematically showing an example of a position image in a target area.
  • FIG. 6 is a flow diagram mainly showing an example of processing by a target area acquisition unit and a posture estimation unit.
  • FIG. 7 is a diagram illustrating the posture of a detected target object.
  • FIG. 2 is a flowchart schematically explaining learning of a discrimination model and an estimation model.
  • FIG. 2 is a flow diagram illustrating an example of a process for generating initial training data.
  • FIG. 3 is a diagram illustrating photographing a target object.
  • FIG. 3 is a flow diagram illustrating
  • This information processing system includes a machine learning model that determines whether at least a portion of a captured image contains an object, and a machine learning model that outputs information indicating the estimated pose of the object from the image containing the object. Furthermore, the information processing system is configured to complete its training in a short time. The required time is assumed to be, for example, several tens of seconds to grasp and rotate the object and several minutes for machine learning.
  • FIG. 1 is a diagram showing an example of the configuration of an information processing system according to an embodiment of the present invention.
  • the information processing system includes an information processing device 10.
  • the information processing device 10 is, for example, a computer such as a game console or a personal computer.
  • the information processing device 10 includes, for example, a processor 11, a storage unit 12, a communication unit 14, an operation unit 16, a display unit 18, and a photographing unit 20.
  • the information processing system may be composed of one information processing device 10 or may be composed of a plurality of devices including the information processing device 10.
  • the processor 11 is, for example, a program-controlled device such as a CPU that operates according to a program installed in the information processing device 10.
  • the storage unit 12 is made up of at least a portion of a storage element such as ROM or RAM, or an external storage device such as a solid state drive.
  • the storage unit 12 stores programs executed by the processor 11 and the like.
  • the communication unit 14 is a communication interface for wired or wireless communication, such as a network interface card, and exchanges data with other computers and terminals via a computer network such as the Internet.
  • the operation unit 16 is, for example, an input device such as a keyboard, a mouse, a touch panel, a game console controller, etc., and receives a user's operation input and outputs a signal indicating the content to the processor 11.
  • the display unit 18 is a display device such as a liquid crystal display, and displays various images according to instructions from the processor 11.
  • the display unit 18 may be a device that outputs a video signal to an external display device.
  • the photographing unit 20 is a photographing device such as a digital camera.
  • the photographing unit 20 according to the present embodiment is, for example, a camera capable of photographing moving images.
  • the photographing unit 20 may be a camera capable of acquiring visible RGB images.
  • the photographing unit 20 may be a camera capable of acquiring a visible RGB image and depth information synchronized with the RGB image.
  • the photographing unit 20 may be located outside the information processing device 10, and in this case, the information processing device 10 and the photographing unit 20 may be connected via the communication unit 14 or an input/output unit described below.
  • the information processing device 10 may include audio input/output devices such as a microphone and a speaker.
  • the information processing device 10 may also include, for example, a communication interface such as a network board, an optical disk drive that reads optical disks such as DVD-ROM and Blu-ray (registered trademark) disks, and an input/output unit, such as a USB (Universal Serial Bus) port, for exchanging data with external devices.
  • FIG. 2 is a functional block diagram showing an example of functions implemented in an information processing system according to an embodiment of the present invention.
  • the information processing system functionally includes a target area acquisition unit 21, a posture estimation unit 25, a captured image acquisition unit 33, an identification training data generation unit 34, an identification learning unit 35, a shape model acquisition unit 36, an estimation training data generation unit 37, an estimation learning unit 38, and a reliability acquisition unit 39.
  • the target area acquisition unit 21 functionally includes a region extraction unit 22, a feature extraction unit 23, and an identification model 24.
  • the posture estimation unit 25 functionally includes an estimation model 26, a key point determination unit 27, and a posture calculation unit 28. Both the identification model 24 and the estimation model 26 are types of machine learning models.
  • These functions are mainly implemented by the processor 11 and the storage unit 12. More specifically, these functions may be implemented by having the processor 11 execute a program installed in the information processing device 10, which is a computer, and including execution instructions corresponding to the above functions. Further, this program may be supplied to the information processing apparatus 10 via a computer-readable information storage medium such as an optical disk, a magnetic disk, or a flash memory, or via the Internet.
  • the target area acquisition unit 21 acquires the input image photographed by the photographing unit 20, and determines whether each of the one or more candidate areas 56 (see FIG. 3) included in the acquired input image contains an image of the target object 51.
  • the region extracting section 22 extracts the one or more candidate regions 56, and the feature extracting section 23 extracts a feature amount representing the feature of the image from each of the candidate regions 56.
  • the identification model 24 receives the feature amount extracted from the image of the candidate area 56 as input and outputs information indicating whether the candidate area 56 includes the image of the target object 51.
  • the target area acquisition unit 21 acquires the target area 55 including the image of the target object 51 extracted from the input image.
  • the target object 51 is an object whose orientation is to be estimated in the information processing device 10.
  • the target object 51 is the target of prior learning.
  • FIG. 3 is a diagram showing an example of an input image.
  • the target object 51 is a power tool, and unless otherwise specified in the subsequent figures, examples of the target object 51 are assumed to be power tools.
  • the input image is photographed by the photographing unit 20, and the target area 55 is a rectangular area including the target object 51 and its vicinity. Note that in the process of acquiring the target area 55, one or more candidate areas 56 including areas that do not include the target object 51 are also extracted as candidates for the area including the target object 51.
  • the region extraction unit 22 extracts, from the input image, an image of a candidate region 56 that is a target of determination by the identification model 24. More specifically, the region extraction unit 22 identifies one or more candidate regions 56 in which some object has been photographed from the input image using the well-known Region Proposal technique, and extracts an image of each of the one or more candidate regions 56.
  • the identification model 24 is a machine learning model, and is trained using training data. When input data is input, the learned identification model 24 outputs data as a result of identification.
  • the input data input to the identification model 24 is information indicating an image of the candidate area 56, and is, for example, a feature quantity extracted from the image by the feature extraction unit 23. Further, when input data is input, the identification model 24 outputs information indicating whether the image of the candidate area 56 includes the image of the target object 51 or not.
  • the training data for the identification model 24 includes data indicating each of learning images including a plurality of positive example images including an image in which the target object 51 is photographed and a plurality of negative example images not including the target object 51. Details of the identification model 24 and its learning will be described later.
  • Each of the learning images may be an image of a region in which the target object 51 is present among the captured images. The region may be extracted using the same method as the region extraction unit 22. Note that the identification model 24 is trained not only by the above-mentioned training data but also by additional training data.
  • the image of the candidate area 56 may be directly input to the identification model 24 without going through the feature extraction unit 23.
  • the region extraction unit 22 may not be present, although there is a risk that accuracy may be reduced.
  • the feature extraction unit 23 may extract features from the input image itself, and the identification model 24 may determine whether the target object 51 exists in the input image, or the input image may be input directly to the identification model 24.
  • the posture estimation unit 25 estimates the posture of the target object 51 based on information output when the target region 55 is input to the estimation model 26.
  • the estimation model 26 is a machine learning model, and is trained using training data. When input data is input, the learned estimation model 26 outputs data as an estimation result.
  • the training data includes a plurality of learning images rendered by the three-dimensional shape model of the target object 51 and correct data that is information regarding the posture of the target object 51 in the learning images.
  • Information indicating the image of the target area 55 is input to the trained estimation model 26, and the estimation model 26 outputs information indicating the position of a key point for estimating the pose of the target object.
  • the target area 55 is an image based on a candidate area 56 selected based on the output of the identification model 24.
  • the training data for the estimation model 26 includes a plurality of learning images rendered by the three-dimensional shape model of the target object 51 and correct data indicating the positions of key points of the target object 51 in the learning images.
  • a key point is a virtual point within the target object 51, and is a point used to calculate the posture.
  • the estimation model 26 is trained not only by the above-mentioned training data but also by additional training data.
  • the additional training data includes an image generated based on the input image, and whether or not to add training data based on the input image is determined based on the output of the estimation model 26.
  • FIG. 4 is a diagram showing an example of key points of the target object 51.
  • the three-dimensional positions of the key points of the target object 51 are determined from the three-dimensional shape model of the target object 51 (more specifically, the information on the vertices included in the three-dimensional shape model), for example, by the well-known Farthest Point algorithm.
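  • A minimal sketch of farthest point sampling over the vertices of a shape model is given here for illustration. It assumes the vertices are plain `(x, y, z)` tuples and that the first point is the vertex farthest from the centroid; the source does not say how the first point is chosen, so that seed rule and the function name are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def farthest_point_sampling(vertices: Sequence[Point3], count: int) -> List[Point3]:
    """Pick `count` vertices that are mutually far apart (greedy selection)."""
    n = len(vertices)
    if count <= 0 or n == 0:
        return []
    # Assumed seed: the vertex farthest from the centroid of all vertices.
    centroid = tuple(sum(v[i] for v in vertices) / n for i in range(3))
    first = max(range(n), key=lambda i: math.dist(vertices[i], centroid))
    chosen = [first]
    # min_dist[i] is the distance from vertex i to its nearest chosen vertex.
    min_dist = [math.dist(v, vertices[first]) for v in vertices]
    while len(chosen) < min(count, n):
        # Next key point: the vertex farthest from everything chosen so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vertices):
            d = math.dist(v, vertices[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return [vertices[i] for i in chosen]
```

  • Calling `farthest_point_sampling(vertices, 8)` would yield eight well-spread points, which tends to make the later pose calculation better conditioned than clustered points would.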
  • Although three key points K1 to K3 are shown in FIG. 4 for ease of explanation, the actual number of key points may be larger.
  • the actual number of key points of the target object 51 is eight.
  • the trained estimation model 26 outputs information indicating the two-dimensional position of the key point of the target object 51 in the target area 55 when the target area 55 is input.
  • the two-dimensional position of the key point in the input image is determined from the two-dimensional position of the key point in the target area 55 and the position of the target area 55 in the input image.
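  • The coordinate conversion can be sketched as below. The sketch assumes the target area is an axis-aligned rectangle given by its top-left corner and size in input-image pixels, and that the area was resized to the model's input size before estimation; the function and parameter names are hypothetical.

```python
from typing import Tuple


def key_point_in_input_image(
    key_point_in_area: Tuple[float, float],
    area_origin: Tuple[float, float],
    area_size: Tuple[float, float],
    model_input_size: Tuple[float, float],
) -> Tuple[float, float]:
    """Map a key point from resized target-area coordinates to input-image coordinates."""
    # Undo the resize that was applied before the area was fed to the model.
    scale_x = area_size[0] / model_input_size[0]
    scale_y = area_size[1] / model_input_size[1]
    # Then shift by the position of the target area within the input image.
    return (
        area_origin[0] + key_point_in_area[0] * scale_x,
        area_origin[1] + key_point_in_area[1] * scale_y,
    )
```

  • For example, a key point at (64, 32) in a 128x128 model input, for a 256x256 target area whose top-left corner is at (100, 50), maps to (228, 114) in the input image.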
  • the data indicating the position of the key point may be a position image in which each point indicates the positional relationship (for example, direction) between that point and the key point.
  • FIG. 5 is a diagram schematically showing an example of a position image in the target area 55.
  • a position image may be generated for each type of key point.
  • the position image shows the relative orientation of each point to the keypoint.
  • In FIG. 5, a pattern is drawn according to the value of each point, and the value of each point indicates the direction of the key point's coordinates as seen from the coordinates of that point.
  • FIG. 5 is only a schematic diagram, and the actual value of each point changes continuously.
  • the position image is a Vector Field image that shows the relative direction of the key point at each point with respect to that point.
  • the key point determining unit 27 determines the two-dimensional position of the key point in the target area 55 and in the input image based on the output of the estimation model 26. More specifically, for example, the key point determining unit 27 calculates candidates for the two-dimensional position of the key point in the target area 55 based on the position image output from the estimation model 26, and determines the two-dimensional position of the key point in the input image from those candidates. For example, the key point determining unit 27 calculates key point candidate points from combinations of arbitrary two points in the position image and generates, for each of the candidate points, a score indicating whether the directions indicated by the points in the position image match that candidate point. The key point determination unit 27 may estimate the candidate point with the highest score as the key point position. The key point determination unit 27 repeats the above process for each key point.
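  • The candidate generation and scoring can be illustrated with the sketch below. It assumes the position image has already been reduced to a list of samples, each holding a pixel position and a unit direction toward the key point. A candidate is the intersection of the lines through two samples, and its score counts the samples whose direction agrees with it. The names and the cosine threshold are assumptions, and every pair is enumerated for simplicity, where a practical implementation would likely sample pairs at random.

```python
import itertools
import math
from typing import List, Optional, Sequence, Tuple

Vec = Tuple[float, float]
Sample = Tuple[Vec, Vec]  # (pixel position, unit direction toward the key point)


def intersect(a: Sample, b: Sample) -> Optional[Vec]:
    """Intersect the two lines defined by samples a and b; None if parallel."""
    (px, py), (dx, dy) = a
    (qx, qy), (ex, ey) = b
    det = dx * ey - dy * ex
    if abs(det) < 1e-9:
        return None
    t = ((qx - px) * ey - (qy - py) * ex) / det
    return (px + t * dx, py + t * dy)


def score(candidate: Vec, samples: Sequence[Sample], cos_threshold: float) -> int:
    """Count samples whose direction points at the candidate."""
    votes = 0
    for (px, py), (dx, dy) in samples:
        vx, vy = candidate[0] - px, candidate[1] - py
        norm = math.hypot(vx, vy)
        if norm < 1e-9:
            continue  # the sample sits on the candidate; its direction is undefined
        if (vx * dx + vy * dy) / norm >= cos_threshold:
            votes += 1
    return votes


def vote_key_point(
    samples: Sequence[Sample], cos_threshold: float = 0.99
) -> Optional[Tuple[Vec, List[Vec]]]:
    """Return (best candidate, all candidates), or None if no candidate exists."""
    candidates: List[Vec] = []
    for a, b in itertools.combinations(samples, 2):
        point = intersect(a, b)
        if point is not None:
            candidates.append(point)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, samples, cos_threshold))
    return best, candidates
```

  • Returning the full candidate list alongside the best candidate keeps the spread of the candidates available to later steps.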
  • the posture calculation unit 28 estimates the posture of the target object 51 based on information indicating the two-dimensional position of the key point in the input image and information indicating the three-dimensional position of the key point in the three-dimensional shape model of the target object 51, and outputs posture data indicating the estimated posture.
  • the pose of the target object 51 is estimated using a known algorithm. For example, it may be estimated by a method of solving a Perspective-n-Point (PnP) problem for pose estimation (e.g., EPnP). Further, the posture calculation unit 28 may estimate not only the posture of the target object 51 but also the position of the target object 51 in the input image, and the posture data may include information indicating the position.
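  • To make the PnP problem concrete, the sketch below evaluates the quantity a PnP solver tries to make small: the reprojection error between the detected 2D key points and the 3D key points projected with a candidate rotation and translation. It is not a solver such as EPnP. A pinhole camera without lens distortion is assumed, and the function names and the parameters `fx`, `fy`, `cx`, `cy` (camera internal parameters) are named for the example only.

```python
import math
from typing import Sequence, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]
Matrix3 = Sequence[Sequence[float]]


def project(
    point: Point3, rotation: Matrix3, translation: Point3,
    fx: float, fy: float, cx: float, cy: float,
) -> Point2:
    """Project a model-space 3D point into the image with a pinhole camera."""
    # Camera-space coordinates: X_cam = R * X_model + t
    cam = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def mean_reprojection_error(
    points3d: Sequence[Point3], points2d: Sequence[Point2],
    rotation: Matrix3, translation: Point3,
    fx: float, fy: float, cx: float, cy: float,
) -> float:
    """Average pixel distance between projected and detected key points."""
    if len(points3d) != len(points2d) or not points3d:
        raise ValueError("need the same non-zero number of 3D and 2D points")
    total = 0.0
    for p3, p2 in zip(points3d, points2d):
        u, v = project(p3, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(u - p2[0], v - p2[1])
    return total / len(points3d)
```

  • A pose for which this error is close to zero explains the detected key points well, which is why the camera internal parameters must be known before the problem is solved.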
  • the details of the estimation model 26, the key point determination unit 27, and the posture calculation unit 28 may be those described in the paper "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation".
  • the captured image acquisition unit 33, the identification training data generation unit 34, the identification learning unit 35, the shape model acquisition unit 36, the estimation training data generation unit 37, the estimation learning unit 38, and the reliability acquisition unit 39 are the components used to train the identification model 24 and the estimation model 26.
  • the identification model 24 and the estimation model 26 are trained in a short period of time, such as several seconds and several minutes, respectively, based on images in which the target object 51 is photographed. After the target area acquisition unit 21 and the posture estimation unit 25 have operated based on the identification model 24 and the estimation model 26, the identification model 24 and the estimation model 26 are trained again.
  • the captured image acquisition unit 33 acquires a photographed image of the target object 51 taken by the photographing unit 20 in order to train the estimation model 26 included in the posture estimation unit 25 and/or the identification model 24 included in the target area acquisition unit 21. It is assumed that the camera internal parameters of the photographing unit 20 have been acquired in advance through calibration. These parameters are used when solving the PnP problem.
  • the identification training data generation unit 34 generates positive example training data based on images that include the target object 51 and negative example training data based on images that do not include the target object 51.
  • the image including the target object 51 may be acquired by the photographed image acquisition unit 33.
  • the identification learning unit 35 trains the identification model 24 included in the target area acquisition unit 21 based on the training data generated by the identification training data generation unit 34.
  • the shape model acquisition unit 36 extracts a plurality of feature vectors representing local features from each of the plurality of photographed images of the target object 51 acquired by the captured image acquisition unit 33, determines the three-dimensional positions of the points where the feature vectors were extracted from mutually corresponding feature vectors and the positions in the photographed images where they were extracted, and acquires the three-dimensional shape model of the target object 51 based on those three-dimensional positions. This is a well-known method that is also used in software implementing so-called SfM and Visual SLAM, so a detailed explanation is omitted.
  • the estimation training data generation unit 37 generates training data for training the estimation model 26. More specifically, the estimation training data generation unit 37 generates, as initial training data, training data that includes training images rendered from the three-dimensional shape model of the target object 51 and correct data indicating the positions of the key points.
  • the estimation learning unit 38 causes the estimation model 26 included in the posture estimation unit 25 to learn using the training data generated by the estimation training data generation unit 37.
  • the reliability acquisition unit 39 acquires the reliability of the output of the machine learning model for the input data, based on the output of the machine learning model when the input data is input.
  • Obtaining the reliability based on the output of the machine learning model means, for example, calculating the reliability based on the result of subsequent processing that receives the output of the identification model 24, which is a machine learning model. More specifically, it means calculating the reliability based on the position image output by the estimation model 26.
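  • One option the summary names is to base the reliability on the variation among key point position candidates generated from different points of the position image. The sketch below illustrates that idea only. The mapping from spread to a value in (0, 1], the `scale` parameter, and the function name are assumptions, and the source does not specify a formula.

```python
import math
from typing import Sequence, Tuple


def reliability_from_candidates(
    candidates: Sequence[Tuple[float, float]], scale: float = 10.0
) -> float:
    """Map the spread of key point candidates to a reliability in (0, 1].

    Tightly clustered candidates give a value near 1; widely scattered
    candidates give a value near 0. `scale` is the spread (in pixels) at
    which the reliability drops to 0.5 (an illustrative choice).
    """
    n = len(candidates)
    if n == 0:
        return 0.0  # no candidates at all: treat the output as unreliable
    mean_x = sum(x for x, _ in candidates) / n
    mean_y = sum(y for _, y in candidates) / n
    # Root-mean-square distance of the candidates from their mean position.
    rms = math.sqrt(
        sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in candidates) / n
    )
    return 1.0 / (1.0 + rms / scale)
```

  • A low value would then be one way to satisfy the predetermined condition under which new training data is generated from the input image.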
  • FIG. 6 is a flow diagram mainly showing an example of processing by the target area acquisition unit 21 and the posture estimation unit 25. The process shown in FIG. 6 may be repeatedly executed periodically.
  • the region extraction section 22 included in the target region acquisition section 21 acquires an input image photographed by the photographing section 20 (S101).
  • the region extraction unit 22 may obtain the input image by receiving it directly from the photographing unit 20, or may obtain an input image that was received from the photographing unit 20 and stored in the storage unit 12.
  • the region extraction unit 22 extracts one or more candidate regions 56 in which some object is captured from the input image (S102).
  • the region extraction unit 22 may include a previously trained RPN (Region Proposal Network).
  • the RPN may be learned using training data that is not related to images in which the target object 51 is photographed. This process reduces computational waste and ensures a certain degree of robustness to the environment.
  • the region extraction unit 22 may further apply processing, such as background removal (mask processing) and size adjustment, to the image of the extracted candidate region 56. The processed image of the candidate area 56 may then be used for subsequent processing. This processing makes it possible to reduce the domain gap caused by the background and illumination conditions, and to train the identification model 24 with a small amount of training data.
  • the target area acquisition unit 21 determines whether each of the candidate areas 56 includes an image of the target object 51 (S103). This processing includes a process in which the feature extraction unit 23 extracts a feature amount from the image of the candidate region 56, and a process in which the identification model 24 outputs, from the feature amount, information indicating whether the candidate region 56 includes the target object 51.
  • the feature extraction unit 23 outputs feature amounts corresponding to the image of the candidate area 56.
  • the feature extraction unit 23 includes a trained CNN (Convolutional Neural Network). This CNN outputs feature amount data (input feature amount data) indicating the feature amount corresponding to the image in response to the input of the image.
  • the feature extraction unit 23 may extract feature amounts from the image of the candidate region 56 extracted by the RPN, or may obtain feature amounts extracted during RPN processing, as in Faster R-CNN.
  • the identification model 24 is an SVM (Support Vector Machine) or the like, and is a type of machine learning model.
  • the identification model 24 outputs, in response to the input of feature amount data indicating the feature amount corresponding to the image of the candidate area 56, an identification score indicating the probability that the object appearing in the candidate area 56 belongs to the positive class of the identification model 24.
  • the identification model 24 is trained using a plurality of positive example training data for positive examples and a plurality of negative example training data for negative examples.
  • the positive example training data is generated from a learning image including a photographed image of the target object 51
  • the negative example training data is an image of an object different from the target object 51, and is generated from an image prepared in advance.
  • the negative example training data may be generated from images of the surroundings of the photographing unit 20 captured by the photographing unit 20.
  • this CNN is used to generate feature amount data indicating the feature amount corresponding to the image that has undergone the normalization process.
  • the feature extraction unit 23 may instead use another known algorithm for calculating a feature amount indicating the features of an image, and output feature amount data indicating the feature amount corresponding to the input image.
  • the target area acquisition unit 21 determines that the candidate area 56 includes the image of the target object 51, for example, when the identification score is greater than the threshold value.
  • the target region acquisition unit 21 determines the target region 55 based on the determination result (S104). More specifically, the target area acquisition unit 21 acquires a rectangular area including a region near the target object 51 as the target area 55 based on the candidate area 56 determined to include the target object 51. The target area acquisition unit 21 may acquire a square area including the area near the target object 51 as the target area 55, or may simply acquire the candidate area 56 as the target area 55. Note that the target area acquisition unit 21 does not always have to acquire the target area 55 through the processes of S102 and S103. For example, the target area acquisition unit 21 may acquire the target area 55 by performing a known time-series tracking process on the input image acquired after acquiring the target area 55.
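  • The thresholding in S103 and the derivation of a square target area in S104 can be sketched as follows. Boxes are assumed to be `(left, top, width, height)` tuples in input-image pixels. Choosing the highest-scoring accepted candidate and the `margin` parameter are assumptions made for the example, as are the names.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height)


def select_target_area(
    scored_candidates: List[Tuple[Box, float]],
    threshold: float = 0.5,
    margin: float = 0.1,
) -> Optional[Box]:
    """Return a square area around the best candidate, or None if none passes."""
    # S103: a candidate is judged to contain the target object when its
    # identification score is greater than the threshold.
    accepted = [(box, s) for box, s in scored_candidates if s > threshold]
    if not accepted:
        return None
    # Assumed tie-break: keep the accepted candidate with the highest score.
    (left, top, width, height), _ = max(accepted, key=lambda item: item[1])
    # S104: a square centred on the candidate, enlarged by `margin` on each
    # side so that the vicinity of the object is included.
    side = max(width, height) * (1.0 + 2.0 * margin)
    center_x = left + width / 2.0
    center_y = top + height / 2.0
    return (center_x - side / 2.0, center_y - side / 2.0, side, side)
```

  • The returned square may extend past the image border; clipping or padding it is left to the caller in this sketch.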
  • the posture estimation unit 25 inputs the image of the target area 55 to the learned estimation model 26 (S105).
  • the image of the target region 55 input here may be an image whose size has been adjusted (enlarged or reduced) according to the size of the input image of the estimation model 26. By adjusting (normalizing) the size, the learning efficiency of the estimation model 26 is improved.
  • the posture estimation unit 25 may mask the background of the image of the target region 55 and input the image of the target region 55 with the background masked to the estimation model 26.
  • the key point determining unit 27 included in the posture estimating unit 25 determines the two-dimensional position of the key point in the target area 55 and the input image based on the output of the estimation model 26 (S106).
  • the key point determining unit 27 calculates candidates for key point positions from each point of the position image, and determines the position of the key points based on the candidates.
  • when the output of the estimation model 26 is the position of a key point in the target region 55, the position of the key point in the input image may be calculated from that position. Note that the processes in S105 and S106 are performed for each type of key point.
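
Converting a key point position in the (resized) target region back to the input image can be sketched as below. The sketch assumes the target region was scaled to a fixed model input size without rotation; the function name and argument layout are hypothetical.

```python
def to_input_image(point, target_area, model_input_size):
    """Map a key point from model-input coordinates to input-image coordinates.

    point:            (x, y) in the resized target-region image.
    target_area:      (left, top, right, bottom) of the target area in the input image.
    model_input_size: (width, height) the target area was resized to (assumed).
    """
    x, y = point
    left, top, right, bottom = target_area
    in_w, in_h = model_input_size
    # Undo the size normalization, then shift by the target area's origin.
    scale_x = (right - left) / in_w
    scale_y = (bottom - top) / in_h
    return (left + x * scale_x, top + y * scale_y)
```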
  • the posture calculation section 28 included in the posture estimation section 25 calculates the estimated posture of the target object 51 based on the determined two-dimensional position of the key point (S107).
  • the posture calculation unit 28 may calculate the position of the target object 51 along with the posture.
  • the posture and position may be calculated by solving the PnP (Perspective-n-Point) problem described above.
  • FIG. 7 is a diagram illustrating the posture of the detected target object 51.
  • the posture of the target object 51 is represented by a local coordinate axis 59 indicating the local coordinate system of the target object 51.
  • the position of the origin of the local coordinate axis 59 indicates the position of the target object 51, and the direction of the line of the local coordinate axis 59 indicates the posture.
  • the reliability acquisition unit 39 calculates the reliability of the output of the estimation model 26 for the target region 55 (S108). Then, when the reliability satisfies a predetermined condition, the identification training data generation unit 34 and the estimation training data generation unit 37 generate, based on the target area, additional training data for the identification model 24 and the estimation model 26, respectively (S109). The process in S109 generates additional training data from data input to the machine learning model after the model has been trained (that is, at the time of inference). Details of the processing in S108 and S109 will be described later.
  • the estimated posture and position of the target object 51 may be used in various ways.
  • the operation information may be input to application software such as a game instead of the operation information input by the controller.
  • the processor 11 that executes the execution code of the application software may generate image data based on the attitude (and position) and cause the display unit 18 to output the image.
  • the processor 11 may cause the information processing device 10 or an audio output device connected to the information processing device 10 to output a sound based on its posture (and position).
  • the processor 11 may control the operation of the AI agent such as a robot by notifying the position and orientation of the object to the AI agent, for example, causing the AI agent to grasp the object.
  • FIG. 8 is a flow diagram schematically explaining learning of the identification model 24 and estimation model 26.
  • the identification training data generation unit 34 acquires initial training data for the identification model 24, and the estimation training data generation unit 37 acquires initial training data for the estimation model 26 (S201).
  • FIG. 9 is a flow diagram illustrating an example of a process for generating initial training data.
  • the photographed image acquisition unit 33 acquires a plurality of photographed images in which the target object 51 is photographed (S301).
  • FIG. 10 is a diagram illustrating photographing of the target object 51.
  • the target object 51 is held, for example, by a hand 53, and is photographed by the photographing unit 20.
  • the photographing unit 20 changes the direction in which the target object 51 is photographed while periodically photographing images as in video photographing.
  • the photographing direction of the target object 51 may be changed by changing the posture of the target object 51 using the hand 53.
  • the target object 51 may be placed on the AR marker and the shooting direction may be changed by moving the shooting unit 20.
  • the acquisition interval of photographed images used in the processing described below may be wider than the interval of photographing a moving image.
  • the photographed image acquisition unit 33 masks the image of the hand 53 from these photographed images (S302).
  • the image of the hand 53 may be masked by a known method.
  • the photographed image acquisition unit 33 may mask the image of the hand 53 by detecting a skin-colored region included in the photographed image.
  • the shape model acquisition unit 36 calculates the three-dimensional shape model of the target object 51 and the posture in each of the captured images from the plurality of captured images (S303). This process may be performed using the above-mentioned known method that is also used in software that implements so-called SfM and Visual SLAM.
  • the shape model acquisition unit 36 may calculate the posture of the target object 51 based on the calculation logic of the camera photographing direction using this method.
  • the shape model acquisition unit 36 determines the three-dimensional positions of a plurality of key points used for estimating the posture of the three-dimensional shape model (S304).
  • the shape model acquisition unit 36 may determine the three-dimensional positions of the plurality of key points using, for example, a known Farthest Point algorithm.
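
A farthest point selection over the vertices of the shape model can be sketched as follows. This is one common greedy variant, given as an illustration only; the starting index `start` and the use of plain vertex tuples are assumptions, since the text does not specify them.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point is farthest from those already chosen.

    points: list of (x, y, z) tuples, e.g. vertices of the 3D shape model.
    start:  index of the first key point (an assumption; any seed may be used).
    """
    chosen = [start]
    # Distance from every point to its nearest already-chosen key point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return chosen
```

The selected key points tend to spread over the surface of the model, which helps when the posture is later recovered from their two-dimensional positions.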
  • the estimated training data generation unit 37 generates training data including a plurality of training images and a plurality of position images for the estimation model 26 (S305). More specifically, the estimated training data generation unit 37 generates a plurality of training images rendered from the three-dimensional shape model, and generates a position image indicating the position of a key point in the plurality of training images.
  • the plurality of training images are rendered images of the target object 51 viewed from a plurality of different directions, and a position image is generated for each combination of a training image and a key point.
  • the estimated training data generation unit 37 virtually projects the position of the key point onto the rendered training image, and generates a position image based on the relative position of the projected key point position and each point in the image.
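
The projection of a key point and the generation of a position image can be sketched as below. The sketch assumes a simple pinhole camera and assumes that each point of the position image stores a unit vector pointing toward the projected key point; the function names, the camera parameters `focal` and `center`, and the omission of any object mask are simplifications introduced here.

```python
import math

def project(point_3d, focal, center):
    """Pinhole projection of a camera-space 3D point onto the image plane (assumed model)."""
    x, y, z = point_3d
    return (focal * x / z + center[0], focal * y / z + center[1])

def position_image(keypoint_2d, width, height):
    """For every pixel, store the unit direction from the pixel toward the key point."""
    kx, ky = keypoint_2d
    image = []
    for v in range(height):
        row = []
        for u in range(width):
            dx, dy = kx - u, ky - v
            norm = math.hypot(dx, dy)
            # The pixel that coincides with the key point has no direction.
            row.append((dx / norm, dy / norm) if norm > 0 else (0.0, 0.0))
        image.append(row)
    return image
```

One such position image would be produced for each combination of training image and key point.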
  • the training data used for learning the estimation model 26 includes training images and position images.
  • the training images included in the initial training data are rendered images. This is because while it is difficult to obtain images taken from various photographing directions in a short period of time, it is possible to easily generate images viewed from various photographing directions using a three-dimensional shape model.
  • the initial training data may include live training images.
  • the identification training data generation unit 34 generates positive example training data from those of the captured images acquired by the captured image acquisition unit 33 that include the target object 51, and generates negative example training data from images that do not include the target object and are stored, for example, in the storage unit 12 (S306).
  • the positive example training data and the negative example training data are training data for the identification model 24.
  • the identification training data generation unit 34 may generate positive example training data from the images by performing processing that matches the input of the identification model 24, such as cutting out a region including the target object 51, normalizing the size, masking the background, and extracting feature amounts.
  • the identification training data generation section 34 inputs the negative example sample images stored in the storage section 12 in advance to the feature extraction section 23, and generates a plurality of negative example training data by acquiring the output feature amount data.
  • the feature amount is extracted by the same process as the feature extraction unit 23 included in the identification model 24.
  • the negative example sample image may be, for example, an image photographed in advance by the photographing unit 20, an image collected from the Web, or a positive example image of another object.
  • the negative example training data may be generated in advance and stored in the storage unit 12.
  • identification model 24 is not limited to those described above, and may be one that determines whether the target object 51 exists directly from the image.
  • the identification learning unit 35 trains the identification model 24 using the initial training data for the identification model, and the estimation learning unit 38 trains the estimation model 26 using the initial training data for the estimation model (S202).
  • the discrimination model 24 is, for example, an SVM, and the discrimination learning unit 35 may cause the SVM to learn using positive example training data and negative example training data.
  • the information processing system performs so-called inference processing using these models, and learns the identification model 24 and the estimation model 26 according to the reliability.
  • the information processing system inputs the photographed image as an input image to the target area acquisition unit 21, and the target area acquisition unit 21 and the posture estimation unit 25 execute processing for extracting the target area 55 and estimating the posture of the target object 51 included in the target area 55.
  • the process in S203 corresponds to the processes in S101 to S107 in FIG.
  • the reliability acquisition unit 39 calculates the reliability of the output based on the output of the estimation model 26 included in the posture estimation unit 25 (S204). This process corresponds to the process of S108 in FIG.
  • the reliability obtaining unit 39 calculates the reliability using the following procedure, for example.
  • the reliability acquisition unit 39 selects a plurality of groups each including two points from the position image output by the estimation model 26.
  • the reliability acquisition unit 39 calculates key point candidate positions for each group based on the direction of the key point indicated by each point included in the group.
  • a candidate position corresponds to the intersection of a straight line extending from a certain point in the direction indicated by that point and a straight line extending from another point in the direction indicated by that point.
  • the reliability acquisition unit 39 calculates a value indicating the dispersion of candidate positions as the reliability.
  • the reliability obtaining unit 39 may take, for example, the average value of the distance from the center of gravity of the candidate position as the reliability, or may calculate the standard deviation of the candidate position in any direction as the reliability.
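
The dispersion-based reliability described above can be sketched as follows. The sketch uses every pair of sampled points as a group and the mean distance from the center of gravity as the dispersion value; the function names and the choice of all pairs are assumptions made for illustration. A smaller value indicates that the candidate positions agree more closely.

```python
import math
from itertools import combinations

def intersect(p, d, q, e):
    """Intersection of the line p + t*d with the line q + s*e, or None if parallel."""
    cross = d[0] * e[1] - d[1] * e[0]
    if abs(cross) < 1e-9:
        return None
    t = ((q[0] - p[0]) * e[1] - (q[1] - p[1]) * e[0]) / cross
    return (p[0] + t * d[0], p[1] + t * d[1])

def dispersion(points, directions):
    """Mean distance of key point candidate positions from their center of gravity.

    points:     pixel positions sampled from the position image.
    directions: direction toward the key point stored at each of those points.
    """
    candidates = []
    for i, j in combinations(range(len(points)), 2):
        c = intersect(points[i], directions[i], points[j], directions[j])
        if c is not None:
            candidates.append(c)
    if not candidates:
        return math.inf  # No usable candidate: treat as maximally unreliable.
    gx = sum(c[0] for c in candidates) / len(candidates)
    gy = sum(c[1] for c in candidates) / len(candidates)
    return sum(math.hypot(c[0] - gx, c[1] - gy) for c in candidates) / len(candidates)
```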
  • the reliability obtaining unit 39 may calculate the reliability from values other than values indicating variations in candidate positions. For example, the reliability acquisition unit 39 may calculate the reliability based on information indicating the difference between the output of the estimation model 26 when an input image, such as the image of the target region, is input to the estimation model 26 and the output when a processed image, obtained by applying predetermined processing to the input image, is input to the estimation model 26.
  • the reliability acquisition unit 39 performs predetermined processing (Augmentation) on the image of the target area. This processing may be, for example, changing brightness or adding noise.
  • the reliability acquisition unit 39 inputs the processed image to the estimation model 26 and acquires the position image as its output.
  • the reliability acquisition unit 39 calculates, as the reliability, a value indicating the difference between the position image output for the original target area image (initial output) and the position image output for the processed image.
  • This value may be a statistic of the per-point difference between the initial output and the output for the processed image, or it may be the distance between the key point position calculated from the initial output and the key point position calculated from the output for the processed image.
  • instead of the initial output, an output obtained when an image produced by processing the image of the target area in a manner different from the predetermined processing is input to the estimation model 26 may be used.
  • the processing (Augmentation) performed here may be different from the processing performed by the estimated training data generation unit 37, which will be described later. Using different methods suppresses the loss of accuracy in the reliability that would otherwise occur when the reliability is calculated using an estimation model 26 that has been trained with the additional training data.
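
The reliability based on the difference between the initial output and the output for a processed image can be sketched as below. Here `model` is a hypothetical callable standing in for the estimation model, images are nested lists of values in the range 0 to 1, and the mean absolute difference is used as the statistic; all of these are assumptions chosen for illustration.

```python
def change_brightness(image, gain):
    """Scale pixel values by `gain`, clipping to the range [0, 1]."""
    return [[min(1.0, max(0.0, px * gain)) for px in row] for row in image]

def output_difference(out_a, out_b):
    """Mean absolute per-point difference between two outputs of the same shape."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(out_a, out_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def augmentation_reliability(model, image, gain=1.2):
    """Difference between the initial output and the output for the processed image."""
    initial = model(image)
    processed = model(change_brightness(image, gain))
    return output_difference(initial, processed)
```

A small value suggests that the output is stable under the processing, which is the property used here as an indicator of reliability.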
  • the reliability obtaining unit 39 may output, as the final reliability, a combination of the reliability (element) calculated from the value indicating the variation of candidate positions and the value indicating the difference between the initial output and the output for the processed image.
  • the reliability obtaining unit 39 may output, for example, a value obtained by weighting and adding the former and the latter as the reliability.
  • the reliability acquisition unit 39 determines whether the calculated reliability satisfies the additional conditions for adding training data (S205).
  • the additional condition may be, for example, that the value of the variation calculated as the reliability is smaller than a threshold value.
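
The weighted combination and the additional condition can be sketched as follows. The weights and the threshold are hypothetical values, and the sketch assumes that both elements are "smaller is better" quantities, as with the dispersion value.

```python
def combined_reliability(dispersion, output_diff, w_dispersion=0.5, w_diff=0.5):
    """Weighted sum of the two reliability elements (weights are assumptions)."""
    return w_dispersion * dispersion + w_diff * output_diff

def satisfies_additional_condition(reliability, threshold):
    """True when the variation-like reliability value is smaller than the threshold."""
    return reliability < threshold
```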
  • the identification training data generation unit 34 and the estimation training data generation unit 37 generate additional training data to be added to the training data of the identification model 24 and estimation model 26, respectively.
  • the identification training data generation unit 34 determines an image corresponding to the target area 55 that is the source of the position image (for example, an image of the corresponding candidate area 56) as a positive example image, and adds positive example training data based on that image to the training data of the identification model.
  • the identification training data generation unit 34 may generate the positive example training data from the photographed image by applying to the image determined as the positive example image the processing that matches the input of the identification model 24, for example, feature amount extraction.
  • the estimation training data generation unit 37 also generates a set of a first additional image and a second additional image based on the image of the target area that is the source of the position image, and adds the set to the additional training data for the estimation model.
  • the estimated training data generation unit 37 performs first processing (Augmentation) on the image of the target area, and obtains the processed image as a first additional image.
  • the estimated training data generation unit 37 also performs second processing (Augmentation) on the image of the target area, and obtains the processed image as a second additional image.
  • the first processing and the second processing are different from each other, and may each include, for example, changing brightness or adding noise. Further, one of the first processing and the second processing may not involve substantial processing. A method of learning the estimation model (consistency loss) using the first additional image and the second additional image set will be described later.
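
Generating a set of a first additional image and a second additional image can be sketched as below. The sketch applies a brightness change as the first processing and additive Gaussian noise as the second; the gain, the noise level, and the function names are assumptions, and images are nested lists of values in the range 0 to 1.

```python
import random

def change_brightness(image, gain):
    """First processing (assumed): scale pixel values, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px * gain)) for px in row] for row in image]

def add_noise(image, sigma, rng):
    """Second processing (assumed): add Gaussian noise, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]

def make_additional_pair(image, rng=None):
    """Return (first additional image, second additional image) for one target-area image."""
    rng = rng or random.Random(0)
    first = change_brightness(image, gain=1.2)
    second = add_noise(image, sigma=0.05, rng=rng)
    return first, second
```

No correct data is attached to the set; only the two differently processed views of the same target-area image are stored.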
  • the estimated training data generation unit 37 may add to the additional training data a set of the image of the target area that is the source of the position image and correct data indicating the posture calculated by the posture estimation unit 25 for that image.
  • the estimated model 26 may be trained using the same method as the initial learning.
  • part of the input data used for inference using a trained machine learning model is added to the training data.
  • input data for inference is usually not added to training data. This is because, for example, if the output for input data is incorrect, adding that input data may cause a decline in the quality of the training data.
  • the reliability of the output of the machine learning model is calculated, and the quality of the data to be added is ensured by filtering whether or not to add it to the training data using the reliability. This makes it possible to improve the accuracy of machine learning models while reducing the effort required to generate them.
  • the reliability calculated in this embodiment can be considered as the reliability of the identification model 24 and the estimation model 26. From the perspective of the reliability of the output of the identification model 24, it can be said that the reliability is determined to indicate whether the position image output by the estimation model 26 which is located after the identification model 24 is in a state where key points can be accurately determined. In this way, by using the ability to appropriately perform processing including subsequent machine learning models as an indicator of reliability, reliability can be calculated easily and effectively. In terms of the reliability of the output of the estimation model 26, it can be said that the reliability is determined by the fact that the subsequent process of determining key points from the output position image can be performed appropriately.
  • the processes from S203 onward are repeatedly executed unless the conditions for starting relearning are met (N at S207).
  • the condition for starting relearning may be that the number of acquired additional training data reaches a threshold value, or that an operation to end so-called iterative estimation processing is input.
  • the identification learning unit 35 and the estimation learning unit 38 retrain the identification model 24 and the estimation model 26, respectively (S208).
  • relearning refers to learning a machine learning model using training data including additional training data.
  • the machine learning models (identification model 24 and estimation model 26) that are the targets of learning may be instances different from the identification model 24 and estimation model 26 that are performing inference, or may be the same instances as the machine learning models performing inference. In the former case, the instances of the identification model 24 and estimation model 26 used for inference may be switched after learning is completed. Furthermore, instead of switching instances, parameters of a newly learned machine learning model may be copied to instances of the identification model 24 and estimation model 26 used for inference.
  • the identification learning unit 35 may add additional training data to the initial training data, and train the identification model 24 using the combined training data. Further, the training data used for learning the identification model 24 may be all of the initial training data and additional training data, or may be a subset of them. The subset may be selected so that its size does not exceed a maximum total number of samples, and samples determined by some method to be of low quality may be excluded.
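
Selecting the training data used for relearning the identification model can be sketched as follows. The random subsampling, the `is_low_quality` predicate, and the function name are assumptions; the text only states that the number of samples may be capped and that low-quality samples may be excluded.

```python
import random

def select_training_samples(initial, additional, max_samples,
                            is_low_quality=None, seed=0):
    """Merge initial and additional samples, drop low-quality ones, cap the total.

    is_low_quality: optional predicate (hypothetical) marking samples to exclude.
    """
    pool = list(initial) + list(additional)
    if is_low_quality is not None:
        pool = [s for s in pool if not is_low_quality(s)]
    if len(pool) <= max_samples:
        return pool
    # Random subsampling is one possible policy; others (e.g. newest first) also fit.
    return random.Random(seed).sample(pool, max_samples)
```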
  • FIG. 11 is a flowchart illustrating an example of a process of relearning the estimated model. In the relearning, the process shown in FIG. 11 is repeatedly executed multiple times.
  • the estimation learning unit 38 trains the estimation model 26 using initial training data for the estimation model 26 (S501). This learning is performed using the same method as the learning of the estimation model 26 in step S202. More specifically, the estimation learning unit 38 adjusts the parameters of the estimation model 26 using, as a teaching signal, the difference (L1 loss) between the position image output by the estimation model 26 and the correct data.
  • the estimation learning unit 38 obtains one of the unobtained sets included in the additional training data for the estimation model 26 (S502).
  • the estimation learning unit 38 inputs the first additional image included in the set to the estimation model 26 and obtains the output (first output) of the estimation model 26 (S503).
  • the estimation learning unit 38 inputs the second additional image included in the set to the estimation model 26, and obtains the output (second output) of the estimation model 26 (S504).
  • the estimation learning unit 38 calculates information indicating the difference (consistency loss) between the first output and the second output (S505), and adjusts the parameters of the estimation model 26 based on the information indicating the difference (S506).
  • the information indicating the difference may be a statistic (for example, an average) of the difference in values at each point of the first output and the second output.
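
Steps S503 to S506 can be sketched with a toy model as below. The one-parameter `toy_model`, the numerical gradient, and the learning rate are stand-ins chosen so the example is self-contained; they are not the estimation model or optimizer of the embodiment. The consistency loss is taken as the mean absolute per-point difference.

```python
def consistency_loss(first_output, second_output):
    """Mean absolute difference between two outputs of equal length."""
    return sum(abs(a - b) for a, b in zip(first_output, second_output)) / len(first_output)

def toy_model(weight, image):
    """Hypothetical one-parameter stand-in for the estimation model."""
    return [weight * px for px in image]

def consistency_step(weight, first_image, second_image, lr=0.1, eps=1e-6):
    """One gradient-descent step on the consistency loss (numerical gradient)."""
    def loss(w):
        return consistency_loss(toy_model(w, first_image),
                                toy_model(w, second_image))
    grad = (loss(weight + eps) - loss(weight - eps)) / (2 * eps)
    return weight - lr * grad
```

No correct data is needed for this step: the parameter is adjusted only so that the two outputs for the same underlying image become closer.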
  • for the additional training data, learning is performed according to the difference between the first output and the second output, so if learning relies mainly on this method, the parameters of the estimation model 26 may converge to a state in which, for example, the same position image is output regardless of the input. To avoid this situation, it is desirable to keep the ratio of the number of additional training data to the number of all training data including the initial training data within a predetermined value (for example, 20%).
  • through the processing of the target area acquisition unit 21, the images input to the estimation model 26 are limited to images of a region of the captured image in which the target object 51 exists and in which the target object 51 is located near the center with sufficiently high probability. Furthermore, the estimation model 26 of the posture estimation unit 25 is trained based on training data generated from a three-dimensional shape model, while the identification model 24 of the target area acquisition unit 21 is trained based on images in which the target object 51 is photographed.
  • the accuracy of the output of the estimation model 26 is improved, and the accuracy of the estimated posture of the target object 51 is improved. Furthermore, by training the identification model 24 based on captured images rather than images based on a three-dimensional shape model, it becomes possible to select the target region 55 more accurately, and the accuracy of the estimation model 26 can be improved.
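
Keeping the share of additional training data within a predetermined ratio can be sketched as follows. The function name is hypothetical, and the 20% default simply mirrors the example value in the text. The count `a` is chosen as the largest integer satisfying `a / (num_initial + a) <= max_ratio`.

```python
def max_additional_samples(num_initial, max_ratio=0.2):
    """Largest number of additional samples whose share of all samples is <= max_ratio."""
    # Closed-form estimate a <= r * n / (1 - r), then corrected for rounding error.
    count = int(num_initial * max_ratio / (1.0 - max_ratio))
    while count > 0 and count / (num_initial + count) > max_ratio:
        count -= 1
    while (count + 1) / (num_initial + count + 1) <= max_ratio:
        count += 1
    return count
```

For 80 initial samples, at most 20 additional samples keep the ratio at 20% of the 100 samples in total.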
  • the photographed image for generating the three-dimensional shape model for learning the estimation model 26 of the posture estimation unit 25 is also used when learning the identification model 24. This reduces the effort required to photograph the target object 51 and the time required to learn the estimation model 26 and the identification model 24.
  • the identification model 24 may be an SVM of any kernel.
  • the discrimination model 24 may be a discriminator using a method such as the K-nearest neighbor method, logistic regression, or a boosting method such as AdaBoost.
  • Discrimination model 24 may also be implemented by a neural network, a Naive Bayes classifier, a random forest, a decision tree, or the like.
  • the output of the estimation model 26 may be a position image such as a heat map indicating the positions of key points.
  • the reliability obtaining unit 39 may obtain the number of peaks that the position image output by the estimation model 26 has as the reliability. If this number of peaks is less than a threshold, the input data may be added to the training data.
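
Counting the peaks of a heat map can be sketched as below. A peak is taken here to be a point that is at least `min_value` and strictly greater than all of its eight neighbors; this definition and the `min_value` parameter are assumptions, since the text does not define how peaks are detected.

```python
def count_peaks(heatmap, min_value=0.5):
    """Count strict local maxima of a 2D heat map that are at least `min_value`."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = 0
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < min_value:
                continue
            # Collect the up-to-eight neighbors that lie inside the heat map.
            neighbors = [heatmap[ny][nx]
                         for ny in range(max(0, y - 1), min(height, y + 2))
                         for nx in range(max(0, x - 1), min(width, x + 2))
                         if (ny, nx) != (y, x)]
            if all(value > n for n in neighbors):
                peaks += 1
    return peaks
```

A heat map with a single clear peak suggests an unambiguous key point position, whereas several peaks suggest competing hypotheses.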

PCT/JP2022/011645 2022-03-15 2022-03-15 Information processing system, information processing method, and program Ceased WO2023175727A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
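JP2024507259A JP7724361B2 (ja) 2022-03-15 2022-03-15 Information processing system, information processing method, and program
PCT/JP2022/011645 WO2023175727A1 (ja) 2022-03-15 2022-03-15 Information processing system, information processing method, and program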
US18/841,737 US20250165862A1 (en) 2022-03-15 2022-03-15 Information processing system, information processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/011645 WO2023175727A1 (ja) 2022-03-15 2022-03-15 Information processing system, information processing method, and program

Publications (1)

Publication Number Publication Date
WO2023175727A1 true WO2023175727A1 (ja) 2023-09-21

Family

ID=88022493

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/011645 Ceased WO2023175727A1 (ja) Information processing system, information processing method, and program 2022-03-15 2022-03-15

Country Status (3)

Country Link
US (1) US20250165862A1
JP (1) JP7724361B2
WO (1) WO2023175727A1

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7756230B1 (ja) * 2024-12-26 2025-10-17 ソフトバンク株式会社 学習装置、学習方法および学習プログラム

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137644A1 (en) * 2016-11-11 2018-05-17 Qualcomm Incorporated Methods and systems of performing object pose estimation
JP2020160804A (ja) * 2019-03-27 2020-10-01 三菱電機株式会社 情報処理装置、プログラム及び情報処理方法
US20200380723A1 (en) * 2019-05-30 2020-12-03 Seiko Epson Corporation Online learning for 3d pose estimation
JP2021137747A (ja) * 2020-03-06 2021-09-16 株式会社明電舎 水処理施設の操作量導出装置
US20200401793A1 (en) * 2020-06-26 2020-12-24 Intel Corporation Apparatus and methods for determining multi-subject performance metrics in a three-dimensional space
CN112907583A (zh) * 2021-03-29 2021-06-04 苏州科达科技股份有限公司 目标对象姿态选择方法、图像评分方法及模型训练方法

Also Published As

Publication number Publication date
JP7724361B2 (ja) 2025-08-15
JPWO2023175727A1 2023-09-21
US20250165862A1 (en) 2025-05-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22932011; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2024507259; Country of ref document: JP; Kind code of ref document: A)
WWE Wipo information: entry into national phase (Ref document number: 18841737; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22932011; Country of ref document: EP; Kind code of ref document: A1)
WWP Wipo information: published in national office (Ref document number: 18841737; Country of ref document: US)