WO2022167298A1 - Annotation of two-dimensional images - Google Patents

Annotation of two-dimensional images

Info

Publication number
WO2022167298A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
landmarks
dimensional images
collection
landmark
Prior art date
Application number
PCT/EP2022/051810
Other languages
English (en)
Inventor
Constantin Cosmin ATANASOAEI
Daniel Milan LÜTGEHETMANN
Dimitri Zaganidis
John RAHMON
Michele DE GRUTTOLA
Original Assignee
Inait Sa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/179,596 (published as US20220245860A1)
Application filed by Inait Sa
Priority to KR1020237029993A (KR20230138011A)
Priority to CN202280012892.XA (CN116868240A)
Priority to EP22708359.9A (EP4288941A1)
Publication of WO2022167298A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Definitions

  • This specification relates to image processing, more specifically, to image processing that involves annotation of landmarks on two-dimensional images.
  • Image processing is a type of signal processing in which the processed signal is an image.
  • An input image can be processed, e.g., to produce an output image or a characterization of the image.
  • Annotation of an image can facilitate processing of the image, especially in image processing techniques that rely upon machine learning.
  • Annotation can label entities or parts of entities in images with structured information or metadata.
  • The label can indicate, e.g., a class (e.g., cat, dog, arm, leg), a boundary, a corner, a location, or other information.
  • The labels can be used in a variety of contexts, including those that rely upon machine learning and/or artificial intelligence.
  • A collection of annotated images can form a training dataset for pose estimation, image classification, feature extraction, and pattern recognition in contexts as diverse as medical imaging, self-driving vehicles, damage assessment, facial recognition, and agriculture.
  • Machine learning and artificial intelligence models require large datasets that are customized to the particular task performed by the model.
  • This specification describes technologies relating to image processing that involves annotation of landmarks on two-dimensional images.
  • In one aspect, the subject matter described in this specification can be embodied in methods performed by data processing apparatus for training a device for estimating the relative pose of an imaging device and an object in a two-dimensional image.
  • The methods include identifying a 3D model of the object, identifying landmarks on the 3D model of the object, projecting the 3D model into a collection of two-dimensional images with knowledge of the location of the landmarks from the 3D model on the projection, and training a landmark-detection machine learning model to identify the landmarks in the collection of two-dimensional images.
  • The landmark-detection machine learning model is part of a device for estimating the relative pose of an imaging device.
  • The methods can include estimating relative poses of the object in two-dimensional images using the device that includes the landmark-detection machine learning model, determining a correctness of the estimates of the relative poses, and further training the landmark-detection machine learning model based on the correctness of the estimates of the relative poses.
  • The relative poses of the object can be estimated in the collection of two-dimensional images into which the 3D model is projected.
  • The correctness of the estimates of the relative poses can be determined by constraining relative poses of the projections of the 3D model into the collection of two-dimensional images, and classifying any estimate of the relative pose that does not satisfy the constraints as incorrect.
  • Identifying the landmarks on the 3D model of the object can include rendering a collection of two-dimensional images of the object by projecting the 3D model of the object onto two dimensions, assigning different regions of the object in the two-dimensional images to respective parts of the object, determining distinguishable regions of the parts of the object using the assigned regions, and projecting the distinguishable regions back onto the 3D model of the object to identify the landmarks on the 3D model of the object.
  • In another aspect, the subject matter described in this specification can be embodied in methods performed by data processing apparatus for estimating the relative pose of an imaging device and an object in a two-dimensional image of the object.
  • The methods include detecting a plurality of landmarks on the object in the two-dimensional image, filtering the plurality of landmarks to establish a plurality of subsets of the detected landmarks, calculating, using each of the respective subsets of the detected landmarks, candidate relative poses of the object in the two-dimensional image, and estimating the relative pose of the imaging device and the object based on at least one of the candidate relative poses.
  • The methods can include filtering the candidate relative poses of the object.
  • The criteria for filtering the candidate relative poses can reflect real-world conditions in which a real image is likely to be taken.
  • Estimating the relative pose of the imaging device and the object can include averaging several of the candidate relative poses.
  • Detecting the landmarks on the object can include detecting the landmarks using a landmark-detection machine learning model.
  • The landmark-detection machine learning model can have been trained by a process that includes identifying a 3D model of the object, identifying landmarks on the 3D model of the object, projecting the 3D model into a collection of two-dimensional images with knowledge of the location of the landmarks from the 3D model on the projection, and training the landmark-detection machine learning model to identify the landmarks in the collection of two-dimensional images.
  • In another aspect, the subject matter described in this specification can be embodied in methods performed by data processing apparatus for identifying landmarks on a 3D model of an object.
  • The methods include rendering a collection of two-dimensional images of an object by projecting the 3D model of the object onto two dimensions, assigning different regions of the object in the two-dimensional images to respective parts of the object, determining distinguishable regions of the parts of the object using the assigned regions, and projecting the distinguishable regions back onto the 3D model of the object to identify the landmarks on the 3D model of the object.
  • Determining the distinguishable regions of the parts comprises detecting corners of projections of the parts in the two-dimensional images.
  • The method can include reducing a number of the distinguishable regions prior to projection back onto the 3D model.
  • The number of the distinguishable regions can be reduced by filtering out distinguishable regions that are close to an outer boundary of the object.
  • The number of the distinguishable regions can be reduced by clustering back-projections of the distinguishable regions onto the 3D model from different ones of the two-dimensional images and discarding outliers of the distinguishable regions.
  • Rendering the collection of two-dimensional images of the object can include permuting the object and projecting the permutations of the 3D model onto two dimensions.
  • Rendering the collection of two-dimensional images of the object can include varying the rendering to mimic variation in a characteristic of an imaging apparatus, to mimic variation in a characteristic of image processing applicable to two-dimensional images, or to mimic variation in an imaging condition.
  • FIG. 1 is a schematic representation of the acquisition of a collection of different images of an object.
  • FIG. 2 is a schematic representation of a collection of two-dimensional images acquired by one or more cameras.
  • FIG. 3 is a flowchart of a computer-implemented process for processing photographic images of an object.
  • FIG. 4 is a flowchart of a computer-implemented process for annotating landmarks that appear on a 3D model.
  • FIGS. 5A, 5B, 5C, 5D show example results from the annotation of landmarks on a 3D model of an automobile.
  • FIG. 6 is a flowchart of a process for producing a landmark detector that is capable of detecting landmarks in real two-dimensional images using an annotated 3D model.
  • FIG. 7 is a flowchart of a process for recognizing relative poses between an imaging device and an object using a machine-learning model for landmark detection.
  • FIG. 8 is a histogram that represents the accuracy of an example machine learning model for landmark detection that has been produced using the process of FIG. 6.
  • FIG. 9 is a histogram that represents the accuracy of relative pose predictions made using the process of FIG. 7.
  • FIG. 1 is a schematic representation of the acquisition of a collection of different images of an object 100.
  • For illustrative purposes, object 100 is shown as an assembly of ideal, unmarked geometric parts (e.g., cubes, polyhedrons, parallelepipeds, etc.).
  • In practice, objects will generally have a more complicated shape and be textured or otherwise marked, e.g., with ornamental decoration, wear marks, or other markings upon the underlying shape.
  • A collection of one or more imaging devices can be disposed successively or simultaneously at different relative positions around object 100 and oriented at different relative angles with respect to object 100.
  • The positions can be distributed in three-dimensional space around object 100.
  • The orientations can also vary in three dimensions, i.e., the Euler angles (or yaw, pitch, and roll) can all vary.
  • The relative positioning and orientation of a camera 105, 110, 115, 120, 125 with respect to object 100 can be referred to as the relative pose. Since cameras 105, 110, 115, 120, 125 have different relative poses, they will each acquire different images of object 100.
  • A landmark is a position of interest on object 100.
  • Landmarks can be positioned at geometric locations on an object or at a marking upon the underlying geometric shape. As discussed further below, landmarks can be used for determining the pose of the object. Landmarks can also be used for other types of image processing, e.g., for classifying the object, for extracting features of the object, for locating other structures on the object (geometric structures or markings), for assessing damage to the object, and/or for serving as points of origin from which measurements can be made in these and other image processing techniques.
  • FIG. 2 is a schematic representation of a collection 200 of two-dimensional images acquired by one or more cameras, such as cameras 105, 110, 115, 120, 125 (FIG. 1).
  • The images in collection 200 show object 100 at different relative poses.
  • Landmarks like landmarks 130, 131, 132, 133, 134, 135, 136, ... appear at different locations in different images — if they appear at all.
  • In some of the images, landmarks 133, 134 are obscured by the remainder of object 100.
  • In others, landmarks 131, 135, 137 are obscured by the remainder of object 100.
  • FIG. 3 is a flowchart of a computer-implemented process 300 for processing photographic images of an object, such as images 205, 210, 215, 220 (FIG. 2).
  • Process 300 can be performed by one or more data processing devices that perform data processing activities. The activities of process 300 can be performed in accordance with the logic of a set of machine-readable instructions, a hardware assembly, or a combination of these and/or other instructions.
  • Process 300 produces a landmark detector that has been trained using machine learning techniques to identify landmarks in photographic images of an object.
  • The identified landmarks can be used in a variety of different image processing applications, including pose estimation, image classification, feature extraction, pattern recognition, and others.
  • Process 300 can thus be performed independently or as part of a larger collection of activities. For example, process 300 can be performed in conjunction with process 400 (FIG. 4).
  • Although the present specification refers to photographic or real “images of an object,” these images are generally not images of a single physical instance of an object. Rather, the images of an object are generally images of several different instances of different objects that share common visually-identifiable characteristics. Examples include different instances of a make and model of a car or of an appliance, different instances of an animal taxonomic group (e.g., instances of a species or of a gender of a species), and different instances of an organ (e.g., x-ray images of femurs from 100 different humans).
  • The photographic or real images can be, e.g., digital photographic images or, alternatively, can be formed using X-rays, sound, or another imaging modality. The images can be in either digital or analog format.
  • The device performing process 300 identifies a 3D model of a physical object that appears in one or more images that are to be processed.
  • The 3D model can represent the object in three-dimensional space, generally divorced from any frame of reference.
  • 3D models can be created manually, algorithmically (procedural modeling), or by scanning real objects. Surfaces in a 3D model may be defined with texture mapping.
  • A single 3D model will include several different constituent parts.
  • Parts of an object are pieces or volumes of the object and are generally distinguished from other pieces or volumes of the object, e.g., on the basis of function and/or structure.
  • The parts of an automobile can include, e.g., bumpers, wheels, body panels, hoods, and windshields.
  • The parts of an organ can include, e.g., chambers, valves, cavities, lobes, canals, membranes, vasculature, and the like.
  • The parts of a plant can include roots, stems, leaves, and flowers.
  • The 3D model may itself be divided into 3D models of the constituent parts.
  • For example, a 3D model of an automobile generated using computer-aided design (CAD) software may be an assembly of 3D CAD models of the constituent parts.
  • Alternatively, a 3D model can start as a unitary whole that is subdivided into constituent parts.
  • For example, a 3D model of an organ can be divided into various constituent parts under the direction of a medical or other professional.
  • Data that identifies the object that appears in the image(s) can be received from a human user.
  • For example, a human user can indicate the make, model, and year of an automobile that appears in the image(s).
  • As another example, a human user can indicate the identity of a human organ or the species of a plant that appears in the image(s).
  • Alternatively, the object can be identified using image classification techniques. For example, a convolutional neural network can be trained to output a classification label for an object or a part of an object in an image.
  • A 3D model of the object can be identified in a variety of different ways. For example, a pre-existing library of 3D models can be searched using data that identifies the object. Alternatively, a manufacturer of a product can be requested to provide a 3D model, or a physical object can be scanned.
  • The device performing process 300 annotates landmarks that appear on the 3D model. As discussed above, these landmarks are positions of interest on the 3D model and can be identified and annotated on the 3D model.
  • FIG. 4 is a flowchart of a computer-implemented process 400 for annotating landmarks that appear on a 3D model.
  • Process 400 can be performed by one or more data processing devices that perform data processing activities, e.g., in accordance with the logic of a set of machine-readable instructions, a hardware assembly, or a combination of these and/or other instructions.
  • Process 400 can be performed in isolation or in conjunction with other activities. For example, process 400 can be performed at 310 in process 300 (FIG. 3).
  • The system performing process 400 renders a collection of two-dimensional images of the object using a 3D model of the object that is formed of constituent parts.
  • The two-dimensional images are not actual images of a real-world object. Rather, the two-dimensional images can be thought of as surrogates for images of the real-world object. These surrogate two-dimensional images show the object from a variety of different angles and orientations — as if a camera were imaging the object from a variety of different relative poses.
  • The two-dimensional images can be rendered using the 3D model in a number of ways. For example, ray tracing or other computer graphic techniques can be used. In general, the 3D model of the object is perturbed for rendering the surrogate two-dimensional images.
  • The perturbations can mimic real-world variations in the objects — or parts of the objects — that are represented by the 3D model.
  • For example, for a 3D model of an automobile, the colors of the exterior paint and the interior decor can be perturbed.
  • Parts (e.g., tires and hubcaps) and features like roof carriers can also be varied.
  • For a 3D model of an organ, physiologically relevant size and relative size variations can be used to perturb the 3D model.
  • Aspects other than the 3D model can be perturbed to further vary the two-dimensional images, as shown in the sketch after this list.
  • These perturbations can mimic real-world variations in, e.g.:
  • imaging devices (e.g., camera resolution, zoom, focus, aperture speed),
  • image processing (e.g., digital data compression, chroma subsampling), and
  • imaging conditions (e.g., lighting, weather, background colors and shapes).
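  • To make the non-geometric perturbations concrete, the following minimal sketch jitters exposure and recompresses a rendered image with OpenCV, mimicking variation in imaging devices and in lossy image processing. The specific transforms, parameter ranges, and function names are illustrative assumptions rather than details taken from the specification.

```python
import cv2
import numpy as np

def perturb_render(image, rng=None):
    """Apply illustrative imaging-device and image-processing perturbations."""
    rng = rng or np.random.default_rng()
    # Mimic exposure/contrast variation across imaging devices.
    jittered = cv2.convertScaleAbs(image,
                                   alpha=rng.uniform(0.7, 1.3),
                                   beta=rng.uniform(-25, 25))
    # Mimic lossy image processing via JPEG recompression (lower quality
    # settings also introduce stronger chroma subsampling artifacts).
    quality = int(rng.integers(40, 95))
    _, buffer = cv2.imencode(".jpg", jittered,
                             [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buffer, cv2.IMREAD_COLOR)
```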
  • The two-dimensional images are rendered in a frame of reference.
  • The frame of reference can include background features that appear behind the object and foreground features that appear in front of — and possibly obscure part of — the object.
  • In general, the frame of reference will reflect the real-world environment in which the object is likely to be found. For example, an automobile may be rendered in a frame of reference that resembles a parking lot, whereas an organ may be rendered in a physiologically relevant context.
  • The frame of reference can also be varied to further vary the two-dimensional images.
  • In general, the two-dimensional images are highly variable.
  • The number of surrogate two-dimensional images — and the extent of the variations — can depend on the complexity of the object and the image processing that is ultimately to be performed using the annotated landmarks on the 3D model.
  • For example, 2000 or more highly variable (in relative pose and permutation) surrogate two-dimensional images of an automobile can be rendered. Because the two-dimensional images are rendered from a 3D model, perfect knowledge about the position of the object in the two-dimensional images can be retained regardless of the number of two-dimensional images and the extent of variation.
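  • The following minimal sketch illustrates why this perfect knowledge is available: when the camera intrinsics and the relative pose used for a render are known, projecting the model-space landmarks yields their exact pixel positions, which can serve directly as ground-truth annotations. All names and camera parameters here are illustrative assumptions.

```python
import numpy as np

def project_landmarks(landmarks_3d, R, t, K):
    """Project (N, 3) model-space landmarks to pixel coordinates.

    R, t are the world-to-camera rotation and translation of the render
    (the known relative pose); K is the 3x3 camera intrinsics matrix.
    """
    cam = landmarks_3d @ R.T + t      # model space -> camera space
    uv = cam @ K.T                    # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide -> pixels

# One synthetic view: because (R, t, K) are chosen by the renderer, the
# returned positions are exact ground truth, however many views are made.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 10.0])
model_landmarks = np.random.rand(8, 3)  # stand-in for annotated 3D landmarks
print(project_landmarks(model_landmarks, R, t, K))
```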
  • The system performing process 400 assigns each region of an object shown in the two-dimensional images to a part of the object.
  • As discussed above, a 3D model of an object can be divided into distinguishable constituent parts on the basis of function and/or structure.
  • During rendering, the part to which each region in the two-dimensional image belongs can be preserved.
  • The regions — which can be pixels or other areas in the two-dimensional image — can thus be assigned to corresponding constituent parts of the 3D model with perfect knowledge derived from the 3D model.
  • The system performing process 400 determines distinguishable regions of the parts in the two-dimensional images.
  • A distinguishable region of a part is an area (e.g., a pixel or group of pixels) that can be identified in the surrogate two-dimensional images using one or more image processing techniques. For example, in some implementations, corners of the regions in each image that are assigned to the same part are detected using, e.g., a Moravec corner detector or a Harris corner detector (https://en.wikipedia.org/wiki/Harris_corner_detector). As another example, an image feature detection algorithm such as SIFT, SURF, or HOG (https://en.wikipedia.org/wiki/Scale-invariant_feature_transform) can be used to define distinguishable regions.
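  • A minimal sketch of this corner-detection step, assuming OpenCV's Harris detector and a per-part mask exported alongside each render (the file names and response threshold are illustrative):

```python
import cv2
import numpy as np

def part_corners(gray, part_mask, rel_thresh=0.01):
    """Return (x, y) pixel coordinates of Harris corners inside one part."""
    response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    # Keep strong corner responses that fall inside this part's region.
    corners = (response > rel_thresh * response.max()) & (part_mask > 0)
    ys, xs = np.nonzero(corners)
    return np.stack([xs, ys], axis=1)

render = cv2.imread("render_0001.png", cv2.IMREAD_GRAYSCALE)  # surrogate image
hood_mask = cv2.imread("render_0001_hood.png", cv2.IMREAD_GRAYSCALE)  # one part
print(part_corners(render, hood_mask)[:10])
```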
  • The system performing process 400 identifies a collection of landmarks in the 3D model by projecting the distinguishable regions in the two-dimensional images back onto the 3D model. Volumes on the 3D model that correspond to the distinguishable regions in the two-dimensional images are identified as landmarks on the 3D model.
  • One or more filtering techniques can be applied to reduce the number of these landmarks and to ensure quality — either before or after back-projection onto the 3D model. For example, in some implementations, regions that are close to an outer boundary of the object in the surrogate two-dimensional image can be discarded prior to back-projection. As another example, back-projections of regions that are too distant from a corresponding part in the 3D model can be discarded. In some implementations, only volumes on the 3D model that satisfy a threshold standard are identified as landmarks. The threshold can be determined in a number of ways. For example, the volumes that are candidate landmarks on the 3D model and identified by back-projection from different two-dimensional images rendered with different relative poses and perturbations can be collected.
  • Clusters of candidate landmarks can be identified and outlier candidate landmarks can be discarded.
  • For example, clustering techniques such as the OPTICS algorithm (https://en.wikipedia.org/wiki/OPTICS_algorithm), a variation of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN), can be used to identify clusters of candidate landmarks.
  • The effectiveness of the clustering can be evaluated using, e.g., the Calinski-Harabasz index (i.e., the Variance Ratio Criterion) or another criterion.
  • The clustering techniques can be selected and/or tailored (e.g., by tailoring hyper-parameters of the clustering algorithm) to improve the effectiveness of clustering.
  • Candidate landmarks that are in a cluster and closer together than a threshold can be merged.
  • Clusters of candidate landmarks that are on different parts of the 3D model can also be merged into a single cluster.
  • The barycenter of several candidate landmarks in a cluster can be designated as a single landmark, as in the sketch below.
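  • A hedged sketch of this clustering-and-merging step, using scikit-learn's OPTICS implementation as one plausible choice; candidates labeled as outliers are discarded and each remaining cluster is reduced to its barycenter (the parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS

def merge_candidate_landmarks(candidates_3d, min_samples=5):
    """Cluster back-projected candidates; return one barycenter per cluster."""
    labels = OPTICS(min_samples=min_samples).fit(candidates_3d).labels_
    barycenters = [candidates_3d[labels == c].mean(axis=0)
                   for c in sorted(set(labels)) if c != -1]  # -1 = outliers
    return np.array(barycenters)

candidates = np.random.rand(500, 3)  # stand-in for back-projected candidates
print(merge_candidate_landmarks(candidates).shape)
```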
  • The landmarks in the 3D model can also be filtered on the basis of the accuracy with which their position in surrogate two-dimensional images rendered from the 3D model can be predicted. For example, if the position of a 3D landmark in a two-dimensional image is too difficult to predict (e.g., incorrectly predicted above a threshold percent of the time or predicted only with poor accuracy), then that 3D landmark can be discarded. As a result, only 3D landmarks with positions in two-dimensional images that the landmark predictor can predict relatively easily will remain.
  • The number of landmarks that are identified can be tailored to a particular data processing activity.
  • The number of landmarks can be tailored in a number of ways, including, e.g., the filtering and merging techniques described above.
  • FIGS. 5A, 5B, 5C, 5D show example results from the annotation of landmarks on a 3D model, namely, a 3D model of an automobile.
  • FIGS. 5A, 5C show side and front views of a 3D model of an automobile.
  • FIGS. 5B, 5D show side and front views of the same 3D model, but with a collection of landmark annotations 505.
  • Each landmark annotation 505 is schematically represented as a white dot on the 3D model.
  • Landmark annotations 505 tend to be positioned at the corners of different parts of the automobile, including the corners of the windshield, side windows, and grillwork.
  • FIG. 6 is a flowchart of a process 600 for producing a landmark detector that is capable of detecting landmarks in real two-dimensional images using an annotated 3D model.
  • Process 600 can be performed by one or more data processing devices that perform data processing activities, e.g., in accordance with the logic of a set of machine-readable instructions, a hardware assembly, or a combination of these and/or other instructions.
  • Process 600 can be performed in isolation or in conjunction with other activities. For example, process 600 can be performed after 310 in process 300 (FIG. 3).
  • The system performing process 600 renders a collection of two-dimensional images of the object using an annotated 3D model of the object. Ray tracing or other computer graphic techniques can be used. As before, it is generally desirable that the two-dimensional images are as variable as possible. A variety of different relative poses and/or perturbations in the object, the imaging device, image processing, and imaging conditions can be used to generate a diverse collection of two-dimensional images. In implementations where process 600 is performed in conjunction with process 400, new renderings need not be generated. Rather, existing renderings can simply be annotated by adding appropriate annotations with perfect knowledge derived from the 3D model.
  • The system performing process 600 trains a machine learning model for landmark detection in real-world two-dimensional images using the two-dimensional images rendered using the annotated 3D model of the object.
  • An example machine learning model for landmark detection is Detectron2, available at https://github.com/facebookresearch/detectron2.
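  • A hedged sketch of how such training might be set up with Detectron2's keypoint R-CNN, assuming the rendered images and their landmark annotations have first been exported in COCO keypoint format. The dataset names, paths, landmark count, and solver settings are illustrative assumptions, not details prescribed by the specification.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

NUM_LANDMARKS = 60  # illustrative; set to the number of annotated 3D landmarks

# Rendered images with auto-generated landmark annotations, exported to
# COCO keypoint format (hypothetical paths).
register_coco_instances("car_renders_train", {},
                        "renders/train.json", "renders/images")
meta = MetadataCatalog.get("car_renders_train")
meta.keypoint_names = [f"lm_{i}" for i in range(NUM_LANDMARKS)]
meta.keypoint_flip_map = []  # symmetric landmark pairs could be listed here

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")  # warm start
cfg.DATASETS.TRAIN = ("car_renders_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_KEYPOINT_HEAD.NUM_KEYPOINTS = NUM_LANDMARKS
cfg.TEST.KEYPOINT_OKS_SIGMAS = [0.05] * NUM_LANDMARKS
cfg.SOLVER.MAX_ITER = 20000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```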
  • The system performing process 600 applies the machine learning model for two-dimensional landmark detection that has been trained using the surrogate two-dimensional images in a particular type of image processing. Further, the same machine learning model can be further trained by rejecting certain results of the image processing as incorrect.
  • Landmark detection can be used, e.g., in image classification, feature extraction, pattern recognition, pose estimation, and projection.
  • A training set that is developed using the surrogate two-dimensional images rendered from the 3D model can be used to further train the machine learning model for landmark detection for the particular image processing.
  • For example, the two-dimensional landmark detection machine learning model can be applied to pose recognition.
  • The correspondences between landmarks detected on the surrogate two-dimensional images and landmarks on the 3D model can be used to determine the relative poses of the object in the surrogate two-dimensional images.
  • Those pose predictions can be reviewed to invalidate poses that do not satisfy certain criteria.
  • In some implementations, the criteria for invalidating a pose prediction are established based on the criteria used when rendering the surrogate two-dimensional images from the 3D model.
  • Predicted poses that fall outside those constraints can be labeled as incorrect and used, e.g., as negative examples in further training of the machine learning model for landmark detection.
  • Alternatively or in addition, the predicted poses can be limited by criteria that are independent of any criteria used when rendering the surrogate two-dimensional images from the 3D model.
  • For example, the predicted poses can be limited to poses that are likely to be found in real-world pose prediction. Poses that are rejected under such criteria would not necessarily be useful as negative examples, but can rather simply be omitted since landmark detection need not be performed outside of realistic conditions.
  • For example, predicted poses can be constrained, e.g., to a defined range of distances between the camera and the object (e.g., between 1-20 meters) and/or a defined range of roll along the axis between the camera and the center of the object (e.g., less than +/- 10 degrees).
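  • An illustrative filter implementing constraints of this kind; the distance and roll bounds follow the example values above, and the roll extraction assumes one particular camera convention:

```python
import numpy as np

def plausible_pose(R, t, min_dist=1.0, max_dist=20.0, max_roll_deg=10.0):
    """Accept a candidate pose only if it falls inside realistic bounds.

    R, t: rotation matrix and translation of the object in camera space.
    The roll extraction below assumes a Z-forward camera convention;
    adapt it to the camera model actually in use.
    """
    distance = np.linalg.norm(t)  # camera-to-object distance in meters
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))  # rotation about view axis
    return min_dist <= distance <= max_dist and abs(roll) <= max_roll_deg
```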
  • Computer-implemented techniques can also be used to reject pose predictions as incorrect.
  • For example, a variety of computer-implemented techniques including computer graphic techniques (e.g., ray tracing) and computer vision techniques (e.g., semantic segmentation and active contour models) can be used to identify the boundary of an object. If the boundary of the object identified by such a technique does not match the boundary of the object that would result from the predicted pose, the predicted pose can be rejected as incorrect.
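  • One plausible formulation of this boundary check compares the object mask produced by a segmentation technique with the silhouette rendered at the predicted pose, using intersection-over-union as the match score (the threshold is an assumption):

```python
import numpy as np

def boundary_consistent(segmented_mask, rendered_mask, min_iou=0.8):
    """Compare a segmented object mask against the silhouette rendered at
    the predicted pose; reject the pose when the overlap is too low."""
    intersection = np.logical_and(segmented_mask, rendered_mask).sum()
    union = np.logical_or(segmented_mask, rendered_mask).sum()
    return union > 0 and intersection / union >= min_iou
```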
  • Process 600 can thus further tailor the landmark-detection machine learning model to a particular type of image processing without reliance on real images during training.
  • FIG. 8 is a histogram that represents the accuracy of an example machine learning model for landmark detection that has been produced using process 600 (FIG. 6).
  • Position along the y-axis indicates the count of landmarks.
  • Position along the x-axis indicates the average distance over all images between a) the position of each two-dimensional landmark in a surrogate two-dimensional image — as predicted by the machine learning model — and b) the actual position of the corresponding two-dimensional landmark in the surrogate two-dimensional image — as calculated by ray tracing from the corresponding 3D model.
  • The distance is normalized by the diagonal length of a rectangle that fully contains the automobile at the same relative pose as in the surrogate two-dimensional image.
  • For example, a distance of 0.1 indicates that the predicted position of the two-dimensional landmark is 10% of the size of the automobile away from the actual position of that landmark, as calculated from the 3D model.
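  • A minimal sketch of this normalized error metric, assuming the bounding rectangle of the object in each image is available:

```python
import numpy as np

def normalized_error(pred_xy, true_xy, bbox):
    """Landmark error divided by the diagonal of the object's bounding
    rectangle; bbox is (x_min, y_min, x_max, y_max) in the same image."""
    diagonal = np.hypot(bbox[2] - bbox[0], bbox[3] - bbox[1])
    return np.linalg.norm(np.asarray(pred_xy) - np.asarray(true_xy)) / diagonal
```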
  • FIG. 7 is a flowchart of a process 700 for recognizing relative poses between an imaging device and an object using a machine-learning model for landmark detection.
  • Process 700 can be performed by one or more data processing devices that perform data processing activities, e.g., in accordance with the logic of a set of machine-readable instructions, a hardware assembly, or a combination of these and/or other instructions.
  • Process 700 can be performed in isolation or in conjunction with other activities.
  • For example, process 700 can be performed after process 600 (FIG. 6), using a machine learning model for landmark detection that is produced in that process and tailored to pose recognition.
  • The pose recognition implemented by process 700 provides a high-quality estimation of the relative pose of the camera and an object that is at least partially visible in a real-world two-dimensional image.
  • The system performing process 700 detects landmarks on a real two-dimensional image of an object using a machine-learning model for landmark detection.
  • In some implementations, the machine learning model for landmark detection is produced using process 600 (FIG. 6).
  • The landmarks detected in the real two-dimensional image will be two-dimensional landmarks.
  • The system performing process 700 filters the detected two-dimensional landmarks to yield one or more subsets of the detected landmarks.
  • The filtering can include determining a correspondence between the two-dimensional landmarks detected in the real image and the three-dimensional landmarks present on the 3D model of the object.
  • For example, a collection of pairs of two-dimensional landmarks (detected in the real image) and three-dimensional landmarks (present on the 3D model of the object) can be determined.
  • Various filtering operations can be used to prefilter these pairs and yield subset(s) of the detected landmarks and corresponding landmarks on the 3D model. For example, two-dimensional landmarks from the real image that are close to the outer boundary of the object in the real image can be removed.
  • The boundary of the object can be identified in a variety of different ways, including, e.g., computer vision techniques. In some instances, the boundary of the object can be detected using the same landmarks detected by the machine-learning model for landmark detection at 705.
  • As another example, two-dimensional landmarks detected by the machine-learning model that are close to one another in the real two-dimensional image can be filtered at random so that at least one landmark remains in the vicinity, as in the sketch below.
  • The distance between two-dimensional landmarks can be measured, e.g., in pixels.
  • Two-dimensional landmarks can be designated as close if their separation is, e.g., 2% of the width or height of the image or less, or 1% of the width or height of the image or less.
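  • An illustrative implementation of this random de-duplication; the threshold follows the example percentages above, and the greedy random scan is one way of guaranteeing that at least one landmark remains in each vicinity:

```python
import numpy as np

def dedupe_close_landmarks(landmarks_2d, image_wh, frac=0.02, rng=None):
    """Randomly drop detections closer than frac * max(image size),
    keeping at least one detection in each neighborhood.

    landmarks_2d: (N, 2) array of detected pixel positions.
    """
    rng = rng or np.random.default_rng()
    threshold = frac * max(image_wh)
    kept = []
    for idx in rng.permutation(len(landmarks_2d)):  # random visit order
        point = landmarks_2d[idx]
        if all(np.linalg.norm(point - landmarks_2d[j]) >= threshold
               for j in kept):
            kept.append(idx)
    return sorted(kept)  # indices of the surviving landmarks
```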
  • As another example, one or more landmarks on the 3D model can be swapped with other, symmetric landmarks on the 3D model.
  • For an automobile, for example, landmarks on the 3D model at the passenger’s side can be swapped with landmarks at the driver’s side.
  • For other objects, correspondingly tailored swapping of landmarks can be used.
  • The system performing process 700 calculates one or more candidate relative poses for the camera and the object using the subset(s) of the detected landmarks.
  • The relative poses can be calculated in a variety of different ways.
  • For example, a computer vision approach such as SolvePnP with random sample consensus (available in the OpenCV library, https://docs.opencv.org/4.4.0/d9/d0c/group__calib3d.html#ga549c2075fac14829ff4a58bc931c033d) can be used to solve the so-called “perspective-n-point problem” and calculate a relative pose based on pairs of two-dimensional and three-dimensional landmarks.
  • Such computer vision approaches tend to be resilient to outliers, i.e., pairs of landmarks where the detected 2D landmark location is far from the actual location.
  • However, computer vision approaches are often not resilient enough to consistently overcome common imperfections in landmark detectors, including, e.g., two-dimensional landmarks that are invisible in a real image but are predicted to be in the corners of the real image or at the edges of the object, landmarks that cannot reliably be identified as either visible or hidden behind the object, predictions of two-dimensional landmarks that are either unreliable or inaccurate, symmetric landmarks that are exchanged for one another, visually similar landmarks that are detected at the same location, and detection of multiple, clustered landmarks in regions with complex local structure.
  • By filtering the detected landmarks into subsets as described above, the system performing process 700 can avoid these issues.
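  • A minimal sketch of this step using OpenCV's RANSAC-based PnP solver; the subset construction and variable names are illustrative:

```python
import cv2
import numpy as np

def candidate_poses(subsets, camera_matrix, dist_coeffs=None):
    """Compute one candidate pose per landmark subset.

    subsets: iterable of (pts_3d (N, 3), pts_2d (N, 2)) landmark pairs,
    i.e., 3D-model landmarks matched to detected 2D landmarks.
    """
    poses = []
    for pts_3d, pts_2d in subsets:
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            pts_3d.astype(np.float32), pts_2d.astype(np.float32),
            camera_matrix, dist_coeffs)
        if ok and inliers is not None and len(inliers) >= 4:
            rotation, _ = cv2.Rodrigues(rvec)  # rotation vector -> matrix
            poses.append((rotation, tvec.ravel()))
    return poses
```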
  • The system performing process 700 filters the candidate relative pose(s) calculated using the subset(s) of the detected landmarks.
  • The filtering can be based on a set of criteria that define potentially acceptable poses for the object in the real image.
  • In general, the criteria reflect real-world conditions in which the real image is likely to be taken and can be tailored according to the nature of the object. For example, for candidate relative poses in which the object is an automobile:
  • the camera should be at an altitude of between 0 and 5 meters relative to the ground under the automobile,
  • the camera should be within 20 m of the automobile,
  • the position of two-dimensional landmarks in the estimated pose should be consistent with the positions of the corresponding landmarks on the 3D model, e.g., as determined by back-projection of the two-dimensional landmarks in the real image onto the 3D model, and
  • the boundary of the object identified by another technique should largely match the boundary of the object that would result from the predicted pose.
  • If a candidate relative pose does not satisfy such criteria, then it can be discarded or otherwise excluded from subsequent data processing activities.
  • The system performing process 700 estimates the relative pose of the object in the real image based on the remaining (unfiltered) candidate relative poses, as in the sketch below. For example, if only a single candidate relative pose remains, it can be considered to be the final estimate of the relative pose. As another example, if multiple candidate relative poses remain, a difference between the candidate relative poses can be determined and used to conclude that the relative pose has been reasonably estimated. The remaining candidate relative poses can then be averaged or otherwise combined to estimate the relative pose.
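  • A hedged sketch of combining the surviving candidates: translations are averaged directly, rotations with a chordal mean (SciPy's Rotation.mean is used here as one reasonable choice), and a large disagreement between candidates is treated as an unreliable estimate. The spread threshold is an assumption:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def combine_poses(poses, max_spread=0.5):
    """poses: list of (3x3 rotation, 3-vector translation) candidates."""
    rotations = Rotation.from_matrix(np.stack([R for R, _ in poses]))
    translations = np.stack([t for _, t in poses])
    spread = np.linalg.norm(
        translations - translations.mean(axis=0), axis=1).max()
    if spread > max_spread:
        return None  # candidates disagree; treat the estimate as unreliable
    return rotations.mean().as_matrix(), translations.mean(axis=0)
```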
  • FIG. 9 is a histogram that represents the accuracy of relative pose predictions made using process 700 (FIG. 7).
  • Position along the y-axis indicates the count of pose predictions.
  • Position along the x-axis indicates the error in each of the pose predictions as a distance, in cm, between the predicted relative camera position and the ground-truth camera position. In this histogram, the angle of the camera is not taken into account.
  • Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus.
  • The program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them.
  • While a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal.
  • The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
  • The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.
  • The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing.
  • The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them.
  • The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.
  • A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment.
  • A computer program may, but need not, correspond to a file in a file system.
  • A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
  • A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output.
  • The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data.
  • Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • However, a computer need not have such devices.
  • Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
  • Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on the user's device in response to requests received from the web browser.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This disclosure concerns methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing images that involves the annotation of landmarks on two-dimensional images. In one aspect, methods are performed by data processing apparatus for training a device for estimating the relative pose of an imaging device and an object in a two-dimensional image. The methods include identifying a 3D model of the object, identifying landmarks on the 3D model of the object, projecting the 3D model into a collection of two-dimensional images with knowledge of the location of the landmarks from the 3D model on the projection, and training a landmark-detection machine learning model to identify the landmarks in the collection of two-dimensional images. The landmark-detection machine learning model is part of a device for estimating the relative pose of an imaging device.
PCT/EP2022/051810 2021-02-02 2022-01-26 Annotation of two-dimensional images WO2022167298A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
KR1020237029993A KR20230138011A (ko) 2021-02-02 2022-01-26 Annotation of two-dimensional images
CN202280012892.XA CN116868240A (zh) 2021-02-02 2022-01-26 Annotation of two-dimensional images
EP22708359.9A EP4288941A1 (fr) 2021-02-02 2022-01-26 Annotation of two-dimensional images

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GR20210100068 2021-02-02
US17/179,596 2021-02-19
US17/179,596 US20220245860A1 (en) 2021-02-02 2021-02-19 Annotation of two-dimensional images

Publications (1)

Publication Number Publication Date
WO2022167298A1 (fr)

Family

ID=80683242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/051810 WO2022167298A1 (fr) 2021-02-02 2022-01-26 Annotation of two-dimensional images

Country Status (3)

Country Link
EP (1) EP4288941A1 (fr)
KR (1) KR20230138011A (fr)
WO (1) WO2022167298A1 (fr)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANG YU ET AL: "Beyond PASCAL: A benchmark for 3D object detection in the wild", IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, IEEE, 24 March 2014 (2014-03-24), pages 75 - 82, XP032609963, DOI: 10.1109/WACV.2014.6836101 *
ZHANG SHANXIN ET AL: "Vehicle global 6-DoF pose estimation under traffic surveillance camera", ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, AMSTERDAM [U.A.] : ELSEVIER, AMSTERDAM, NL, vol. 159, 26 November 2019 (2019-11-26), pages 114 - 128, XP085961741, ISSN: 0924-2716, [retrieved on 20191126], DOI: 10.1016/J.ISPRSJPRS.2019.11.005 *

Also Published As

Publication number Publication date
KR20230138011A (ko) 2023-10-05
EP4288941A1 (fr) 2023-12-13

Similar Documents

Publication Publication Date Title
Menze et al. Object scene flow
Menze et al. Object scene flow for autonomous vehicles
Dame et al. Dense reconstruction using 3D object shape priors
Martin et al. Real time head model creation and head pose estimation on consumer depth cameras
US9523772B2 (en) Object removal using lidar-based classification
US8175412B2 (en) Method and apparatus for matching portions of input images
US20230085384A1 (en) Characterizing and improving of image processing
WO2018177159A1 (fr) Procédé et système de détermination de position d'objet mobile
CN111328396A (zh) 用于图像中的对象的姿态估计和模型检索
CN114022830A (zh) 一种目标确定方法以及目标确定装置
Bhuyan Computer vision and image processing: Fundamentals and applications
Choi et al. Multi-view reprojection architecture for orientation estimation
JP6052533B2 (ja) 特徴量抽出装置および特徴量抽出方法
Cui et al. Dense depth-map estimation based on fusion of event camera and sparse LiDAR
US20220245860A1 (en) Annotation of two-dimensional images
Lee et al. independent object detection based on two-dimensional contours and three-dimensional sizes
Swadzba et al. Tracking objects in 6D for reconstructing static scenes
EP4288941A1 (fr) Annotation d'images bidimensionnelles
Konno et al. Incremental multi-view object detection from a moving camera
Walczak et al. Locating occupants in preschool classrooms using a multiple RGB-D sensor system
CN116868240A (zh) 对二维图像的标注
Liu et al. A study of chained stochastic tracking in RGB and depth sensing
US20230206493A1 (en) Processing images of objects and object portions, including multi-object arrangements and deformed objects
WO2023036631A1 (fr) Caractérisation et amélioration du traitement d'image
Nair A voting algorithm for dynamic object identification and pose estimation

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 22708359
Country of ref document: EP
Kind code of ref document: A1

WWE Wipo information: entry into national phase
Ref document number: 202280012892.X
Country of ref document: CN

ENP Entry into the national phase
Ref document number: 20237029993
Country of ref document: KR
Kind code of ref document: A

WWE Wipo information: entry into national phase
Ref document number: 1020237029993
Country of ref document: KR

NENP Non-entry into the national phase
Ref country code: DE

ENP Entry into the national phase
Ref document number: 2022708359
Country of ref document: EP
Effective date: 20230904