US20190026896A1 - Segmentation and representation network for pose discrimination - Google Patents

Segmentation and representation network for pose discrimination Download PDF

Info

Publication number
US20190026896A1
US20190026896A1 (application US15/652,593)
Authority
US
United States
Prior art keywords
image
network
images
segmentation
descriptor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/652,593
Other versions
US10186038B1 (en)
Inventor
Stefan Kluckner
Vivek Kumar Singh
Shanhui Sun
Oliver Lehmann
Kai Ma
Jiangping Wang
Terrence Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens Healthineers AG
Original Assignee
Siemens Healthcare GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Healthcare GmbH filed Critical Siemens Healthcare GmbH
Priority to US15/652,593 (US10186038B1)
Assigned to SIEMENS MEDICAL SOLUTIONS USA, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SUN, SHANHUI; MA, KAI; SINGH, VIVEK; WANG, JIANGPING; CHEN, TERRENCE; LEHMANN, OLIVER; KLUCKNER, STEFAN
Assigned to SIEMENS HEALTHCARE GMBH. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SIEMENS MEDICAL SOLUTIONS USA, INC.
Application granted
Publication of US10186038B1
Publication of US20190026896A1
Assigned to Siemens Healthineers AG. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SIEMENS HEALTHCARE GMBH
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/42Document-oriented image-based pattern recognition based on the type of document
    • G06V30/422Technical drawings; Geographical maps

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A system and method includes creation of a combined network comprising an image segmentation network and an image representation network, the combined network to generate an image descriptor based on an input query image, training of the combined network based on a plurality of first images and a segmentation mask associated with each of the plurality of first images, reception of a first input query image, use of the combined network to generate an image descriptor based on the first input query image, determination of a matching image descriptor from a plurality of stored image descriptors, determination of a camera pose associated with the matching image descriptor, registration of the first input query image with image data based on the determined camera pose, generation of a composite image based on the registered first input query image and image data, and presentation of the composite image.

Description

    BACKGROUND
  • Depth cameras are used in many applications, including but not limited to gaming, manufacturing and medical imaging. Conventional depth cameras provide the ability to acquire a detailed representation of a scene in a compact and easy-to-implement manner. From a single stationary position, a depth camera acquires image data which consists of a two-dimensional image (e.g., a two-dimensional RGB image, in which each pixel is assigned a Red, a Green and a Blue value), and a depth image, in which the value of each pixel corresponds to a depth or distance of the pixel from the depth camera. This image data, consisting of a two-dimensional image and a depth image, will be referred to herein as a two-dimensional depth image.
  • It is often desirable to register two-dimensional depth images with one another. Registration may facilitate the association of portions of a two-dimensional depth image with features of a corresponding model of an imaged object, the tracking of an imaged object through multiple successively-acquired two-dimensional depth images, and many other use cases.
  • An object of interest may be embedded in a cluttered environment, such as an operating/examination room or a production floor, and two-dimensional depth images thereof may therefore include many background structures. These structures hinder the ability to identify the object of interest and perform accurate registration of the two-dimensional depth image with other image data (e.g., a computer-aided design (CAD) model) of the object.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of a system to determine a camera pose;
  • FIG. 2 is a block diagram illustrating a system to determine a camera pose according to some embodiments;
  • FIG. 3 is a flow diagram of a process to create a combined image segmentation and representation network according to some embodiments;
  • FIG. 4 is a diagram illustrating training of a segmentation network according to some embodiments;
  • FIG. 5 illustrates pairs of images and segmentation masks according to some embodiments;
  • FIG. 6 is a diagram illustrating training of a descriptor generation network according to some embodiments;
  • FIG. 7 illustrates a segmentation network according to some embodiments;
  • FIG. 8 illustrates a descriptor generation network according to some embodiments;
  • FIG. 9 illustrates fine-tuning of a combined segmentation and representation network according to some embodiments;
  • FIG. 10 illustrates fine-tuning of a combined segmentation and representation network according to some embodiments;
  • FIG. 11 is a block diagram illustrating generation of a database of poses and corresponding image descriptors according to some embodiments;
  • FIG. 12 is a flow diagram of a process to register images according to some embodiments; and
  • FIG. 13 illustrates a system according to some embodiments.
  • DETAILED DESCRIPTION
  • The following description is provided to enable any person in the art to make and use the described embodiments and sets forth the best mode contemplated for carrying out the described embodiments. Various modifications, however, will remain apparent to those in the art.
  • FIG. 1 illustrates system A for determining a camera pose associated with an input query image imageq. Imageq is an image of an object, and a camera pose is a position at a particular distance vector and orientation with respect to the object. The camera pose associated with imageq is therefore the position and orientation of a hypothetical camera which captured imageq. By knowing the camera pose, imageq can be registered against other image data of the object, such as a CAD drawing.
  • System A uses database B, which was pre-populated with descriptors generated from each of many images of the object, where each image represents a different camera pose. Each descriptor is stored in association with the camera pose represented by the image from which the descriptor was generated. The different images may be rendered from a three-dimensional CAD model of the object as is known, and the descriptors may be generated by descriptor network C.
  • In operation, descriptor network C generates descriptor Dq based on imageq. In the present example, it is assumed that descriptor Dq most closely resembles descriptor D2 of database B. Since descriptor D2 is associated with pose Pose2 in database B, Pose2 is output. As described above, system A is susceptible to errors caused by background structures present within imageq.
  • According to some embodiments, a compact representation (e.g., a descriptor) of an image is generated which is primarily influenced by foreground elements of the image. A network according to some embodiments includes a trained segmentation network to segment an image into a foreground region of interest, and a trained representation network to generate a representation based on the segmented image. The representation enables an efficient identification of a camera pose using a database which associates such representations with corresponding camera poses.
  • FIG. 2 illustrates system 100 according to some embodiments. Again, the goal of system 100 is to determine a camera pose based on input image imageq. Segmentation network 110 receives image imageq and generates maskq based thereon. Maskq is intended to distinguish between the foreground and background of image imageq. Segmentation masks as described herein are not necessarily binary and may comprise a likelihood map where each pixel stores the probability of that pixel being a foreground or background pixel. Training of segmentation network 110 to achieve this function will be described in detail below.
  • Operator 120 applies maskq to imageq to generate masked image 130. According to some embodiments, image 130 primarily includes one or more foreground objects of interest. Image 130 is received by descriptor generation network 140, which generates descriptor Dq based thereon. Training of descriptor generation network 140 according to some embodiments will be described below.
  • Descriptor Dq is compared against the descriptors of database 150 to determine a match. Generation of database 150 according to some embodiments will be described below. In the illustrated example, matches are determined with descriptors D3 and D2, with descriptor D3 being a “closer” match. Accordingly, system 100 outputs Pose3 and Pose2, the camera poses associated with the determined descriptors. As described above, a determined camera pose may be used to register other image data (presumably of an object depicted in imageq) with imageq.
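  • As a rough illustration of this inference path, the following PyTorch-style sketch segments the query image, masks it, computes descriptor Dq, and returns the camera poses of the nearest stored descriptors. The function and module names (query_pose, seg_net, desc_net) and the brute-force distance search are assumptions for illustration; the patent does not prescribe a particular framework or search method.

```python
import torch

def query_pose(image_q, seg_net, desc_net, db_descriptors, db_poses, k=2):
    """Return the k camera poses whose stored descriptors best match image_q.

    image_q:        (1, 1, H, W) two-dimensional depth image tensor
    db_descriptors: (N, m) tensor of stored descriptors
    db_poses:       list of N camera poses, aligned with db_descriptors
    """
    with torch.no_grad():
        mask_q = seg_net(image_q)          # foreground likelihood map, (1, 1, H, W)
        masked = image_q * mask_q          # operator 120: element-wise masking
        d_q = desc_net(masked)             # descriptor D_q, shape (1, m)
        dists = torch.cdist(d_q, db_descriptors).squeeze(0)      # distance to each stored descriptor
        best = torch.topk(dists, k, largest=False).indices       # k closest matches
    return [db_poses[i] for i in best.tolist()]
```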
  • FIG. 3 is a flow diagram of process 300 according to some embodiments. Process 300 and the other processes described herein may be performed using any suitable combination of hardware, software or manual means. Software embodying these processes may be stored by any non-transitory tangible medium, including a fixed disk, a floppy disk, a CD, a DVD, a Flash drive, or a magnetic tape. Embodiments are not limited to the examples described below.
  • According to some embodiments, two networks are trained to perform image segmentation and representation. In order to avoid local minima and for weight initialization, one of the networks is trained separately for segmentation and the other network is trained separately for representation (i.e., generation of a representation of an image). The trained networks are combined to learn segmentation and representation jointly using two different loss functions.
  • Referring to process 300, a segmentation network is trained at S310 based on a plurality of segmentation mask and two-dimensional depth image pairs. FIG. 4 is a block diagram illustrating system 400 to perform S310 according to some embodiments. System 400 includes database 410 storing pairs of two-dimensional depth images I1 through In and associated masks M1 through Mn.
  • Segmentation network 430 is configured to generate a foreground mask based on a received image. To train segmentation network 430, images I1 through In are each processed by segmentation network 430 in order to generate a respective mask SMn corresponding to each image. Loss layer 440 determines the cumulative difference between each generated mask SMn and its corresponding “ground truth” mask Mn. Segmentation network 430 is modified based on the cumulative difference as is known in the art, and the process repeats until the cumulative difference is below a threshold or some other criterion (e.g., number of iterations) is met.
  • Segmentation network 430 may implement a fully convolutional network architecture which performs semantic segmentation at the pixel level over the entire image domain. The first part of network 430 may be similar to an AlexNet structure, but embodiments are not limited thereto. The second part of network 430 may include a deconvolution step in which individual responses are up-sampled to full image resolution. A cross-entropy loss function, aggregated over the pixels, may be used to optimize the segmentation mask based on the input information. Alternatively, segmentation network 430 may implement an encoder-decoder network, such as but not limited to SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.
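  • The following is a minimal sketch of S310 under the assumptions above: a small fully convolutional encoder-decoder whose up-sampled output is trained against the ground-truth masks with a pixel-wise cross-entropy loss. The layer sizes, optimizer and hyperparameters are placeholders, not values taken from the patent.

```python
import torch
import torch.nn as nn

class SegmentationNet(nn.Module):
    """Toy fully convolutional encoder-decoder producing a foreground likelihood map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # "deconvolution step": up-sample responses back to full image resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))

def train_segmentation(net, loader, epochs=10, lr=1e-4):
    """loader yields (depth_image, ground_truth_mask) pairs, as in database 410."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    bce = nn.BCELoss()                       # pixel-wise cross-entropy, aggregated over pixels
    for _ in range(epochs):
        for image, mask in loader:
            loss = bce(net(image), mask)     # compare generated mask SMn to ground-truth Mn
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```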
  • The training data of database 410 may be generated by rendering two-dimensional depth images from three-dimensional CAD data of an object which is intended to be imaged, such as a piece of equipment. Rendering quality and characteristics should correspond to the quality and characteristics (e.g. the noise model and reconstruction process) of the sensor type of the depth camera which is expected to be used.
  • The two-dimensional depth images of the training data may include random objects in the scene to simulate various realistic setups. Typically, background structures are further away from the camera than the objects of interest in the foreground. A foreground mask Mn is also generated for each training image In. According to some embodiments, each training image is rendered from the same camera pose twice: once without background structures and once with background structures. A binary segmentation mask may be generated at the pixel level from these two renderings, as sketched below. To increase robustness to background structures, several masks may be generated for the same camera pose using images with different background arrangements. FIG. 5 shows two rendered two-dimensional depth images including background data (left) and their corresponding binary segmentation masks (right) generated according to some embodiments.
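  • A short sketch of how such a ground-truth mask might be derived from the two renderings of the same camera pose; the depth convention (zero where no surface is hit) and the tolerance value are assumptions rather than details given in the patent.

```python
import numpy as np

def foreground_mask(depth_clean, depth_cluttered, tol=1e-3):
    """Mark as foreground the pixels where the clutter-free object is still visible.

    depth_clean:     rendering of the object only (0 where no surface is hit)
    depth_cluttered: rendering of the same pose with background structures added
    """
    object_present = depth_clean > 0
    # the object is still the closest surface at this pixel (background lies farther away)
    not_occluded = np.abs(depth_cluttered - depth_clean) < tol
    return (object_present & not_occluded).astype(np.uint8)
```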
  • The generated depth images may be rendered over an expected space of camera poses. For example, the virtual camera poses used for rendering are located in a realistic way to simulate the target scenario (e.g., how a user, a moving vehicle or a static camera may observe the scene). According to some embodiments, virtual camera poses are derived from prior information, such as from a real test scenario in which an operator is asked to capture images as it would be performed during an inspection task. Similar poses can be generated based on this reconstructed prior and augmented by perturbations.
  • In some embodiments, many potential virtual camera poses are determined. Based on defined requirements such as minimum and maximum distance, visibility of particular parts, etc., invalid poses may be identified and discarded from the final set of views prior to computation. Prior information may be introduced to reduce the search space based on the expected camera setup.
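  • One hypothetical way to generate and filter such virtual camera poses is sketched below; the spherical sampling scheme and the specific distance and elevation bounds are illustrative assumptions only.

```python
import numpy as np

def sample_camera_poses(n, d_min=1.0, d_max=3.0, max_elevation_deg=80.0, seed=0):
    """Sample look-at camera positions around the object origin and drop invalid ones."""
    rng = np.random.default_rng(seed)
    poses = []
    while len(poses) < n:
        azimuth = rng.uniform(0.0, 2.0 * np.pi)
        elevation = rng.uniform(0.0, np.radians(max_elevation_deg))
        distance = rng.uniform(0.5 * d_min, 1.5 * d_max)   # over-sample, then filter
        if not (d_min <= distance <= d_max):               # defined requirement: distance bounds
            continue
        position = distance * np.array([
            np.cos(elevation) * np.cos(azimuth),
            np.cos(elevation) * np.sin(azimuth),
            np.sin(elevation),
        ])
        poses.append({"position": position, "look_at": np.zeros(3)})
    return poses
```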
  • At S320, a segmented two-dimensional depth image is generated based on each segmentation mask and two-dimensional depth image pair. FIG. 6 illustrates the above-described training data of database 410. In some embodiments of S320, each mask is applied to its paired image to result in corresponding segmented images SIn.
  • Next, at S330, a descriptor generation network is trained based on the segmented two-dimensional depth images and on proximities between poses associated with the segmented two-dimensional depth images. Training initially consists of generating an m-dimensional descriptor for each input segmented image.
  • Sampling/loss layer 620 samples several triplets consisting of a descriptor of a first segmented image representing a first camera pose, a descriptor of a second segmented image known to represent a camera pose which is similar in location and/or orientation to the first camera pose, and a descriptor of a third segmented image known to represent a camera pose which is dissimilar in location and/or orientation to the first camera pose. These relative proximities are known and may be determined from training data of database 410. Similarities and dissimilarities in camera poses may be based on spatial distance in camera position, overlap scoring of image content, distance computations taking into account the six degrees of freedom of the camera pose space, etc.
  • Sampling/loss layer 620 evaluates each triplet to ensure that the descriptors of the first and second segmented images are “closer” to each other in the m-dimensional space than the descriptors of the first and third segmented images. This evaluation may comprise evaluation of a loss function (e.g., Σ_∀x L(x, p(x), n(x))), and layer 620 back-propagates the determined loss to descriptor generation network 610 to minimize the loss. The foregoing iterates until the loss reaches an acceptable level, at which point descriptor generation network 610 is considered trained. According to some embodiments, the loss function is represented as:

  • L = L_triplet + L_pairwise + ‖w‖₂²
  • where L_triplet denotes the triplet loss function and L_pairwise denotes the pairwise loss function. The third term is a regularization term to enforce a smooth solution. A triplet is defined as (π, π⁺, π⁻), where π is one camera pose sampling point, π⁺ is a camera pose close to pose π, and π⁻ is a camera pose which is not close to pose π.
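  • A sketch of this combined loss, assuming a standard margin-based triplet term and a squared-distance pairwise term (the patent does not fix these forms), with explicit weight decay standing in for the ‖w‖₂² regularization term; the margin and weight-decay values are placeholders.

```python
import torch
import torch.nn.functional as F

def triplet_loss(x, x_close, x_far, margin=0.2):
    """Push descriptor x closer to x_close (similar pose) than to x_far (dissimilar pose)."""
    d_pos = F.pairwise_distance(x, x_close)
    d_neg = F.pairwise_distance(x, x_far)
    return F.relu(d_pos - d_neg + margin).mean()

def pairwise_loss(x, x_pair):
    """Pull descriptors of (nearly) identical poses together, e.g. clean vs. noisy segmentation."""
    return F.pairwise_distance(x, x_pair).pow(2).mean()

def descriptor_loss(net, anchor_img, close_img, far_img, pair_img, weight_decay=1e-4):
    """L = L_triplet + L_pairwise + weight_decay * ||w||_2^2, evaluated on one sampled triplet batch."""
    x, x_p, x_n, x_s = net(anchor_img), net(close_img), net(far_img), net(pair_img)
    reg = weight_decay * sum(w.pow(2).sum() for w in net.parameters())
    return triplet_loss(x, x_p, x_n) + pairwise_loss(x, x_s) + reg
```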
  • FIG. 7 illustrates an implementation of a segmentation network according to some embodiments, and FIG. 8 illustrates an implementation of a descriptor generation network according to some embodiments. After each convolution layer and fully connected layer, a ReLU layer is also applied. The output layer is a group of nodes which are fully connected to the previous layer.
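  • A toy descriptor generation network in this spirit is sketched below. Only the convolution and fully connected stages with a ReLU after each layer, and a fully connected m-dimensional output, are carried over from the description; the channel counts, pooling and the value of m are placeholders.

```python
import torch.nn as nn

class DescriptorNet(nn.Module):
    """Toy descriptor generation network: convolution stages, then fully connected
    layers, each followed by a ReLU, with a fully connected output of dimension m."""
    def __init__(self, m=64, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, m),                      # m-dimensional descriptor
        )

    def forward(self, x):
        return self.head(self.features(x))
```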
  • In some embodiments, each network's task (i.e., segmentation or representation) could be handled either as (1) a classification problem, where each pose defines a particular object class, or (2) a data reconstruction problem, where the input is abstracted to unique signatures. Approach (2) is described herein as it may exhibit better scaling capabilities with high numbers of potential virtual viewpoints.
  • At S340, a combined network is created including the trained segmentation network and the trained descriptor generation network. Next, at S350, the trained segmentation network and the trained descriptor generation network of the combined network are trained based on the two-dimensional depth images and on segmented two-dimensional depth images.
  • According to some embodiments, the combined training at S350 benefits the learning of a robust representation (i.e., descriptor) which is suitably invariant to background structures. Difficulties in the representation problem may be addressed by the segmentation network and errors in the segmentation problem may be compensated by the representation network.
  • FIG. 9 illustrates combined network 900 according to some embodiments. As shown, a known imageq is passed through segmentation network 430 to generate a mask as described above, and a segmented image is obtained by element-wise operation 910 on the query image and the mask. Descriptor generation network 610 generates a descriptor x based on the segmented image and uses pre-segmented triplet images yielding p(x) (the “close” descriptor) and n(x) (the “far” descriptor), as well as pre-segmented “close” images yielding s(x) and t(x), to generate a loss term which is back-propagated to fine-tune network 610 against segmentation errors in the query image. The segmentation loss is also determined and back-propagated to segmentation network 430 by layer 930 as described above to improve segmentation network 430 as well.
  • System 900 may be beneficial in a use case in which the representation is to be used to match against a database of segmented images as described above. In some embodiments, the segmented query image can also be passed along a pair1 stream with its ground truth segmented image as a pair2 stream. This arrangement uses the regularization loss term to further encourage the network to map the ground truth segmented image and the noisy segmented image to similar representations.
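  • One possible wiring of this joint fine-tuning step is sketched below, reusing the triplet_loss and pairwise_loss helpers sketched earlier. The loss weighting and the use of a single optimizer over both networks' parameters are assumptions; the patent only specifies that both loss terms are back-propagated.

```python
import torch
import torch.nn as nn

def joint_finetune_step(seg_net, desc_net, opt, batch, seg_weight=1.0, desc_weight=1.0):
    """One fine-tuning step of the combined network (element-wise masking variant).

    batch: query image, its ground-truth mask, and pre-segmented close/far/pair images.
    opt:   optimizer over the parameters of both seg_net and desc_net.
    """
    image, gt_mask, close_img, far_img, pair_img = batch
    pred_mask = seg_net(image)
    segmented = image * pred_mask                      # element-wise operation 910
    x = desc_net(segmented)
    x_p, x_n, x_s = desc_net(close_img), desc_net(far_img), desc_net(pair_img)

    seg_loss = nn.functional.binary_cross_entropy(pred_mask, gt_mask)   # segmentation loss (layer 930)
    rep_loss = triplet_loss(x, x_p, x_n) + pairwise_loss(x, x_s)        # representation loss
    loss = seg_weight * seg_loss + desc_weight * rep_loss
    opt.zero_grad()
    loss.backward()                                    # gradients reach both networks
    opt.step()
    return loss.item()
```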
  • According to some embodiments, the combined network receives a segmented image through an additional input channel, as opposed to using segmented images generated by applying the segmentation mask to the input image. System 1000 of FIG. 10 is one implementation of such a combined network, in which concatenation layer 1010 substitutes for element-wise operation 910. Network 610 is trained with two-channel inputs (i.e., a two-dimensional depth image and a segmentation mask) instead of only the two-dimensional depth image.
  • In some embodiments, system 1000 avoids an early hard decision on the segmentation mask, which would remove the pixels predicted as background from further consideration and would therefore not be tolerant of segmentation errors. Rather, system 1000 allows the segmentation to be provided as a likelihood map and allows representation network 610 to capture relevant information jointly from the segmentation map and the input image.
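  • In code terms, the change from system 900 to system 1000 amounts to replacing the element-wise masking with a channel concatenation; the snippet below is a minimal sketch of that substitution.

```python
import torch

def two_channel_input(image, likelihood_mask):
    """Concatenate the depth image and the segmentation likelihood map along the
    channel axis (concatenation layer 1010) instead of masking the image early."""
    return torch.cat([image, likelihood_mask], dim=1)   # shape (B, 2, H, W)

# The descriptor network then takes two input channels, e.g. DescriptorNet(in_channels=2),
# and can weigh image content against segmentation confidence on its own.
```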
  • FIG. 11 illustrates use of the trained networks to create a camera pose database for use, for example, as described above with respect to FIG. 2. As described above, input data for system 1100 may be generated by rendering two-dimensional depth images from three-dimensional CAD data 1110 of an object which is intended to be imaged using rendering engine 1120. The two-dimensional depth images are segmented by network 1130 and a descriptor is generated therefrom by network 1140. The descriptor generated from an input depth image is stored in database 1150 in association with an indicator of the camera pose of the depth image. Database 1150 may be indexed and subsequently searched by an indexing algorithm based on K-nearest neighbor search or based on an approximated nearest neighbor method. For example, K=5 ranked images may be utilized for reducing a false negative rate. Several candidate camera poses may be presented to a user, who then provides a final selection.
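  • A sketch of building and querying such a database follows. Here render_depth is a hypothetical stand-in for rendering engine 1120, and scikit-learn's NearestNeighbors is only one possible choice of index; the patent requires only that descriptors be stored with their poses and searched by a (approximate) K-nearest-neighbor method.

```python
import numpy as np
import torch
from sklearn.neighbors import NearestNeighbors

def build_pose_database(poses, render_depth, seg_net, desc_net):
    """render_depth(pose) -> (1, 1, H, W) depth tensor rendered from the CAD model (assumed)."""
    descriptors, stored_poses = [], []
    with torch.no_grad():
        for pose in poses:
            image = render_depth(pose)
            segmented = image * seg_net(image)
            descriptors.append(desc_net(segmented).squeeze(0).numpy())
            stored_poses.append(pose)
    index = NearestNeighbors(n_neighbors=5).fit(np.stack(descriptors))  # K=5 ranked matches
    return index, stored_poses

def lookup_poses(index, stored_poses, d_q, k=5):
    """Return the k camera poses whose stored descriptors are closest to query descriptor d_q."""
    _, idx = index.kneighbors(d_q.reshape(1, -1), n_neighbors=k)
    return [stored_poses[i] for i in idx[0]]
```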
  • FIG. 12 is a flow diagram of process 1200 to utilize a combined trained network as described herein. Process 1200 will be described with respect to system 200 of FIG. 2, but embodiments are not limited thereto. Initially, a two-dimensional depth image is acquired at S1210. The two-dimensional depth image may be acquired using a depth camera, and the image may represent an examination room, an operating room, an imaging or treatment room, a manufacturing space, or any other suitable scene. The scene may include one or more pieces of equipment in view of which system 200 was trained. For example, if the scene comprises a CT-scanning room, the training data may have included two-dimensional depth images of a CT scanner.
  • System 200 generates descriptor Dq at S1220 based on the acquired image, segmentation network 110 and descriptor generation network 140. Next, at S1230, a corresponding descriptor of database 150 is identified. According to some embodiments, database 150 may be searched using a nearest neighbor method to identify the closest-matching descriptors. A corresponding camera pose is determined for each identified descriptor at S1240. A user or further algorithm may then select a camera pose from the identified matches.
  • An image of an object which corresponds to the identified camera pose is determined at S1250. S1250 may comprise rendering an image of the object (i.e., an object located in the originally-acquired image) from the viewpoint of the camera pose based on a CAD model of the object. The image may then be registered against the original image at S1260, in order to generate a composite image at S1270 and to display the composite image at S1280.
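  • Process 1200 can be summarized in code as follows, reusing the lookup_poses helper sketched above. Here render_from_pose and register_images are hypothetical helpers for the CAD rendering (S1250) and registration (S1260) steps, which the patent does not tie to a specific algorithm, and the alpha blend is only a simple stand-in for composite generation.

```python
import torch

def process_1200(depth_image, seg_net, desc_net, index, stored_poses,
                 render_from_pose, register_images, alpha=0.5):
    """S1210-S1280 sketched as one function; the two helper callables are assumed."""
    with torch.no_grad():
        d_q = desc_net(depth_image * seg_net(depth_image))       # S1220: descriptor D_q
    candidates = lookup_poses(index, stored_poses,
                              d_q.squeeze(0).numpy())            # S1230/S1240: K ranked poses
    pose = candidates[0]                                         # user or algorithm selects a pose
    rendered = render_from_pose(pose)                            # S1250: render CAD model at pose
    aligned = register_images(rendered, depth_image)             # S1260: register to original image
    composite = alpha * depth_image + (1 - alpha) * aligned      # S1270: composite image
    return composite                                             # S1280: ready for display
```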
  • According to some embodiments, the image determined at S1250 may comprise a two-dimensional map of part labels of an object of interest. For example, the map may be generated based on the determined camera pose and overlaid on the original image to assist a user in identifying parts of the object. The part labels may be selectable and may encode metadata such as an index to a database entry associated with the part.
  • Some embodiments may be used to identify objects by registration in real time. Such identification may assist in understanding the location and movement of objects during routine processes. Robust identification of objects may assist in collision avoidance or navigation.
  • FIG. 13 illustrates system 1 according to some embodiments. System 1 may be operated to acquire two-dimensional depth images, determine a camera pose and/or register images according to some embodiments. Embodiments are not limited to system 1 for performing these functions.
  • System 1 includes x-ray imaging system 10, scanner 20 a, control and processing system 30, and operator terminal 50. According to some embodiments, system 1 includes two or more scanners, and example locations and orientations thereof are illustrated as scanner 20 b and scanner 20 c.
  • Generally, and according to some embodiments, X-ray imaging system 10 acquires two-dimensional X-ray images of a patient volume and scanner 20 a acquires two-dimensional depth images of a patient. Control and processing system 30 controls X-ray imaging system 10 and scanner 20 a, and receives the acquired images therefrom. Control and processing system 30 processes the depth images to determine a camera pose and to register an image against the acquired image as described above. Such images may be presented to a user by terminal 50.
  • Imaging system 10 comprises a CT scanner including X-ray source 11 for emitting X-ray beam 12 toward opposing radiation detector 13. Embodiments are not limited to CT data or to CT scanners. X-ray source 11 and radiation detector 13 are mounted on gantry 14 such that they may be rotated about a center of rotation of gantry 14 while maintaining the same physical relationship therebetween.
  • Radiation source 11 may comprise any suitable radiation source, including but not limited to a Gigalix™ x-ray tube. In some embodiments, radiation source 11 emits electron, photon or other type of radiation having energies ranging from 50 to 150 keV. Radiation detector 13 may comprise any system to acquire an image based on received x-ray radiation.
  • To generate X-ray images, patient 15 is positioned on bed 16 to place a portion of patient 15 between X-ray source 11 and radiation detector 13. Next, X-ray source 11 and radiation detector 13 are moved to various projection angles with respect to patient 15 by using rotation drive 17 to rotate gantry 14 around cavity 18 in which patient 15 is positioned. At each projection angle, X-ray source 11 is powered by high-voltage generator 19 to transmit X-ray radiation 12 toward detector 13. Detector 13 receives the radiation and produces a set of data (i.e., a raw X-ray image) for each projection angle.
  • Scanner 20 a may comprise a depth camera. Scanner 20 a may acquire depth images as described above. A depth camera may comprise a structured light-based camera (e.g., Microsoft Kinect or ASUS Xtion), a stereo camera, or a time-of-flight camera (e.g., Creative TOF camera) according to some embodiments.
  • System 30 may comprise any general-purpose or dedicated computing system. Accordingly, system 30 includes one or more processors 31 configured to execute processor-executable program code to cause system 30 to operate as described herein, and storage device 40 for storing the program code. Storage device 40 may comprise one or more fixed disks, solid-state random access memory, and/or removable media (e.g., a thumb drive) mounted in a corresponding interface (e.g., a USB port).
  • Storage device 40 stores program code of system control program 41. One or more processors 31 may execute system control program 41 to move gantry 14, to move table 16, to cause radiation source 11 to emit radiation, to control detector 13 to acquire an image, and to control scanner 20 to acquire an image. In this regard, system 30 includes gantry interface 32, radiation source interface 33 and depth scanner interface 35 for communication with corresponding units of system 10.
  • Two-dimensional X-ray data acquired from system 10 may be stored in data storage device 40 as CT images 43, in DICOM or another data format. Each image 43 may be further associated with details of its acquisition, including but not limited to time of acquisition, imaging plane position and angle, imaging position, radiation source-to-detector distance, patient anatomy imaged, patient position, contrast medium bolus injection profile, x-ray tube voltage, image resolution and radiation dosage. CT images 43 may also include three-dimensional CT images reconstructed from corresponding two-dimensional CT images as is known in the art.
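  • Assuming that CT images 43 are stored in the DICOM format and that the pydicom package is available, the acquisition details associated with an image could be read as sketched below; the attribute names follow the DICOM standard:

    # Read a few acquisition details from a stored DICOM image (sketch only).
    import pydicom

    def acquisition_details(path):
        ds = pydicom.dcmread(path)
        return {
            "acquisition_time": getattr(ds, "AcquisitionTime", None),
            "tube_voltage_kvp": getattr(ds, "KVP", None),            # x-ray tube voltage
            "patient_position": getattr(ds, "PatientPosition", None),
            "resolution": (ds.Rows, ds.Columns),
        }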
  • Device 40 also stores two-dimensional depth images 44 acquired by scanner 20. In some embodiments, a two-dimensional depth image 44 may be associated with a set of CT images 43, in that the associated images/frames were acquired at similar times while patient 15 was lying in substantially the same position.
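  • One simple way to realize such an association, assuming per-image acquisition timestamps in seconds and an arbitrary tolerance, is sketched below:

    # Associate a depth image with CT images acquired at approximately the same time.
    def associate(depth_time, ct_times, tolerance_s=5.0):
        """depth_time: acquisition time of the depth image; ct_times: dict mapping
        CT image identifiers to acquisition times (both in seconds)."""
        return [img_id for img_id, t in ct_times.items() if abs(t - depth_time) <= tolerance_s]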
  • One or more processors 31 may execute system control program 41 to determine a camera pose based on a received image as described above. System control program 41 may therefore implement the trained segmentation and representation networks described above, and may utilize pose database 45 to identify camera poses based on generated descriptors.
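  • The descriptor-based lookup against pose database 45 may, for example, be a nearest-neighbour search; the sketch below assumes Euclidean distance over fixed-length descriptors and a database of 4×4 camera poses, neither of which is mandated by the embodiments described above:

    # Nearest-neighbour lookup of a camera pose from a generated image descriptor.
    import numpy as np

    def lookup_pose(query_descriptor, db_descriptors, db_poses):
        """query_descriptor: length-D vector; db_descriptors: NxD array;
        db_poses: sequence of N 4x4 camera poses stored in pose database 45."""
        distances = np.linalg.norm(db_descriptors - query_descriptor[None, :], axis=1)
        best = int(np.argmin(distances))            # closest stored descriptor
        return db_poses[best], distances[best]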
  • Terminal 50 may comprise a display device and an input device coupled to system 30. Terminal 50 may display any of CT images 43, two-dimensional depth images 44, or images registered as described herein, and may receive user input for controlling display of the images, operation of imaging system 10, and/or the processing described herein. In some embodiments, terminal 50 is a separate computing device such as, but not limited to, a desktop computer, a laptop computer, a tablet computer, and a smartphone.
  • Each of system 10, scanner 20, system 30 and terminal 50 may include other elements which are necessary for the operation thereof, as well as additional elements for providing functions other than those described herein.
  • According to the illustrated embodiment, system 30 controls the elements of system 10. System 30 also processes images received from system 10. Moreover, system 30 receives input from terminal 50 and provides images to terminal 50. Embodiments are not limited to a single system performing each of these functions. For example, system 10 may be controlled by a dedicated control system, with the acquired frames and images being provided to a separate image processing system over a computer network or via a physical storage medium (e.g., a DVD).
  • Embodiments are not limited to a CT scanner and a depth scanner as described above with respect to FIG. 13. For example, embodiments may employ any other imaging modalities (e.g., a magnetic resonance scanner, a positron-emission scanner, etc.) for acquiring surface data.
  • Those skilled in the art will appreciate that various adaptations and modifications of the above-described embodiments can be configured without departing from the scope and spirit of the claims. Therefore, it is to be understood that the claims may be practiced other than as specifically described herein.

Claims (17)

1. A system comprising:
a data store storing a plurality of first images and a segmentation mask associated with each of the plurality of first images; and
a processor to:
create a combined network comprising an image segmentation network and an image representation network, the combined network to generate an image descriptor based on an input query image;
train the combined network based on a plurality of first images and a segmentation mask associated with each of the plurality of first images;
receive a first input query image;
use the combined network to generate an image descriptor based on the first input query image;
determine a matching image descriptor from a plurality of stored image descriptors;
determine a camera pose associated with the matching image descriptor;
register the first input query image with image data based on the determined camera pose; and
generate a composite image based on the registered first input query image and image data; and
a display to present the composite image,
wherein training the combined network comprises:
for each of the plurality of first images:
generation of a generated segmentation mask using the image segmentation network; evaluating a loss function comprising a comparison between the segmentation mask associated with the first image and the generated segmentation mask; and modifying the image segmentation network based on the evaluated loss function; and
generation of an image descriptor based on each generated segmentation mask, evaluating a second loss function based on the image descriptor and on a plurality of descriptors having a known relationship to the first image; and modifying the image representation network based on the evaluated second loss function.
2. A system according to claim 1, wherein the first images and the first input query image are two-dimensional depth images.
3. A system according to claim 1, the processor further to:
prior to training the combined network, train the image segmentation network based on the plurality of first images and the segmentation mask associated with each of the plurality of first images, the image segmentation network to identify a foreground of a segmentation input image; and
prior to training the combined network, train the image representation network based on second images and on differences between camera poses associated with the second images, the image representation network to generate an image descriptor based on a representation input image.
4. (canceled)
5. A system according to claim 1, wherein generation of an image descriptor based on each generated segmentation mask comprises:
generation of a segmented image based on the generated segmentation mask and the first image; and
generation of an image descriptor based on the segmented image.
6. A system according to claim 1, wherein generation of an image descriptor based on each generated segmentation mask comprises:
concatenation of the generated segmentation mask and the first image; and
generation of the image descriptor based on the concatenated generated segmentation mask and first image.
7. A method, comprising:
creating a combined network comprising an image segmentation network to segment an image and an image representation network to generate an image descriptor based on an image;
training the combined network based on a plurality of first images, a segmentation mask associated with each of the plurality of first images, and a plurality of image descriptors of the first images;
receiving a first input query image; and
using the combined network to generate an image descriptor based on the first input query image,
wherein training the combined network comprises:
for each of the plurality of first images:
generating a generated segmentation mask using the image segmentation network; evaluating a loss function comprising a comparison between the segmentation mask associated with the first image and the generated segmentation mask; and modifying the image segmentation network based on the evaluated loss function; and
generating an image descriptor based on each generated segmentation mask, evaluating a second loss function based on the image descriptor and on a plurality of descriptors having a known relationship to the first image; and modifying the image representation network based on the evaluated second loss function.
8. A method according to claim 7, further comprising:
determining a matching image descriptor from a plurality of stored image representations;
determining a camera pose associated with the matching image descriptor;
registering the first input query image with image data based on the determined camera pose;
generating a composite image based on the registered first input query image and image data; and
presenting the composite image.
9. A method according to claim 7, wherein the first images and the first input query image are two-dimensional depth images.
10. A method according to claim 7, further comprising:
prior to training the combined network, training the image segmentation network based on the plurality of first images and the segmentation mask associated with each of the plurality of first images, the image segmentation network to identify a foreground of a segmentation input image; and
prior to training the combined network, training the image representation network based on second images and on differences between camera poses associated with the second images, the image representation network to generate an image descriptor based on a representation input image.
11. (canceled)
12. A method according to claim 7, wherein generating an image descriptor based on each generated segmentation mask comprises:
generating a segmented image based on the generated segmentation mask and the first image; and
generating an image descriptor based on the segmented image.
13. A non-transitory computer-readable medium storing processor-executable process steps, the process steps executable by a processor to cause a system to:
train an image segmentation network based on a plurality of first images and a segmentation mask associated with each of the plurality of first images, the image segmentation network to identify a foreground of a segmentation input image;
train an image representation network based on second images and on differences between camera poses associated with the second images, the image representation network to generate an image descriptor based on a representation input image;
create a combined network comprising the image segmentation network and the image representation network;
train the image segmentation network and the image representation network of the combined network substantially simultaneously based on the plurality of first images, the segmentation mask associated with each of the plurality of first images, and a plurality of image descriptors of the first images;
receive a first input query image; and
use the combined network to generate an image descriptor based on the first input query image,
wherein training the combined network comprises:
for each of the plurality of first images:
generating a generated segmentation mask using the image segmentation network; evaluating a loss function comprising a comparison between the segmentation mask associated with the first image and the generated segmentation mask; and modifying the image segmentation network based on the evaluated loss function; and
generating an image descriptor based on each generated segmentation mask, evaluating a second loss function based on the image descriptor and on a plurality of descriptors having a known relationship to the first image; and modifying the image representation network based on the evaluated second loss function.
14. A medium according to claim 13, the process steps executable by a processor to cause a system to:
determine a matching image descriptor from a plurality of stored image representations;
determine a camera pose associated with the matching image descriptor;
register the first input query image with image data based on the determined camera pose;
generate a composite image based on the registered first input query image and image data; and
present the composite image.
15. A medium according to claim 13, wherein the first images and the first input query image are two-dimensional depth images.
16. (canceled)
17. A medium according to claim 13, wherein generating an image descriptor based on each generated segmentation mask comprises:
generating a segmented image based on the generated segmentation mask and the first image; and
generating an image descriptor based on the segmented image.
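The joint training recited in claims 1, 7 and 13 above, in which a first loss is evaluated against the stored segmentation mask and a second loss is evaluated over descriptors having a known relationship to the first image, can be illustrated by the following minimal PyTorch-style sketch; the module structure, the triplet formulation of the descriptor loss and all hyperparameters are assumptions made for illustration only:

    # One illustrative joint training step for the combined network (sketch only).
    import torch
    import torch.nn.functional as F

    def training_step(seg_net, repr_net, optimizer, image, gt_mask, pos_desc, neg_desc):
        """image: 1xCxHxW first image; gt_mask: 1x1xHxW stored segmentation mask;
        pos_desc/neg_desc: descriptors with a known relationship to the first image."""
        optimizer.zero_grad()
        pred_mask = seg_net(image)                              # segmentation network
        seg_loss = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)
        segmented = image * torch.sigmoid(pred_mask)            # mask applied to the input
        descriptor = repr_net(segmented)                        # representation network
        desc_loss = F.triplet_margin_loss(descriptor, pos_desc, neg_desc)
        (seg_loss + desc_loss).backward()                       # modify both networks jointly
        optimizer.step()
        return seg_loss.item(), desc_loss.item()

Whether gradients of the descriptor loss are also allowed to reach the segmentation network (as in the single backward pass above) or are stopped at the generated mask is a design choice not dictated by the claims.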
US15/652,593 2017-07-18 2017-07-18 Segmentation and representation network for pose discrimination Active US10186038B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/652,593 US10186038B1 (en) 2017-07-18 2017-07-18 Segmentation and representation network for pose discrimination

Publications (2)

Publication Number Publication Date
US10186038B1 US10186038B1 (en) 2019-01-22
US20190026896A1 true US20190026896A1 (en) 2019-01-24

Family

ID=65011521

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/652,593 Active US10186038B1 (en) 2017-07-18 2017-07-18 Segmentation and representation network for pose discrimination

Country Status (1)

Country Link
US (1) US10186038B1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400360A (en) * 2019-07-25 2019-11-01 北京航空航天大学 A kind of sound wave transition time detection method based on full convolutional neural networks
CN110689527A (en) * 2019-09-18 2020-01-14 北京航空航天大学 Method, device and equipment for detecting installation state of aircraft cable bracket
CN112525145A (en) * 2020-11-30 2021-03-19 北京航空航天大学 Aircraft landing relative attitude dynamic vision measurement method and system
US11116145B2 (en) 2018-03-30 2021-09-14 Greensight Argonomics, Inc. Automated optimization of agricultural treatments based on raster image data system
US20210319539A1 (en) * 2020-04-13 2021-10-14 GE Precision Healthcare LLC Systems and methods for background aware reconstruction using deep learning
US11205073B2 (en) * 2018-03-30 2021-12-21 Greensight Agronomics, Inc. System to automatically detect and report changes over time in a large imaging data set
US11235874B2 (en) 2018-03-30 2022-02-01 Greensight Agronomics, Inc. Automated drone-based spraying system

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242449B2 (en) * 2017-01-04 2019-03-26 Cisco Technology, Inc. Automated generation of pre-labeled training data
CN109493417B (en) * 2018-10-31 2023-04-07 深圳大学 Three-dimensional object reconstruction method, device, equipment and storage medium
US11164319B2 (en) 2018-12-20 2021-11-02 Smith & Nephew, Inc. Machine learning feature vector generator using depth image foreground attributes
CN110222636B (en) * 2019-05-31 2023-04-07 中国民航大学 Pedestrian attribute identification method based on background suppression
US11557038B2 (en) * 2019-10-17 2023-01-17 Electronics And Telecommunications Research Institute Apparatus and method for x-ray data generation
US11100372B2 (en) 2019-11-08 2021-08-24 Ford Global Technologies, Llc Training deep neural networks with synthetic images
GB202002266D0 (en) * 2020-02-19 2020-04-01 Nw Pro Ltd Object identification system and method
CN112270686B (en) * 2020-12-24 2021-03-16 北京达佳互联信息技术有限公司 Image segmentation model training method, image segmentation device and electronic equipment
CN114979470A (en) * 2022-05-12 2022-08-30 咪咕文化科技有限公司 Camera rotation angle analysis method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9524582B2 (en) 2014-01-28 2016-12-20 Siemens Healthcare Gmbh Method and system for constructing personalized avatars using a parameterized deformable mesh
US10192361B2 (en) * 2015-07-06 2019-01-29 Seiko Epson Corporation Head-mounted display device and computer program
WO2017015390A1 (en) * 2015-07-20 2017-01-26 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
US10019655B2 (en) * 2016-08-31 2018-07-10 Adobe Systems Incorporated Deep-learning network architecture for object detection
US10109055B2 (en) * 2016-11-21 2018-10-23 Seiko Epson Corporation Multiple hypotheses segmentation-guided 3D object detection and pose estimation

Also Published As

Publication number Publication date
US10186038B1 (en) 2019-01-22

Legal Events

Date Code Title Description
STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

AS Assignment

Owner name: SIEMENS HEALTHINEERS AG, GERMANY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SIEMENS HEALTHCARE GMBH;REEL/FRAME:066267/0346

Effective date: 20231219