CN111191492A - Information estimation, model retrieval and model alignment methods and apparatus - Google Patents


Info

Publication number
CN111191492A
Authority
CN
China
Prior art keywords
image
model
pose
information
target model
Prior art date
Legal status
Pending
Application number
CN201811359461.2A
Other languages
Chinese (zh)
Inventor
考月英
李炜明
刘洋
汪昊
王强
朴升仁
李炯旭
Current Assignee
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd and Samsung Electronics Co Ltd
Priority to CN201811359461.2A
Priority to KR1020190087023A
Priority to US16/674,139
Publication of CN111191492A

Classifications

    • G06T 17/20: Finite element generation, e.g. wire-frame surface description, tessellation (under G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects)
    • G06V 20/20: Scenes; scene-specific elements in augmented reality scenes
    • G06N 3/045: Combinations of networks (under G06N 3/04: Neural network architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods (computing arrangements based on biological models; neural networks)
    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T 7/11: Region-based segmentation (under G06T 7/10: Segmentation; edge detection)
    • G06T 7/30: Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 2207/20081: Training; learning (indexing scheme for image analysis or enhancement: special algorithmic details)
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The present disclosure provides an information estimation method comprising acquiring a first image, and performing pose and/or keypoint estimation on the object in the first image using a neural network. The present disclosure also provides a model retrieval method comprising acquiring an image, and performing model retrieval according to the image to obtain a target model matching an object in the image. The present disclosure further provides a model alignment method comprising acquiring a first image, and aligning the object in the first image with a target model.

Description

Information estimation, model retrieval and model alignment methods and apparatus
Technical Field
The present disclosure relates to the field of augmented reality technologies, and in particular, to methods and apparatuses for information estimation, model retrieval, and model alignment.
Background
In recent years, Augmented Reality (AR) technology has received increasing attention. Augmented reality integrates real-world information with virtual-world information: AR displays not only the real world but also virtual information at the same time, and the two kinds of information complement and overlay each other, bringing a brand-new experience to the user.
One basic function of augmented reality technology is three-dimensional interaction, in which augmenting information (virtual objects, characters, etc.) is displayed superimposed on a three-dimensional object in the real world. Advanced augmented reality functions require three-dimensional information about objects; however, existing object detectors and SLAM (simultaneous localization and mapping) systems cannot provide such information.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a new information estimation apparatus and method, a new model retrieval apparatus and method, a new model alignment apparatus and method, and apparatus and methods combining any of these.
According to an aspect of the present disclosure, there is provided an information estimation method including acquiring a first image; and performing pose and/or keypoint estimation on the object in the first image using a neural network.
The neural network comprises a pose and/or keypoint estimation module and a domain adaptation module, and the network parameters of the neural network are obtained by training the pose and/or keypoint estimation module and the domain adaptation module with synthetic images and real images.
In an embodiment, the information estimation method may further include retrieving a target model according to the first image and at least one second image, where a first pose of the object in the first image and a second pose of the object in the at least one second image are different.
In an embodiment, the information estimation method may further include: acquiring image features from the first image and the at least one second image; acquiring model features of each model from the image of each model in the first pose and the image of each model in the second pose; calculating the similarity between the image features and the model features of each model; and determining a target model among the models according to the similarity.
In an embodiment, the information estimation method may further include any one of the following: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model.
The information estimation method may further include performing, using at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object, any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
According to another aspect of the present disclosure, a model retrieval method is provided, which includes acquiring an image, and performing model retrieval based on the image to obtain a target model matching an object in the image.
The images include a first image and at least one second image, a first pose of the object in the first image and a second pose of the object in the at least one second image being different.
In an embodiment, the model retrieval method may further include obtaining image features from the first image and the at least one second image, obtaining model features of each model from the image of each model in the first pose and the image of each model in the second pose, calculating the similarity between the image features and the model features of each model, and determining a target model among the models according to the similarity.
In an embodiment, the model retrieval method may further include any one of the following: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model.
In an embodiment, the model retrieval method may further include performing pose and/or keypoint estimation on the object using a neural network, where the neural network includes a pose and/or keypoint estimation module and a domain adaptation module, and network parameters of the neural network are obtained by training the two modules with synthetic images and real images.
The model retrieval method may further perform, using at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object, any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
According to yet another aspect of the present disclosure, there is provided a model alignment method, including acquiring a first image; and aligning an object in the first image with a target model.
In an embodiment, the model alignment method may specifically be performed in any one of the following manners: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model.
In an embodiment, the model alignment method may further include obtaining image features from the first image and at least one second image, obtaining model features of each model from the image of each model in the first pose and the image of each model in the second pose, calculating the similarity between the image features and the model features of each model, and determining a target model among the models according to the similarity; a first pose of the object in the first image and a second pose of the object in the at least one second image are different.
In an embodiment, the model alignment method may further include performing pose and/or keypoint estimation on the object using a neural network, where the neural network includes a pose and/or keypoint estimation module and a domain adaptation module, and network parameters of the neural network are obtained by training the two modules with synthetic images and real images.
In an embodiment, the model alignment method may further comprise performing, using at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object, any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
According to still another aspect of the present disclosure, there is provided an information estimation apparatus including: the acquisition module is used for acquiring a first image; and an estimation module for performing pose and/or keypoint estimation on the object in the first image using a neural network.
According to still another aspect of the present disclosure, there is provided a model retrieval apparatus including: the acquisition module is used for acquiring an image; and the retrieval module is used for carrying out model retrieval according to the image to obtain a target model matched with the object in the image.
According to still another aspect of the present disclosure, there is provided a model alignment apparatus including: an acquisition module for receiving a first image; and an alignment module for aligning an object in the first image with a target model.
According to yet another aspect of the present disclosure, there is provided an electronic device comprising a processor and a memory, the memory storing computer executable code which, when executed by the processor, performs any of the methods as described in embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, perform any of the methods as described in embodiments of the present disclosure.
According to the embodiments of the present disclosure: in the information estimation apparatus and method, the information estimation neural network comprises a pose and/or keypoint estimation module and a domain adaptation module, and these two modules are trained with synthetic images and real images to obtain the network parameters of the information estimation neural network, improving estimation accuracy; in the model retrieval apparatus and method, the closest target model is retrieved by extracting and comparing features of the image and of each model under multiple poses, improving the accuracy of model retrieval; in the model alignment apparatus and method, a target model is aligned with the object using the pose information of the object and the alignment is then calibrated using the keypoint information of the object and of the target model, or the target model is aligned with the object using the keypoint information of the object and the alignment is calibrated using the pose information of the object and of the target model, improving the accuracy of image-model alignment. Any of these improvements can enhance the effect of augmented reality applications.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically shows an illustrative diagram of information estimation, model retrieval and model alignment;
FIG. 2 schematically illustrates a working block diagram of an apparatus for aligning a two-dimensional image and a three-dimensional model of an object according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a workflow diagram of a method of aligning a two-dimensional image and a three-dimensional model of an object according to an embodiment of the disclosure;
FIG. 4 schematically shows a workflow diagram of an information estimation method according to an embodiment of the disclosure;
FIG. 5 schematically shows a schematic diagram of an information estimation apparatus according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a neural network schematic for information estimation, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a workflow diagram of a model retrieval method according to an embodiment of the disclosure;
FIG. 8 schematically shows a schematic diagram of a model retrieval apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a neural network schematic for model retrieval in accordance with an embodiment of the present disclosure;
FIG. 10 schematically illustrates a workflow diagram of a model alignment method according to an embodiment of the disclosure;
FIG. 11 schematically illustrates a schematic view of a model alignment apparatus according to an embodiment of the present disclosure;
FIG. 12 schematically shows a schematic diagram of model alignment according to an embodiment of the disclosure; and
FIGS. 13-16 schematically illustrate application embodiments according to embodiments of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
The "models" herein may be from an available model library comprising a large number of models, such as a shared model library or other private model library available over a network.
Domain adaptation here means that the difference between the real image domain and the synthetic image domain is taken into account when training the network model, so that its influence on the trained network model is minimized.
An object herein may refer to any object, animal, plant, or person, etc. in an image that is available for searching.
The steps in the method embodiments of the present disclosure are not limited to be executed in the manner or order given herein, and may be executed separately or after being combined as appropriate according to the application purpose. Accordingly, the modules in the device embodiments of the present disclosure are not limited to be implemented in the manner given herein, and may be implemented by being appropriately combined according to the application purpose.
Specifically, the information estimation method, the model retrieval method, and the model alignment method mentioned in the present disclosure may be used alone or in combination with each other. Likewise, the information estimation apparatus or module, the model retrieval apparatus or module, and the model alignment apparatus or module of the present disclosure may be used alone or in combination with each other.
Fig. 1 schematically shows an explanatory diagram of information estimation, model retrieval and model alignment.
As examples, fig. 1 shows an information estimation diagram for estimating the pose of a car, a three-dimensional model retrieval diagram for retrieving, for example, a three-dimensional model, and a three-dimensional model alignment diagram for aligning, for example, a three-dimensional model with a two-dimensional image, all of which are applicable to fields such as augmented reality.
In this embodiment, estimating the pose of the object means estimating its three-dimensional pose, i.e., six-degrees-of-freedom (6DOF) pose information, from the original input image as shown in fig. 1(a). The original input image may be, for example, a real two-dimensional image containing the object. The six-degrees-of-freedom pose, shown in fig. 1(b), consists of an azimuth angle a, an elevation angle e, an in-plane rotation angle θ, a distance d, and a principal point (u, v). Three-dimensional model retrieval searches for the three-dimensional model closest to the object in the two-dimensional image, as shown in fig. 1(b). Three-dimensional model alignment aligns the retrieved three-dimensional model onto the object in the two-dimensional image, as shown in fig. 1(c).
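For concreteness, the six pose parameters can be held in a small structure such as the following Python sketch; the field names mirror the symbols above, and the container itself is an illustration rather than part of the disclosure.

```python
from dataclasses import dataclass

# Convenience container for the six-degrees-of-freedom pose described above.
# Field names mirror the symbols in fig. 1(b); this container is illustrative.
@dataclass
class Pose6DOF:
    azimuth: float    # a
    elevation: float  # e
    theta: float      # in-plane rotation
    distance: float   # d
    u: float          # principal point, x coordinate
    v: float          # principal point, y coordinate
```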
Fig. 2 schematically shows a working block diagram of an apparatus for aligning a two-dimensional image and a three-dimensional model of an object according to an embodiment of the present disclosure.
The apparatus of this embodiment provides object information estimation, three-dimensional model retrieval, and three-dimensional model alignment functions. As shown in fig. 2, a two-dimensional real image containing an object is first input to the information estimation module, which estimates the pose and keypoints of the object. In this embodiment, real data and a large amount of synthetic data are used as the training data of a multi-task method that estimates the object's pose and keypoints. Real data is very limited and expensive to label, whereas labeled synthetic data is easily obtained at the scale of millions of samples, so the training data of the present disclosure incorporates synthetic data. However, synthetic data and real data differ markedly and belong to different domains. To eliminate this difference, so that the synthetic data becomes more effective training data, this embodiment adds a network structure that removes the gap between the real and synthetic domains, such as the domain classifier shown in fig. 6 and the gradient reversal layer added in front of it.
Next, the three-dimensional model retrieval module performs three-dimensional model retrieval using images generated by a generative adversarial network (GAN) under multiple poses (also referred to as multiple views, including, for example, the estimated current pose of the object and several preset specific poses).
Finally, the three-dimensional model alignment module uses the retrieved three-dimensional model to align it with the original two-dimensional real image, including an initial alignment and a further calibration using the estimated keypoints.
Fig. 3 schematically illustrates a workflow diagram of a method of aligning a two-dimensional image and a three-dimensional model of an object according to an embodiment of the disclosure.
The method of this embodiment likewise covers object information estimation, three-dimensional model retrieval, and three-dimensional model alignment. Specifically, a two-dimensional image is input and the object in the image is detected; the detected object image is sent to a trained pose estimation and keypoint localization model for pose estimation (here three pose degrees of freedom: azimuth, elevation, and in-plane rotation) and keypoint localization. Then, according to the estimated pose, synthetic images of all three-dimensional models in a large three-dimensional model library are rendered under the current pose and other preset fixed poses, while images under the other fixed poses are generated from the object image by a GAN. The original object image, the GAN-generated images, and the synthetic images of all three-dimensional models are sent into a trained three-dimensional model retrieval network to extract features, and the similarity between the features of the original object image (including features extracted from the original object image and from the GAN-generated images) and the features of the synthetic images of each three-dimensional model is calculated; the model with the highest similarity is the three-dimensional model most similar to the object in the image. Finally, the two-dimensional image and the three-dimensional model are aligned using the retrieved three-dimensional model, the estimated keypoints, and the detection result (the alignment method is described below).
The present disclosure can employ existing deep learning network structures, such as Faster R-CNN, SSD, or YOLO, for object detection.
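As a hedged illustration, an off-the-shelf detector such as torchvision's Faster R-CNN could supply this detection step; the specific library call, weights, and score threshold are assumptions, since the disclosure only names the architectures.

```python
import torch
import torchvision

# Hypothetical detection step with torchvision's Faster R-CNN; the disclosure
# only names candidate architectures, so this particular choice is an assumption.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    out = detector([image])[0]   # dict with 'boxes', 'labels', 'scores'
boxes = out["boxes"][out["scores"] > 0.7]  # keep confident detections only
```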
The information estimation apparatus and method, the model retrieval apparatus and method, and the model alignment apparatus and method of the present disclosure will be described below, respectively.
Fig. 4 schematically illustrates a workflow diagram of an information estimation method 400 according to an embodiment of the disclosure.
The information estimation method 400 of fig. 4 includes steps S401 and S402. A first image is acquired in step S401; in step S402, pose and/or keypoint estimation is performed on the object in the acquired first image using a neural network.
In this embodiment, the neural network used by the information estimation method 400 at least includes a pose and/or keypoint estimation module and a domain adaptation module, and the network parameters of the neural network are obtained by training the two modules with synthetic images and real images.
The domain adaptation module serves to reduce the difference between the synthetic image domain and the real image domain, so as to minimize the effect of the difference between real and synthetic images on the trained network model.
The neural network used by the information estimation method 400 may also include an object class identification module, a keypoint localization module, and the like. The keypoint localization module estimates the locations of keypoints in the image or the model so that the image and/or the model can be aligned more accurately.
Alternatively, the information estimation method may further comprise a retrieval step, for example retrieving a target model according to the first image and a second image, wherein a first pose of the object in the first image differs from a second pose of the object in the second image.
The first image and the second image may each be a captured real image (serving as the original image) or an image generated from a real image. For example, two images of the same subject in different poses may be taken as the first image and the second image, respectively. Alternatively, the first image is a real image and the second image is generated from the first image.
The second image may comprise a plurality of images, in which the poses of the object also differ.
Specifically, in the information estimation method, image features may be obtained from the first image and the second image, and model features of each model may be obtained from the image of each model in the model library in the first pose and the image of each model in the second pose. The similarity between the image features and each model's features is calculated, and the target model closest to the original image is determined among the models according to the similarity.
Alternatively, the information estimation method may further include alignment and calibration of the image and the model.
The alignment and calibration of the image and the model may proceed in any of the following ways: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model.
Alternatively, in the information estimation method, at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object may be used to perform any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
Fig. 5 schematically shows a schematic diagram of an information estimation apparatus 500 according to an embodiment of the present disclosure.
The information estimation apparatus 500 includes an acquisition module 501 and an estimation module 502. The acquiring module 501 is configured to acquire a first image; the estimation module 502 is configured to perform pose and/or keypoint estimation on an object in the first image using a neural network.
Various features in the above-described embodiments of the information estimation method may also be applied to the embodiments of the information estimation apparatus herein, and thus are not described again.
In the embodiments of the information estimation apparatus and method, a new network structure can be constructed using deep learning to perform object pose estimation and keypoint localization simultaneously.
In addition, training the network model requires a large number of training samples, while data with keypoint and pose labels is very limited and labeling is expensive. Three-dimensional CAD models can render millions of synthetic samples that can be used to train the network model. However, synthetic data and real data differ markedly, i.e., they belong to different domains. If this difference is not attenuated, a model trained on synthetic data will be biased toward the synthetic domain, which is very unfavorable when testing on real data. Therefore, to make the model work better on real data, the present disclosure introduces a specific network structure to attenuate the difference between the synthetic and real domains. In other words, the present disclosure eliminates the difference in a supervised fashion, training on data from both the real and synthetic domains while attenuating the difference, and testing in the real domain.
Fig. 6 schematically illustrates a neural network schematic for information estimation according to an embodiment of the present disclosure.
The embodiment mainly comprises four stages of training data preparation, network structure design, network training and network application.
Preparing training data
The training data includes real data and synthetic data. The real data are images with object category, object pose, and keypoint labels. The synthetic data are synthetic images rendered from three-dimensional CAD models, likewise with object category, object pose, and keypoint labels.
Network architecture design
As shown in fig. 6, the input images of the network are all 224 × 224 with three channels (red, green, blue). The base network of the structure, Base Net in fig. 6, can take many forms; the present disclosure adopts the layers of VGG16 up to FC7 (fully connected layer 7): 13 convolutional layers and 2 fully connected layers (convolutional and fully connected layers are basic units of a convolutional neural network). Besides VGG16, networks such as AlexNet and ResNet may serve as the base network. The network has four tasks: object class identification, object pose estimation, object keypoint estimation, and synthetic-versus-real domain identification. The network performs the four tasks simultaneously, with a network module designed for each, and each task requires a loss function. The object class identification task uses a softmax loss (other losses, such as a hinge loss, may also be used). The object pose estimation task can be modeled either as a regression problem or as a classification problem: regression estimates continuous pose values, while classification estimates a pose class. The present disclosure may employ either modeling. Estimation of continuous values uses a smooth L1 loss, while estimation of a pose class uses a softmax loss (other losses, such as a hinge loss, may also be used); regression modeling is taken as the example here. For the keypoint estimation task, when setting the ground-truth values, each semantic keypoint of each object category is given its own two-dimensional channel so that the network can identify keypoints at different positions of different object categories: within a channel, the value is 1 at the keypoint's coordinate and 0 elsewhere. The keypoint estimation task uses a cross-entropy loss (cross-entropy and softmax losses are both common functions that suit different problems). The synthetic-versus-real domain identification task is a binary classification problem, but to reduce the difference between the synthetic and real domains, i.e., to keep the network from telling which domain the input image comes from, different designs can be adopted for different network structures (this domain-difference reduction can be performed in a variety of ways and network structures, and is not limited here).
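As an illustration of the keypoint ground-truth encoding just described, the following sketch builds the per-keypoint channel targets; the helper itself is hypothetical, and the 7 × 7 grid matches the Conv_K output size given below.

```python
import torch

# Sketch of the keypoint ground-truth encoding: one 2D channel per semantic
# keypoint (100 channels for 10 categories in the example), with 1 at the
# keypoint's grid coordinate and 0 elsewhere.
def keypoint_target(kp_coords, num_channels=100, grid=7):
    target = torch.zeros(num_channels, grid, grid)
    for channel, (x, y) in kp_coords.items():  # {channel index: (x, y) grid cell}
        target[channel, y, x] = 1.0
    return target
```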
One embodiment splits the network after Base Net: the output of Base Net (i.e., the output of fully connected layer 7) is connected to one fully connected layer FC_C and to another fully connected layer FC_P. FC_C corresponds to the object classes, with one node per class; taking 10 object classes as an example, FC_C is given 10 nodes. The output of FC_C feeds the softmax loss of the object class identification task (denoted L1 in fig. 6), and the chain Base Net → FC_C → softmax loss is the network module for that task. Each node of FC_P corresponds to one degree of freedom of the object pose; to estimate a six-degrees-of-freedom (or three-degrees-of-freedom) pose, FC_P needs 6 (or 3) nodes. The output of FC_P feeds the smooth L1 loss of the pose estimation task (denoted L2 in fig. 6), and the chain Base Net → FC_P → smooth L1 loss is the network module for object pose estimation. In addition, the output of pooling layer 5 (pool5) of Base Net is connected to a convolutional layer Conv_K, whose number of channels equals the total number of keypoints over all categories; taking 100 keypoints in total over 10 categories as an example, Conv_K has 100 channels. After convolution with a 3 × 3 kernel each channel is 7 × 7, so the output of Conv_K is 100 × 7 × 7 and feeds the cross-entropy loss of the keypoint estimation task (denoted L3 in fig. 6). The chain Base Net (pool5) → Conv_K → cross-entropy loss is the network module for object keypoint estimation. For the task of identifying the synthetic and real domains, to keep the network from knowing which domain the input image comes from, one implementation adds a gradient reversal layer (GRL) after Base Net, followed by FC layers (multiple fully connected layers) and then a fully connected layer FC_D. FC_D corresponds to the domains; since there are only two, FC_D is given 2 nodes, followed by a binary classifier using a softmax loss (denoted L4 in fig. 6; other losses, such as a hinge loss, may also be used). The chain Base Net → GRL → FC layers → FC_D → softmax loss is the network module that weakens the influence of the different domains. Other approaches are of course possible, such as setting up an adversarial process as in generative adversarial networks (GAN).
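The following is a minimal PyTorch sketch of this four-task structure, not the disclosure's exact implementation: the use of torchvision's VGG16, the hidden-layer sizes in the domain branch, and the single-linear-layer heads are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=10, pose_dims=6, num_keypoints=100, grl_lambda=1.0):
        super().__init__()
        vgg = vgg16()
        self.features = vgg.features  # 13 conv layers; output is pool5 (512 x 7 x 7)
        self.base_fc = nn.Sequential(*list(vgg.classifier.children())[:-1])  # FC6, FC7
        self.fc_c = nn.Linear(4096, num_classes)  # object class head
        self.fc_p = nn.Linear(4096, pose_dims)    # pose regression head
        self.conv_k = nn.Conv2d(512, num_keypoints, kernel_size=3, padding=1)  # keypoints
        self.domain_head = nn.Sequential(         # FC layers + FC_D (2 domains)
            nn.Linear(4096, 1024), nn.ReLU(), nn.Linear(1024, 2))
        self.grl_lambda = grl_lambda

    def forward(self, x):                      # x: (B, 3, 224, 224)
        pool5 = self.features(x)               # (B, 512, 7, 7)
        fc7 = self.base_fc(pool5.flatten(1))   # (B, 4096)
        cls_logits = self.fc_c(fc7)            # feeds softmax loss L1
        pose = self.fc_p(fc7)                  # feeds smooth L1 loss L2
        kp_maps = self.conv_k(pool5)           # (B, 100, 7, 7), feeds loss L3
        domain_logits = self.domain_head(      # GRL, then domain classifier, loss L4
            GradientReversal.apply(fc7, self.grl_lambda))
        return cls_logits, pose, kp_maps, domain_logits
```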
Network training and application
During training, both synthetic images and real images are input to the network. The training data is fed to the network, and the weighted sum of the loss functions of all tasks (object class recognition, object pose estimation, object keypoint estimation, and domain classification) is used as the final loss function L = a×L1 + b×L2 + c×L3 + d×L4, where a, b, c and d are weights whose values can be set according to network performance and experience, or learned automatically. Training is complete when the loss function of the network converges. The different dashed lines in fig. 6 represent the network paths traversed by the different domains, and the solid lines represent the common path.
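A single training step under this weighted loss might look as follows; the sketch assumes the MultiTaskNet above, and approximates the keypoint cross-entropy with a per-pixel binary cross-entropy, an assumption rather than the disclosure's exact formulation.

```python
import torch.nn.functional as F

# Hypothetical single training step combining the four task losses.
def training_step(model, images, cls_gt, pose_gt, kp_gt, domain_gt,
                  w=(1.0, 1.0, 1.0, 1.0)):
    cls_logits, pose, kp_maps, domain_logits = model(images)
    l1 = F.cross_entropy(cls_logits, cls_gt)                 # class recognition
    l2 = F.smooth_l1_loss(pose, pose_gt)                     # pose regression
    l3 = F.binary_cross_entropy_with_logits(kp_maps, kp_gt)  # keypoint channels
    l4 = F.cross_entropy(domain_logits, domain_gt)           # domain classifier
    return w[0] * l1 + w[1] * l2 + w[2] * l3 + w[3] * l4     # L = a*L1+b*L2+c*L3+d*L4
```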
At application time (on a computer or other terminal), a single real color image is input into the trained deep learning network designed above, which can then output the category of the object in the image (the position of the maximum in the FC_C output vector is the predicted category), its pose (the 6-dimensional vector output by FC_P is the predicted six-degrees-of-freedom pose), and its keypoint information (the Conv_K output maps are rescaled to the length and width of the original image; the coordinate of the maximum in each channel is a keypoint, and each channel corresponds to the name of one keypoint).
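The decoding just described could be sketched as follows; the exact rescaling convention from the 7 × 7 grid to image coordinates is an assumption, since the disclosure only states that the output maps are normalized to the original image's length and width.

```python
import torch

# Hypothetical decoding of the network outputs for one test image.
def decode_outputs(cls_logits, pose, kp_maps, img_w, img_h):
    category = cls_logits.argmax(dim=-1)      # position of the max in FC_C's output
    pose6dof = pose                           # FC_P's 6-dimensional pose vector
    B, K, H, W = kp_maps.shape
    flat = kp_maps.flatten(2).argmax(dim=-1)  # per-channel argmax, shape (B, K)
    ys, xs = flat // W, flat % W
    keypoints = torch.stack(                  # one (x, y) per named keypoint channel
        [xs.float() * img_w / W, ys.float() * img_h / H], dim=-1)
    return category, pose6dof, keypoints
```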
FIG. 7 schematically illustrates a workflow diagram of a model retrieval method 700 according to an embodiment of the disclosure.
In the present embodiment, the model retrieval method 700 includes steps S701 and S702. An image is acquired in step S701; in step S702, model retrieval is performed according to the image to obtain a target model matching the object in the image.
As an example, the image herein may include a first image and a second image, and a first pose of the object in the first image and a second pose in the second image are different.
The first image and the second image may each be a captured real image (serving as the original image) or an image generated from a real image. For example, two images of the same subject in different poses may be taken as the first image and the second image, respectively. Alternatively, the first image is a real image and the second image is generated from the first image.
The pose may be manually marked or estimated.
The second image may comprise a plurality of images, in which the poses of the object also differ.
Specifically, in the model retrieval method 700, image features may be obtained from the first image and the second image, model features of each model may be obtained from the image of each model in the model library in the first pose and the image of each model in the second pose, the similarity between the image features and each model's features may be calculated, and the target model closest to the original image may be determined among the models according to the similarity.
Alternatively, the model retrieval method 700 may also include alignment and calibration of the image and the model.
The alignment and calibration of the image and the model may proceed in any of the following ways: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model. The target model may be a retrieved model or a predetermined model.
Alternatively, in the model retrieval method 700, pose and/or keypoint estimation may also be performed on the object using a neural network. The neural network can comprise a pose and/or keypoint estimation module and a domain adaptation module, and its network parameters can be obtained by training the two modules with synthetic images and real images.
Alternatively, in the model retrieval method 700, at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object may be used to perform any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
Fig. 8 schematically shows a schematic diagram of a model retrieval apparatus 800 according to an embodiment of the present disclosure.
The model retrieving apparatus 800 includes an obtaining module 801 and a retrieving module 802.
The acquiring module 801 is used for acquiring images; the retrieval module 802 performs model retrieval according to the image to obtain a target model matched with an object in the image.
Various features in the above-described embodiment of the model retrieval method may also be applied to the embodiment of the model retrieval apparatus, and therefore are not described again.
In the embodiments of the model retrieval method and apparatus, because the estimated pose of the object may contain errors, the present disclosure generates an image of the object under another specific pose (not limited to one; several additional specific poses may be used) and adopts a network structure based on multi-pose feature learning, so as not to depend excessively on the estimated pose while improving the feature expressiveness, and thus retrieve the three-dimensional model closest to the object in the two-dimensional image.
Fig. 9 schematically illustrates a neural network schematic for three-dimensional model retrieval according to an embodiment of the present disclosure.
This embodiment comprises three stages: training data preparation, network structure design and training, and retrieval. Two poses are taken as the example here.
Training data preparation
The training data is a large number of triplets, each comprising a query sample, a positive sample, and a negative sample. The query sample includes a real image and an image under another specific pose generated using a GAN. The estimated pose of the real image is referred to herein as pose 1, and the other specific pose as pose 2. Each CAD model is then rendered into two synthetic images under pose 1 and pose 2. Images rendered from a CAD model similar to the object in the real image are called positive samples, and dissimilar ones negative samples. Each input training example is such a triplet.
Network architecture design and training
As shown in fig. 9, the two images of each sample in the triplet are normalized to 224 × 224 and input to a base network with the same structure as the base network in the information estimation neural network. Two fully connected layers, FC8_1 and FC8_2, are then attached, giving the image features under the two poses; the number of nodes of each fully connected layer can be set to 4096. The two features are then fused, e.g., by concatenation, by a convolutional layer, or by another network structure. The three fused features of the triplet are fed into a discriminative similarity loss, such as a triplet loss. Through this training, the network learns features that bring the query sample closer to the positive sample and further from the negative sample. In addition, the trained network parameters from the information estimation neural network can be used for initialization during training, to keep the domain difference between real and synthetic data minimal.
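A minimal sketch of this two-branch, fused-feature arrangement with a triplet loss follows; `base_net` is assumed to map an image batch to 4096-dimensional FC7 features (e.g., the Base Net above), and concatenation is used as the fusion, which is only one of the options named above.

```python
import torch
import torch.nn as nn

# Sketch of the multi-pose retrieval branch; names follow the description above.
class MultiPoseEmbedder(nn.Module):
    def __init__(self, base_net, feat_dim=4096):
        super().__init__()
        self.base = base_net                    # shared base network up to FC7
        self.fc8_1 = nn.Linear(4096, feat_dim)  # features under pose 1
        self.fc8_2 = nn.Linear(4096, feat_dim)  # features under pose 2

    def forward(self, img_pose1, img_pose2):
        f1 = self.fc8_1(self.base(img_pose1))
        f2 = self.fc8_2(self.base(img_pose2))
        return torch.cat([f1, f2], dim=-1)      # fused multi-pose feature

# Discriminative similarity loss over the three fused features of a triplet:
triplet_loss = nn.TripletMarginLoss(margin=1.0)
# loss = triplet_loss(query_feat, positive_feat, negative_feat)
```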
Three-dimensional model retrieval
At retrieval time, pose 1 of the real input image is first estimated by the information estimation network trained during object information estimation, and an image under the other specific pose 2 is generated by the GAN. The trained retrieval network then extracts the fused features of the two images as the query feature, and likewise extracts the fused features of the synthetic images of all three-dimensional models under poses 1 and 2 as the model features (any number of image features may be extracted in sequence). The similarity between the query feature and each model feature is then computed, for example as the Euclidean distance: the smaller the distance, the greater the similarity, and the model with the highest similarity is the retrieved three-dimensional model closest to the object in the image.
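Retrieval by Euclidean distance over the fused features might then be sketched as follows; the helper is illustrative only.

```python
import torch

# Hypothetical retrieval: Euclidean distance between the fused query feature and
# the fused feature of every CAD model; the smallest distance wins.
def retrieve(query_feat, model_feats):
    # query_feat: (D,); model_feats: (N, D), one row per three-dimensional model
    dists = torch.cdist(query_feat[None], model_feats)[0]  # (N,) distances
    return int(dists.argmin())                             # index of closest model
```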
FIG. 10 schematically illustrates a workflow diagram of a model alignment method 1000 according to an embodiment of the disclosure.
In this embodiment, the model alignment method 1000 includes steps S1001 and S1002. In step S1001, a first image is acquired; in step S1002, the object in the first image is aligned with a target model.
Specifically, in this alignment embodiment, the alignment and calibration of the image and the model may be accomplished in any of the following ways: aligning a target model with the object using the pose information of the object, and calibrating the alignment using the keypoint information of the object and the keypoint information of the target model; or aligning a target model with the object using the keypoint information of the object, and calibrating the alignment using the pose information of the object and of the target model.
Here, the pose information and the key point information may be estimated or manually labeled. Similarly, the target model may be retrieved from a model library or manually labeled.
Alternatively, the alignment method may also include model retrieval. Specifically, for example, image features are obtained from the first image and a second image; model features of each model are obtained from the image of each model in the model library under the first pose and under the second pose; the similarity between the image features and each model's features is calculated; and the target model closest to the object in the first image is determined among the models according to the similarity.
A first pose of the object in the first image and a second pose of the object in the at least one second image are different.
The first image and the second image may each be a captured real image (serving as the original image) or an image generated from a real image. For example, two images of the same subject in different poses may be taken as the first image and the second image, respectively. Alternatively, the first image is a real image and the second image is generated from the first image.
The second image may include a plurality of images, in which the poses of the object also differ from one another.
Alternatively, in the alignment method, pose estimation and keypoint estimation may also be performed on the object.
In particular, a neural network may be utilized for pose and/or keypoint estimation of the object. The neural network can comprise a pose and/or keypoint estimation module and a domain adaptation module, and its network parameters can be obtained by training the two modules with synthetic images and real images.
Alternatively, in the alignment method, at least one of (i) the pose information of the object and a target model, (ii) the keypoint information of the object and a target model, and (iii) a target model aligned with the object may be used to perform any of the following functions: manipulating the object; predicting the behavior of the object; drawing predefined content on the surface of the object; and updating and/or controlling the pose of a virtual object.
FIG. 11 schematically shows a schematic diagram of a model alignment apparatus 1100 according to an embodiment of the disclosure.
The model alignment apparatus 1100 includes an acquisition module 1101 and an alignment module 1102.
The acquisition module 1101 is configured to receive a first image; an alignment module 1102 is used to align an object in the first image with a target model.
Various features of the above embodiments of the model alignment method can be applied to the embodiments of the model alignment apparatus, and therefore are not described in detail.
An embodiment of the alignment of a two-dimensional image and a three-dimensional model is described below.
Based on the pose and keypoint information of the object and on the three-dimensional model, the present disclosure proposes a method of aligning a two-dimensional image and a three-dimensional model. The pose and keypoint information may be estimated or manually labeled, and the three-dimensional model may be retrieved from a model library or predetermined. To reduce errors, for example the accumulated error of the estimated object pose, the present disclosure further readjusts the alignment using keypoints. As shown in fig. 12, three degrees of freedom of the object (distance and principal point) are computed from the object's detection box and combined with the three pose degrees of freedom estimated by the network (azimuth, elevation, and in-plane rotation); a two-dimensional image of the retrieved three-dimensional model is then rendered under this pose and aligned onto the original input two-dimensional image, giving the initial alignment of the two-dimensional image and the three-dimensional model. For example, the target model and the object may be aligned using the pose information of the object, and the alignment calibrated using the keypoint information of the object and of the target model; alternatively, the target model and the object may be aligned using the keypoint information of the object, and the alignment calibrated using the pose information of the object and of the target model.
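A deliberately simplified sketch of the initial alignment and keypoint calibration follows; the distance proxy derived from the box size and the mean-offset calibration are illustrative stand-ins, not the disclosure's exact computation.

```python
import numpy as np

def initial_pose(det_box, azimuth, elevation, theta):
    """Combine the three network-estimated angles with three degrees of freedom
    derived from the detection box (principal point = box centre; distance taken
    here, as an assumption, to be inversely proportional to box size)."""
    x0, y0, x1, y1 = det_box
    u, v = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    d = 1.0 / max(x1 - x0, y1 - y0)
    return {"a": azimuth, "e": elevation, "theta": theta, "d": d, "u": u, "v": v}

def calibrate(pose, model_kps_2d, image_kps_2d):
    """Calibration step: shift the principal point by the mean 2D offset between
    the model's projected keypoints and the estimated image keypoints."""
    offset = np.mean(np.asarray(image_kps_2d) - np.asarray(model_kps_2d), axis=0)
    pose["u"] += float(offset[0])
    pose["v"] += float(offset[1])
    return pose
```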
The above-described embodiments of the present disclosure may obtain at least one of the following advantageous effects: more accurate object information estimation, more accurate three-dimensional model retrieval, and more accurate two-dimensional image and three-dimensional model alignment.
The three-dimensional shape and pose information of the object in the two-dimensional image obtained by the present disclosure can be applied to more advanced augmented reality applications, such as:
(1) providing three-dimensional information of an object to assist three-dimensional augmented reality interaction, e.g., intent prediction for a vehicle in augmented reality, or adding three-dimensional augmented reality effects based on the three-dimensional pose of a moving object (SLAM does not provide three-dimensional information of an object moving relative to the scene);
(2) rendering a three-dimensional augmented reality effect on the surface of a three-dimensional object (e.g., changing the surface drawing of a car in an augmented reality video) by ensuring correct virtual-real three-dimensional alignment.
In addition to augmented reality, the estimated three-dimensional shape and pose information of an object can be applied in many technical fields such as autonomous driving and robotics.
Fig. 13-16 schematically illustrate application embodiments according to embodiments of the present disclosure.
By way of example and not limitation, in the applications presented here, at least one of the pose information of the object and a target model, the keypoint information of the object and a target model, and a target model aligned with the object may be used, for example, to manipulate the object, predict the behavior of the object, draw predefined content on the surface of the object, or update and/or control the pose of a virtual object. The pose and keypoint information of the object may be estimated or manually labeled; the target model may be retrieved or predetermined.
Specifically, as shown in fig. 13, in an autonomous driving scenario, the estimated pose or keypoints of a vehicle and its precise three-dimensional model alignment may be applied to prediction. For example, a camera of the autonomous vehicle in fig. 13 may capture an image of a vehicle about to enter the main road from the left; from this image the pose or keypoint information of that vehicle is estimated, so that the vehicle can be precisely aligned with a three-dimensional model. For instance, the estimated precise pose of the vehicle and the retrieved target model can be used to align the vehicle with its target model and thereby obtain the vehicle's three-dimensional information; likewise, the estimated precise keypoint information of the vehicle and the retrieved target model can be used; or the three-dimensional information of the vehicle can be obtained directly from the target model aligned with it. From this information (the position and heading of the vehicle in the three-dimensional map of the forward field of view, the vehicle's size, etc.), the direction of travel and the intent of the vehicle can be estimated.
As shown in fig. 14, in a video augmented reality scene, based on the alignment of a precise three-dimensional model with a specific object in a two-dimensional image, predefined content may be rendered onto the object's surface once the object is detected. For example, using the estimated precise pose of the object and the retrieved target model, the object can be precisely aligned with its target model, which determines to which surface of the target model the content should be rendered; the predefined content is then drawn on that surface of the target model, and the rendered content is seen on the surface of the specific object without displaying the aligned target model itself. Alternatively, the content can first be overlaid on the current surface of the three-dimensional model, the model with the overlaid content aligned with the object, and the model then removed so that only the drawn content remains visible on the object's surface. The same can be achieved using the estimated keypoint information of the object and the retrieved target model, or directly through a target model aligned with the object.
The pose, keypoints, retrieved model, and aligned three-dimensional model obtained from a single image according to the present disclosure can be used to render a three-dimensional virtual object matched to a real object in augmented reality. For example, as shown in Fig. 15, in an augmented reality video the three-dimensional pose of a virtual cylinder can be updated according to the estimated pose information of a table, achieving a realistic visual effect. Suppose a virtual object, such as the virtual cylinder, is to be superimposed on a table in a real scene. Using the pose information of the table and the three-dimensional model of the table, the three-dimensional information of the table is obtained; to superimpose the bottom of the virtual cylinder flush with the table surface, the distance between the three-dimensional table and the three-dimensional virtual cylinder is calculated to determine the position of the cylinder in the real three-dimensional space; the three-dimensional cylinder is then rendered under the current pose, and the rendered cylinder is superimposed onto the original two-dimensional image. The cases where the target model is known, where a target model aligned with the table is known, or where the keypoint information of the table is used are implemented similarly to the above.
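The placement step can be sketched as follows (a toy illustration assuming the table's 3-D points are already available from its pose and target model; all names are hypothetical): the cylinder's base is set on the highest surface of the table before rendering.

```python
import numpy as np

def place_on_surface(table_points):
    """Choose a base position for the virtual cylinder so that its bottom
    sits on the table top (z is assumed to point upwards)."""
    top_height = table_points[:, 2].max()         # height of the table top
    centre_xy = table_points[:, :2].mean(axis=0)  # roughly the top's centre
    return np.array([centre_xy[0], centre_xy[1], top_height])

# Toy table whose top is at z = 0.75 m.
table = np.array([[0.0, 0.0, 0.75], [1.0, 0.0, 0.75],
                  [0.0, 0.6, 0.75], [1.0, 0.6, 0.75],
                  [0.0, 0.0, 0.00], [1.0, 0.6, 0.00]])
print(place_on_surface(table))  # pose the cylinder here, then render it
```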
As shown in Fig. 16, the estimated pose or keypoint information of a real object, together with the three-dimensional model used for alignment, may be used to control the pose of a virtual object; in this case the real object may be referred to as a three-dimensional augmented reality marker (3D AR marker). Knowing the precise pose of the real object and the target model, the pose of the virtual object can be set to coincide with the pose of the real object, so that changing the pose of the real object changes the pose of the virtual object, much like a remote control. The keypoint information of the real object, or a three-dimensional model already aligned with it, can likewise be used to control the pose of the virtual object.
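A minimal sketch of this "remote control" behaviour (hypothetical names; 4x4 homogeneous transforms assumed): the virtual object's pose is recomputed from the marker's pose each frame, optionally through a fixed offset transform.

```python
import numpy as np

def pose_matrix(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def follow_marker(T_marker, T_offset=np.eye(4)):
    """Set the virtual object's pose from the real object's (marker's) pose;
    T_offset is an optional fixed transform between the two."""
    return T_marker @ T_offset

# Rotating the real object 30 degrees about z rotates the virtual one too.
c, s = np.cos(np.radians(30)), np.sin(np.radians(30))
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
print(follow_marker(pose_matrix(R, np.array([0.0, 0.0, 1.0]))))
```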
In an application such as robotic arm manipulation, a robot may acquire an image through a camera and then, using the methods of the present disclosure, estimate the pose information of the object to be grasped in the image, or obtain a target model aligned with the object, so as to control the actions of the robotic arm, such as the grasping position and hand configuration, to complete the grasp. For example, from the pose of the object to be grasped and the target model, three-dimensional information such as the position, size, and pose of the object can be obtained, from which the reach distance and path of the arm and the opening width of the fingers can be set before grasping. Similar operations can be realized by obtaining the keypoint information of the object together with the target model, or directly from a target model aligned with the object.
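A toy sketch of deriving such grasp parameters (the heuristic, the clearance value, and all names are assumptions for illustration, not the disclosed control method): from the aligned model's 3-D points, approach the centroid from above and open the gripper to the object's smallest horizontal extent plus a clearance.

```python
import numpy as np

def grasp_parameters(object_points, clearance=0.02):
    """Derive an approach point and a gripper opening from the 3-D points
    of the target model aligned with the object to be grasped."""
    centroid = object_points.mean(axis=0)
    extents = object_points.max(axis=0) - object_points.min(axis=0)
    opening = min(extents[0], extents[1]) + clearance   # gripper width
    approach = centroid + np.array([0.0, 0.0, extents[2] / 2 + clearance])
    return approach, opening

# Toy object: a small box about 6 x 10 x 12 cm.
box = np.random.default_rng(0).uniform([-0.03, -0.05, 0.0],
                                       [0.03, 0.05, 0.12], (200, 3))
approach, opening = grasp_parameters(box)
print(approach, opening)  # where to move the hand and how wide to open it
```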
The present disclosure also provides an electronic device comprising a processor and a memory, the memory storing computer-executable code which, when executed by the processor, performs any of the methods described in the embodiments of the present disclosure.
The present disclosure also provides a computer-readable medium storing instructions which, when executed by a processor, perform any of the methods described in the foregoing embodiments of the present disclosure.
A "computer-readable medium" herein should be taken to include any medium or combination of media that is capable of storing instructions for execution by a computer, a device that is capable of temporarily or permanently storing instructions and data, and may include, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), cache memory, flash memory, optical media, magnetic media, cache memory, other types of memory (e.g., erasable programmable read-only memory (EEPROM)), and/or any suitable combination thereof. "computer-readable medium" may refer to a single storage apparatus or device and/or a "cloud-based" storage system or storage network that includes multiple storage apparatuses or devices.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
Each block in the flowchart or block diagrams of the embodiments of the present disclosure may represent a hardware module, a program segment, or a portion of code, which may include one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the flowchart and block diagrams may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (22)

1. An information estimation method, comprising:
acquiring a first image; and
performing pose and/or keypoint estimation on an object in the first image using a neural network.
2. The method of claim 1, wherein the neural network comprises a pose and/or keypoint estimation module and a domain adaptation module, and the network parameters of the neural network are trained using synthetic images and real images for the pose and/or keypoint estimation module and the domain adaptation module.
3. The method of any one of claims 1-2, further comprising: retrieving a target model according to the first image and at least one second image, wherein a first pose of the object in the first image is different from a second pose of the object in the at least one second image.
4. The method of claim 3, further comprising: acquiring an image feature according to the first image and the at least one second image, acquiring a model feature of each model according to an image of the model in the first pose and an image of the model in the second pose, calculating a similarity between the image feature and the model feature of each model, and determining the target model among the models according to the similarities.
5. The method of any of claims 1-4, further comprising one of:
aligning a target model with the object by using the pose information of the object, and calibrating the alignment by using the keypoint information of the object and the keypoint information of the target model; and
aligning a target model with the object by using the keypoint information of the object, and calibrating the alignment by using the pose information of the object and the target model.
6. The method of any of claims 1-5, further comprising: utilizing at least one of pose information and a target model of the object, keypoint information and a target model of the object, and a target model aligned with the object to perform any of the following functions:
manipulating the object;
predicting a behavior of the object;
drawing predefined content on the surface of the object;
updating and/or controlling the pose of a virtual object.
7. A model retrieval method, comprising:
acquiring an image; and
performing model retrieval according to the image to obtain a target model matching the object in the image.
8. The method of claim 7, wherein the images include a first image and at least one second image, a first pose of the object in the first image and a second pose of the object in the at least one second image being different.
9. The method of claim 8, further comprising: acquiring an image feature according to the first image and the at least one second image, acquiring a model feature of each model according to an image of the model in the first pose and an image of the model in the second pose, calculating a similarity between the image feature and the model feature of each model, and determining the target model among the models according to the similarities.
10. The method according to any one of claims 7-9, further comprising one of:
aligning a target model with the object by using the pose information of the object, and calibrating the alignment by using the keypoint information of the object and the keypoint information of the target model; and
aligning a target model with the object by using the keypoint information of the object, and calibrating the alignment by using the pose information of the object and the target model.
11. The method according to any one of claims 7-10, further comprising pose and/or keypoint estimation of the object using a neural network, wherein the neural network comprises a pose and/or keypoint estimation module and a domain adaptation module, and wherein network parameters of the neural network are trained using synthetic images and real images for the pose and/or keypoint estimation module and the domain adaptation module.
12. The method according to any one of claims 7-11, further comprising utilizing at least one of pose information and a target model of the object, keypoint information and a target model of the object, and a target model aligned with the object to perform any of the following functions:
manipulating the object;
predicting a behavior of the object;
drawing predefined content on the surface of the object;
updating and/or controlling the pose of a virtual object.
13. A method of model alignment, comprising:
acquiring a first image; and
aligning an object in the first image with a target model.
14. The method of claim 13, further comprising one of:
aligning a target model with the object by using the pose information of the object, and calibrating the alignment by using the keypoint information of the object and the keypoint information of the target model; and
aligning a target model with the object by using the keypoint information of the object, and calibrating the alignment by using the pose information of the object and the target model.
15. The method of claim 13 or 14, further comprising: acquiring an image feature according to the first image and at least one second image, acquiring a model feature of each model according to an image of the model in the first pose and an image of the model in the second pose, calculating a similarity between the image feature and the model feature of each model, and determining the target model among the models according to the similarities, wherein a first pose of the object in the first image is different from a second pose of the object in the at least one second image.
16. The method of any one of claims 13-15, further comprising pose and/or keypoint estimation of the object using a neural network, wherein the neural network comprises a pose and/or keypoint estimation module and a domain adaptation module, and wherein network parameters of the neural network are trained using synthetic images and real images for the pose and/or keypoint estimation module and the domain adaptation module.
17. The method according to any one of claims 13-16, further comprising utilizing at least one of pose information and a target model of the object, keypoint information and a target model of the object, and a target model aligned with the object to perform any of the following functions:
manipulating the object;
predicting a behavior of the object;
drawing predefined content on the surface of the object;
updating and/or controlling the pose of a virtual object.
18. An information estimation apparatus comprising:
an acquisition module for acquiring a first image; and
an estimation module for performing pose and/or keypoint estimation on an object in the first image using a neural network.
19. A model retrieval apparatus comprising:
an acquisition module for acquiring an image; and
a retrieval module for performing model retrieval according to the image to obtain a target model matching the object in the image.
20. A model alignment apparatus comprising:
an acquisition module for acquiring a first image; and
an alignment module for aligning an object in the first image with a target model.
21. An electronic device comprising a processor and a memory, the memory storing computer-executable code which, when executed by the processor, performs the method of any one of claims 1-17.
22. A computer-readable medium having stored thereon computer-executable instructions that, when executed by a processor, perform the method of any one of claims 1-17.
CN201811359461.2A 2018-11-15 2018-11-15 Information estimation, model retrieval and model alignment methods and apparatus Pending CN111191492A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201811359461.2A CN111191492A (en) 2018-11-15 2018-11-15 Information estimation, model retrieval and model alignment methods and apparatus
KR1020190087023A KR102608473B1 (en) 2018-11-15 2019-07-18 Method and apparatus for aligning 3d model
US16/674,139 US11295532B2 (en) 2018-11-15 2019-11-05 Method and apparatus for aligning 3D model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811359461.2A CN111191492A (en) 2018-11-15 2018-11-15 Information estimation, model retrieval and model alignment methods and apparatus

Publications (1)

Publication Number Publication Date
CN111191492A true CN111191492A (en) 2020-05-22

Family

ID=70707035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811359461.2A Pending CN111191492A (en) 2018-11-15 2018-11-15 Information estimation, model retrieval and model alignment methods and apparatus

Country Status (2)

Country Link
KR (1) KR102608473B1 (en)
CN (1) CN111191492A (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102478980B1 (en) * 2020-12-07 2022-12-19 주식회사 플라잎 3d contrastive learning apparatus and method for unsupervised 6d pose estimation
KR102624242B1 (en) * 2023-07-07 2024-01-15 주식회사 아임토리 System for providing two dimensional and three dimensional matching service for robot


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102286571B1 (en) * 2016-03-17 2021-08-06 한국전자통신연구원 Method for recognizing plural object in image
KR20180032400A (en) * 2016-09-22 2018-03-30 한국전자통신연구원 multiple object tracking apparatus based Object information of multiple camera and method therefor
KR20180060784A (en) * 2016-11-29 2018-06-07 삼성전자주식회사 Method and apparatus for determining abnormal object
KR101906431B1 (en) * 2017-07-20 2018-10-11 네이버랩스 주식회사 Method and system for 3d modeling based on 2d image recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101536040A (en) * 2006-11-17 2009-09-16 汤姆森许可贸易公司 System and method for model fitting and registration of objects for 2D-to-3D conversion
US20170243352A1 (en) * 2016-02-18 2017-08-24 Intel Corporation 3-dimensional scene analysis for augmented reality operations
CN106355153A (en) * 2016-08-31 2017-01-25 上海新镜科技有限公司 Virtual object display method, device and system based on augmented reality
WO2018185104A1 (en) * 2017-04-06 2018-10-11 B<>Com Method for estimating pose, associated device, system and computer program
CN107992532A (en) * 2017-11-23 2018-05-04 浙江工业大学 Based on the method for searching three-dimension model for rendering image angle architectural feature
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAOLI HUANG et al.: "A Coarse-Fine Network for Keypoint Localization", 2017 IEEE International Conference on Computer Vision, pages 3047-3049 *
WANG Yuesong (王跃嵩): "Multi-view stereo matching three-dimensional reconstruction method", China Master's Theses Full-text Database, Information Science and Technology, no. 06, pages 138-1196 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240012A (en) * 2021-05-14 2021-08-10 天津大学 Two-dimensional image-based unsupervised multi-view three-dimensional target retrieval method and device
CN113449823A (en) * 2021-08-31 2021-09-28 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN113449823B (en) * 2021-08-31 2021-11-19 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN114998277A (en) * 2022-06-16 2022-09-02 吉林大学 Grab point identification method and device, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
KR20200056905A (en) 2020-05-25
KR102608473B1 (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110411441B (en) System and method for multi-modal mapping and localization
CN111191492A (en) Information estimation, model retrieval and model alignment methods and apparatus
Bao et al. Monofenet: Monocular 3d object detection with feature enhancement networks
Fan et al. Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation
Cavallari et al. Let's take this online: Adapting scene coordinate regression network predictions for online RGB-D camera relocalisation
CN107967457A (en) A kind of place identification for adapting to visual signature change and relative positioning method and system
CN110310333A (en) Localization method and electronic equipment, readable storage medium storing program for executing
CN112037142B (en) Image denoising method, device, computer and readable storage medium
CN110189373A (en) A kind of fast relocation method and device of view-based access control model semantic information
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
Wen et al. Hybrid semi-dense 3D semantic-topological mapping from stereo visual-inertial odometry SLAM with loop closure detection
Son et al. A multi-vision sensor-based fast localization system with image matching for challenging outdoor environments
CN111582232A (en) SLAM method based on pixel-level semantic information
Morrison et al. Scalable multirobot localization and mapping with relative maps: Introducing MOARSLAM
KR20190088379A (en) Pose estimating method, method of displaying virtual object using estimated pose and apparatuses performing the same
CN114742112A (en) Object association method and device and electronic equipment
CN111489394A (en) Object posture estimation model training method, system, device and medium
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN114792401A (en) Training method, device and equipment of behavior recognition model and storage medium
CN117115917A (en) Teacher behavior recognition method, device and medium based on multi-modal feature fusion
CN111460854A (en) Remote target detection method, device and system
Wong et al. Single camera vehicle localization using feature scale tracklets
CN114612545A (en) Image analysis method and training method, device, equipment and medium of related model
JP5975484B2 (en) Image processing device
CN112614166A (en) Point cloud matching method and device based on CNN-KNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination