CN112598718A - Unsupervised multi-view multi-mode intelligent glasses image registration method and device - Google Patents

Info

Publication number
CN112598718A
CN112598718A (application CN202011632743.2A)
Authority
CN
China
Prior art keywords
network
image
registration
feature extraction
registered
Prior art date
Legal status
Granted
Application number
CN202011632743.2A
Other languages
Chinese (zh)
Other versions
CN112598718B (en)
Inventor
王成
高启宇
李一鸣
俞益洲
乔昕
Current Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd and Shenzhen Deepwise Bolian Technology Co Ltd
Priority to CN202011632743.2A
Publication of CN112598718A
Application granted
Publication of CN112598718B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/30: Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33: Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an unsupervised multi-view multi-modal intelligent glasses image registration method, which comprises the following steps: alternately training an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network, and training the high-dimensional feature similarity discrimination network with same-modality and/or different-modality images, at the same and/or different view angles, whose features are extracted by the auxiliary feature extraction network; fixing the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network, and training a registration network, wherein the inputs of the registration network are the image to be registered, the template image, and the difference image of the two, the high-dimensional feature vector output by the auxiliary feature extraction network is inserted at the middle layer, and the output is a dense registration field; and predicting the image to be registered with the registration network and the auxiliary feature extraction network to obtain the registered image.

Description

Unsupervised multi-view multi-mode intelligent glasses image registration method and device
Technical Field
The invention relates to the field of computers, in particular to an unsupervised multi-view multi-mode intelligent glasses image registration method and device.
Background
Acquiring images of the environment is an important prerequisite for obstacle avoidance. Various image-acquisition devices exist, such as RGB cameras, structured-light cameras, and TOF cameras; each has its own advantages, disadvantages, and applicable scenes, and a single camera alone often cannot adapt to a real scene, so multiple cameras are usually combined in obstacle-avoidance scenarios. Because the cameras differ in mounting position, imaging time, and imaging form, their imaging results differ, which makes the fusion of multi-view multi-modal image information an important technical link in the obstacle-avoidance task.
Disclosure of Invention
The present invention aims to provide an unsupervised multi-view multi-modal smart glasses image registration method and apparatus that overcome, or at least partially solve, the above-mentioned problems.
To achieve this aim, the technical solution of the invention is realized as follows:
one aspect of the present invention provides an unsupervised multi-view multi-modal smart glasses image registration method, including: alternately training an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network, and training the high-dimensional feature similarity discrimination network by using the same-mode images and/or different-mode images extracted by the auxiliary feature extraction network and the images at the same view angle and/or different view angles; the input images of the auxiliary feature extraction network are template images and images to be registered, and output are high-dimensional feature vectors; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, input data is auxiliary features to extract high-dimensional features output by the network, and the output is a numerical value from 0 to 1; fixing a high-dimensional feature similarity discrimination network and an auxiliary feature extraction network, and training a registration network, wherein the input of the registration network is an image to be registered, a template image and a difference image of the image and the template image, the high output of the auxiliary feature extraction network is inserted in the middle layer as a feature vector, and the output is a dense registration field; and predicting the image to be registered by adopting a registration network and an auxiliary feature extraction network to obtain the registered image.
Wherein training the registration network comprises: calculating the similarity between the registered image and the template image as a similarity loss function, adding a regularization loss function of the registration field, and training the registration network.
The high-dimensional feature similarity distinguishing network adopts a resnet classification network architecture, the auxiliary feature extraction network adopts a resnet classification network architecture, and the registration network adopts a UNet structure.
The auxiliary feature extraction network splices two images of size H × W × L into an image pair of size H × W × L × 2; the registration network splices the two images of size H × W × L and their difference image into an image pair of size H × W × L × 3.
Predicting the image to be registered with the registration network and the auxiliary feature extraction network to obtain the registered image comprises: combining the image to be registered and the template image, both of size H × W × L, with their difference image into an image pair of size H × W × L × 3, and inputting the pair into the registration network and the auxiliary feature extraction network; obtaining abstract features through the down-sampling path, combining them with the high-dimensional features extracted by the auxiliary feature extraction network, and feeding both into the up-sampling path; and outputting a dense registration field of size H × W × L × 3, which is used to sample the original image to obtain the registered image.
In another aspect, the present invention provides an unsupervised multi-view multi-modal smart glasses image registration apparatus, including: a first training module, used for alternately training the auxiliary feature extraction network and the high-dimensional feature similarity discrimination network, and for training the high-dimensional feature similarity discrimination network with same-modality and/or different-modality images, at the same and/or different view angles, whose features are extracted by the auxiliary feature extraction network; the input images of the auxiliary feature extraction network are the template image and the image to be registered, and the output is a high-dimensional feature vector; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, its input data are the high-dimensional features output by the auxiliary feature extraction network, and its output is a value between 0 and 1; a second training module, used for fixing the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network and training the registration network, wherein the inputs of the registration network are the image to be registered, the template image, and the difference image of the two, the high-dimensional feature vector output by the auxiliary feature extraction network is inserted at the middle layer, and the output is a dense registration field; and a prediction module, used for predicting the image to be registered with the registration network and the auxiliary feature extraction network to obtain the registered image.
Wherein the second training module trains the registration network as follows: the second training module is specifically used for calculating the similarity between the registered image and the template image as a similarity loss function, adding a regularization loss function of the registration field, and training the registration network.
The high-dimensional feature similarity distinguishing network adopts a resnet classification network architecture, the auxiliary feature extraction network adopts a resnet classification network architecture, and the registration network adopts a UNet structure.
The auxiliary feature extraction network splices two images of size H × W × L into an image pair of size H × W × L × 2; the registration network splices the two images of size H × W × L and their difference image into an image pair of size H × W × L × 3.
The prediction module predicts the image to be registered with the registration network and the auxiliary feature extraction network as follows: the prediction module is specifically used for combining the image to be registered and the template image, both of size H × W × L, with their difference image into an image pair of size H × W × L × 3, and inputting the pair into the registration network and the auxiliary feature extraction network; obtaining abstract features through the down-sampling path, combining them with the high-dimensional features extracted by the auxiliary feature extraction network, and feeding both into the up-sampling path; and outputting a dense registration field of size H × W × L × 3, which is used to sample the original image to obtain the registered image.
Therefore, the unsupervised multi-view multi-modal intelligent glasses image registration method and device provided by the invention can register multi-view multi-modal images, facilitating subsequent tasks such as depth estimation, detection, and segmentation.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an unsupervised multi-view multi-modal smart glasses image registration method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a network structure in an unsupervised multi-view multi-modal smart glasses image registration method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an unsupervised multi-view multi-modal smart glasses image registration apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a flowchart of the unsupervised multi-view multi-modal smart glasses image registration method provided by an embodiment of the present invention. The method is described below with reference to fig. 1 and fig. 2, and comprises the following steps:
s1, alternately training an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network, and training the high-dimensional feature similarity discrimination network by using the same-mode images and/or different-mode images extracted by the auxiliary feature extraction network and the same-view and/or different-view images; the input images of the auxiliary feature extraction network are template images and images to be registered, and output are high-dimensional feature vectors; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, input data are auxiliary features, high-dimensional features output by the network are extracted, and the output is a numerical value from 0 to 1.
Specifically, the method firstly trains an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network.
After the auxiliary feature extraction network has been trained, the high-dimensional feature similarity discrimination network judges the similarity of the input images from the outputs computed for inputs of different modalities and view angles; the training goal is that the auxiliary feature extraction network can extract similar high-dimensional features from input images of different modalities and view angles.
As an optional implementation of the embodiment of the present invention, the auxiliary feature extraction network adopts a resnet classification network architecture; its input images are the template image and the image to be registered, and its output is a high-dimensional feature vector.
The invention trains the high-dimensional feature similarity discrimination network with same-modality/different-modality and same-view/different-view images whose features are extracted by the auxiliary feature extraction network, so that the network output is close to 0 for inputs from different modalities and different view angles, and close to 1 for inputs from the same modality and different view angles, where 0 indicates that the two images are dissimilar and 1 indicates that they are similar.
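As a toy sketch of the labeling targets described above (the helper function and its `(modality, view)` tuples are hypothetical illustrations, not code from the patent), a pair from the same modality gets target 1 (similar) and a cross-modality pair gets target 0 (dissimilar):

```python
def discriminator_target(input_a, input_b):
    """Hypothetical helper: each argument is a (modality, view) tuple.

    Returns 1 when the two inputs share a modality (similar) and 0
    otherwise (dissimilar), matching the training targets above.
    """
    (modality_a, _view_a), (modality_b, _view_b) = input_a, input_b
    return 1 if modality_a == modality_b else 0

# Same modality at different views -> 1; different modalities -> 0.
targets = [
    discriminator_target(("rgb", 0), ("rgb", 1)),
    discriminator_target(("rgb", 0), ("tof", 1)),
]
print(targets)  # [1, 0]
```

In an actual alternating-training loop these scalar targets would supervise the discrimination network's 0-to-1 output while the feature extractor learns to make cross-modality features indistinguishable.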
As an optional implementation of the embodiment of the present invention, the basic structure of the high-dimensional feature similarity discrimination network is a convolutional neural network with a resnet classification architecture. The input data is an image pair: two images of size H × W × L are spliced into a pair of size H × W × L × 2. Features are extracted from the input through multiple convolution layers, and finally a single number between 0 and 1 is output as a probability, where 1 indicates similar and 0 indicates dissimilar.
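The splicing of two volumes into the H × W × L × 2 input pair can be sketched with NumPy (toy sizes and array names are illustrative, not from the patent):

```python
import numpy as np

H, W, L = 4, 4, 4  # toy dimensions for illustration
template = np.zeros((H, W, L), dtype=np.float32)      # template image
to_register = np.ones((H, W, L), dtype=np.float32)    # image to be registered

# Two images of size H x W x L are spliced along a new last axis,
# giving the H x W x L x 2 input pair described above.
pair = np.stack([template, to_register], axis=-1)
print(pair.shape)  # (4, 4, 4, 2)
```

The same `np.stack` pattern with a third channel yields the H × W × L × 3 input of the registration network.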
And S2, fixing a high-dimensional feature similarity discrimination network and an auxiliary feature extraction network, and training a registration network, wherein the input of the registration network is the image to be registered, the template image and a difference image of the image and the template image, the high-dimensional feature vector output by the auxiliary feature extraction network is inserted in the middle layer, and the output is a dense registration field.
Specifically, the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network are fixed, and the registration network is trained. The inputs of the network are the image to be registered, the template image, and their difference image; the high-dimensional features output by the auxiliary feature extraction network are inserted at the middle layer, and the network finally outputs the predicted registration field. The template image serves as the registration reference, and the difference image describes the degree of difference between the image to be registered and the template image.
As an optional implementation of the embodiment of the present invention, the registration network has a UNet structure comprising a down-sampling path and an up-sampling path, with the middle layer being the output of the down-sampling path and the input of the up-sampling path. The inputs of the registration network are the image to be registered, the template image, and their difference image, each of size H × W × L; they are combined into an input of size H × W × L × 3 and fed into the registration network and the auxiliary feature extraction network. Abstract features are first obtained through the down-sampling path, then combined with the high-dimensional features extracted by the auxiliary feature extraction network and passed through the up-sampling path, and finally a dense registration field of size H × W × L × 3 is output.
And S3, predicting the image to be registered with the registration network and the auxiliary feature extraction network to obtain the registered image.
As an optional implementation of the embodiment of the present invention, predicting the image to be registered with the registration network and the auxiliary feature extraction network to obtain the registered image includes: combining the image to be registered and the template image, both of size H × W × L, with their difference image into an image pair of size H × W × L × 3, and inputting the pair into the registration network and the auxiliary feature extraction network; obtaining abstract features through the down-sampling path, combining them with the high-dimensional features extracted by the auxiliary feature extraction network, and feeding both into the up-sampling path; and outputting a dense registration field of size H × W × L × 3, which is used to sample the original image to obtain the registered image.
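The final sampling step can be sketched as a nearest-neighbour lookup in pure NumPy (an illustrative simplification; a real implementation would interpolate rather than round): the dense field stores a displacement per voxel, and the registered image takes the moving image at x plus the displacement at x.

```python
import numpy as np

def warp(moving, phi):
    """Sample `moving` at x + phi(x) with nearest-neighbour rounding.

    `phi` is a dense registration field of shape H x W x L x 3 holding a
    displacement vector for every voxel.
    """
    H, W, L = moving.shape
    grid = np.stack(
        np.meshgrid(np.arange(H), np.arange(W), np.arange(L), indexing="ij"),
        axis=-1,
    )
    coords = np.rint(grid + phi).astype(int)
    for axis, size in enumerate((H, W, L)):
        # clip so displaced coordinates stay inside the volume
        coords[..., axis] = np.clip(coords[..., axis], 0, size - 1)
    return moving[coords[..., 0], coords[..., 1], coords[..., 2]]

# A zero displacement field leaves the moving image unchanged.
M = np.arange(8, dtype=np.float32).reshape(2, 2, 2)
identity = np.zeros((2, 2, 2, 3))
print(np.array_equal(warp(M, identity), M))  # True
```

The function names and the rounding scheme are assumptions for illustration; the patent only specifies that the registration field is used to sample the original image.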
As an optional implementation of the embodiment of the present invention, training the registration network includes: calculating the similarity between the registered image and the template image as a similarity loss function, adding a regularization loss function of the registration field, and training the registration network.
Specifically, the similarity between the registered image and the template image is calculated as a similarity loss function, and a regularization loss function on the registration field is added, so that the two jointly guide the training of the registration network. Denote the template image by F, the image to be registered by M, and the registration field obtained by the network by φ; the registered image is then M∘φ. The added loss function comprises:
a similarity loss function ‖F − M∘φ‖², which measures the gray-level difference between the template image and the registered image as the similarity metric; and
a regularization loss function α·|φ| + β·‖∇φ‖², where |φ| is an absolute-value term that limits the amplitude of the deformation field and plays a regularizing role, α is the weighting coefficient of that term, ∇φ is the first-order differential of the registration field used to constrain its smoothness, and β is the weighting coefficient of that term.
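The combined loss can be sketched numerically with NumPy (the weights and the per-voxel averaging are illustrative assumptions; `M_warped` stands for the registered image, i.e. M warped by the field):

```python
import numpy as np

def registration_loss(F, M_warped, phi, alpha=0.01, beta=0.01):
    """Similarity term plus the two regularization terms described above.

    `F` is the template image, `M_warped` the registered image, and `phi`
    the dense registration field; alpha and beta are illustrative weights.
    """
    similarity = np.mean((F - M_warped) ** 2)    # gray-level difference
    amplitude = np.mean(np.abs(phi))             # limits deformation magnitude
    grads = np.gradient(phi, axis=(0, 1, 2))     # first-order differential
    smoothness = sum(np.mean(g ** 2) for g in grads)
    return similarity + alpha * amplitude + beta * smoothness

# A perfect registration with a zero field incurs zero loss.
F = np.ones((2, 2, 2))
loss = registration_loss(F, F.copy(), np.zeros((2, 2, 2, 3)))
print(loss)  # 0.0
```

Any gray-level mismatch, large displacement, or non-smooth field increases the loss, which is the behaviour the three terms above are meant to enforce.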
Therefore, compared with prior-art approaches, the unsupervised multi-view multi-modal intelligent glasses image registration method provided by the embodiment of the invention removes the cycle-gan module and no longer restores high-dimensional abstract features to a 3-dimensional image. A feature extraction network abstracts high-dimensional features, a neural network judges the similarity of the high-dimensional features of multi-view multi-modal images and supervises the feature extraction network to extract similar features from them, which facilitates the subsequent registration.
The invention reduces the complexity of the algorithm and simplifies the training process into three steps: training the auxiliary feature extraction network, training the high-dimensional feature similarity discrimination network, and training the registration network. The high-dimensional features serve as supplementary information for the registration network, and the degree of coupling is low.
The unsupervised multi-view multi-modal intelligent glasses image registration method provided by the invention constitutes an end-to-end deep learning registration scheme: given a template image and an image to be registered as input, it outputs the registration result. It can process multi-modal images, registering images from different cameras (RGB, TOF, structured-light, and the like) together. Because the algorithm is unsupervised, no data annotation is needed, greatly saving time and cost. Compared with existing methods, the cycle-gan module is removed, improving the stability of the algorithm, and the model has few long skip connections, making it lightweight and convenient to deploy on small devices.
Fig. 3 is a schematic structural diagram of an unsupervised multi-view multi-modal smart glasses image registration apparatus provided in an embodiment of the present invention, to which the above method is applied. Only the structure of the apparatus is briefly explained below; for other matters, refer to the related description of the method above. Referring to fig. 3, the apparatus includes:
the first training module, used for alternately training the auxiliary feature extraction network and the high-dimensional feature similarity discrimination network, and for training the high-dimensional feature similarity discrimination network with same-modality and/or different-modality images, at the same and/or different view angles, whose features are extracted by the auxiliary feature extraction network; the input images of the auxiliary feature extraction network are the template image and the image to be registered, and the output is a high-dimensional feature vector; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, its input data are the high-dimensional features output by the auxiliary feature extraction network, and its output is a value between 0 and 1;
the second training module, used for fixing the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network and training the registration network, wherein the inputs of the registration network are the image to be registered, the template image, and their difference image; the high-dimensional feature vector output by the auxiliary feature extraction network is inserted at the middle layer, and the output is a dense registration field;
and the prediction module is used for predicting the image to be registered by adopting the registration network and the auxiliary feature extraction network to obtain the registered image.
As an optional implementation of the embodiment of the present invention, the second training module trains the registration network as follows: the second training module is specifically used for calculating the similarity between the registered image and the template image as a similarity loss function, adding a regularization loss function of the registration field, and training the registration network.
As an optional implementation manner of the embodiment of the present invention, the high-dimensional feature similarity discrimination network adopts a resnet classification network architecture, the auxiliary feature extraction network adopts a resnet classification network architecture, and the registration network adopts a UNet structure.
As an optional implementation of the embodiment of the present invention, the auxiliary feature extraction network splices two images of size H × W × L into an image pair of size H × W × L × 2; the registration network splices the two images of size H × W × L and their difference image into an image pair of size H × W × L × 3.
As an optional implementation of the embodiment of the present invention, the prediction module predicts the image to be registered with the registration network and the auxiliary feature extraction network as follows: the prediction module is specifically used for combining the image to be registered and the template image, both of size H × W × L, with their difference image into an image pair of size H × W × L × 3, and inputting the pair into the registration network and the auxiliary feature extraction network; obtaining abstract features through the down-sampling path, combining them with the high-dimensional features extracted by the auxiliary feature extraction network, and feeding both into the up-sampling path; and outputting a dense registration field of size H × W × L × 3, which is used to sample the original image to obtain the registered image.
Therefore, compared with prior-art approaches, the unsupervised multi-view multi-modal intelligent glasses image registration device provided by the embodiment of the invention removes the cycle-gan module and no longer restores high-dimensional abstract features to a 3-dimensional image. A feature extraction network abstracts high-dimensional features, a neural network judges the similarity of the high-dimensional features of multi-view multi-modal images and supervises the feature extraction network to extract similar features from them, which facilitates the subsequent registration.
The invention reduces the complexity of the algorithm and simplifies the training process into three steps: training the auxiliary feature extraction network, training the high-dimensional feature similarity discrimination network, and training the registration network. The high-dimensional features serve as supplementary information for the registration network, and the degree of coupling is low.
The unsupervised multi-view multi-modal intelligent glasses image registration device provided by the invention constitutes an end-to-end deep learning registration scheme: given a template image and an image to be registered as input, it outputs the registration result. It can process multi-modal images, registering images from different cameras (RGB, TOF, structured-light, and the like) together. Because the algorithm is unsupervised, no data annotation is needed, greatly saving time and cost. Compared with existing methods, the cycle-gan module is removed, improving the stability of the algorithm, and the model has few long skip connections, making it lightweight and convenient to deploy on small devices.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An unsupervised multi-view multi-modal smart eyewear image registration method, comprising:
alternately training an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network, and training the high-dimensional feature similarity discrimination network with features extracted by the auxiliary feature extraction network from same-modality and/or different-modality images at the same and/or different view angles; the input images of the auxiliary feature extraction network are the template image and the image to be registered, and the output is a high-dimensional feature vector; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, its input data are the high-dimensional features output by the auxiliary feature extraction network, and its output is a value from 0 to 1;
fixing the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network, and training a registration network, wherein the input of the registration network is the image to be registered, the template image and the difference image of the two, the high-dimensional feature vector output by the auxiliary feature extraction network is inserted at the middle layer, and the output is a dense registration field;
and predicting the image to be registered by adopting the registration network and the auxiliary feature extraction network to obtain the registered image.
2. The method of claim 1, wherein training the registration network comprises:
calculating the similarity between the registered image and the template image as a similarity loss function, adding a regularization loss function of the registration field, and training the registration network.
3. The method according to claim 1, wherein the high-dimensional feature similarity discrimination network adopts a ResNet classification network architecture, the auxiliary feature extraction network adopts a ResNet classification network architecture, and the registration network adopts a UNet structure.
4. The method of claim 1, wherein the auxiliary feature extraction network tiles two images of size H×W×L into an image pair of size H×W×L×2;
the registration network tiles two images of size H×W×L and their difference image into an image pair of size H×W×L×3.
5. The method according to claim 1, wherein the predicting the image to be registered by using the registration network and the auxiliary feature extraction network to obtain the registered image comprises:
combining the image to be registered and the template image, both of size H×W×L, with the difference image of the two to form an image pair of size H×W×L×3, and inputting the image pair into the registration network and the auxiliary feature extraction network;
obtaining abstract features through a down-sampling path, combining the abstract features with the high-dimensional features extracted by the auxiliary feature extraction network, and jointly entering the up-sampling path;
and outputting a dense registration field of size H×W×L×3, and sampling the original image with the registration field to obtain the registered image.
6. An unsupervised multi-view multi-modal smart eyewear image registration apparatus, comprising:
the first training module is used for alternately training an auxiliary feature extraction network and a high-dimensional feature similarity discrimination network, and training the high-dimensional feature similarity discrimination network with features extracted by the auxiliary feature extraction network from same-modality and/or different-modality images at the same and/or different view angles; the input images of the auxiliary feature extraction network are the template image and the image to be registered, and the output is a high-dimensional feature vector; the high-dimensional feature similarity discrimination network adopts a convolutional neural network, its input data are the high-dimensional features output by the auxiliary feature extraction network, and its output is a value from 0 to 1;
the second training module is used for fixing the high-dimensional feature similarity discrimination network and the auxiliary feature extraction network, and training a registration network, wherein the input of the registration network is the image to be registered, the template image and the difference image of the two, the high-dimensional feature vector output by the auxiliary feature extraction network is inserted at the middle layer, and the output is a dense registration field;
and the prediction module is used for predicting the image to be registered by adopting the registration network and the auxiliary feature extraction network to obtain the registered image.
7. The apparatus of claim 6, wherein the second training module trains the registration network by:
the second training module is specifically configured to calculate the similarity between the registered image and the template image as a similarity loss function, add a regularization loss function of the registration field, and train the registration network.
8. The apparatus according to claim 6, wherein the high-dimensional feature similarity discrimination network adopts a ResNet classification network architecture, the auxiliary feature extraction network adopts a ResNet classification network architecture, and the registration network adopts a UNet structure.
9. The apparatus of claim 6, wherein the auxiliary feature extraction network tiles two images of size H×W×L into an image pair of size H×W×L×2;
the registration network tiles two images of size H×W×L and their difference image into an image pair of size H×W×L×3.
10. The apparatus according to claim 6, wherein the prediction module predicts the image to be registered by using the registration network and the auxiliary feature extraction network to obtain the registered image by:
the prediction module is specifically configured to combine the image to be registered and the template image, both of size H×W×L, with the difference image of the two to form an image pair of size H×W×L×3, and input the image pair into the registration network and the auxiliary feature extraction network; obtain abstract features through a down-sampling path, combine the abstract features with the high-dimensional features extracted by the auxiliary feature extraction network, and jointly enter the up-sampling path; and output a dense registration field of size H×W×L×3, and sample the original image with the registration field to obtain the registered image.
CN202011632743.2A 2020-12-31 2020-12-31 Unsupervised multi-view multi-mode intelligent glasses image registration method and device Active CN112598718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011632743.2A CN112598718B (en) 2020-12-31 2020-12-31 Unsupervised multi-view multi-mode intelligent glasses image registration method and device


Publications (2)

Publication Number Publication Date
CN112598718A true CN112598718A (en) 2021-04-02
CN112598718B CN112598718B (en) 2022-07-12

Family

ID=75206873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011632743.2A Active CN112598718B (en) 2020-12-31 2020-12-31 Unsupervised multi-view multi-mode intelligent glasses image registration method and device

Country Status (1)

Country Link
CN (1) CN112598718B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050249434A1 (en) * 2004-04-12 2005-11-10 Chenyang Xu Fast parametric non-rigid image registration based on feature correspondences
CN109064502A (en) * 2018-07-11 2018-12-21 西北工业大学 The multi-source image method for registering combined based on deep learning and artificial design features
CN109378054A (en) * 2018-12-13 2019-02-22 山西医科大学第医院 A kind of multi-modality images assistant diagnosis system and its building method
CN109767459A (en) * 2019-01-17 2019-05-17 中南大学 Novel ocular base map method for registering
CN110363797A (en) * 2019-07-15 2019-10-22 东北大学 A kind of PET and CT method for registering images inhibited based on excessive deformation
CN110838140A (en) * 2019-11-27 2020-02-25 艾瑞迈迪科技石家庄有限公司 Ultrasound and nuclear magnetic image registration fusion method and device based on hybrid supervised learning
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method
US20200184660A1 (en) * 2018-12-11 2020-06-11 Siemens Healthcare Gmbh Unsupervised deformable registration for multi-modal images
CN111369550A (en) * 2020-03-11 2020-07-03 创新奇智(成都)科技有限公司 Image registration and defect detection method, model, training method, device and equipment
CN111414968A (en) * 2020-03-26 2020-07-14 西南交通大学 Multi-mode remote sensing image matching method based on convolutional neural network characteristic diagram
CN112102385A (en) * 2020-08-20 2020-12-18 复旦大学 Multi-modal liver magnetic resonance image registration system based on deep learning
CN112150425A (en) * 2020-09-16 2020-12-29 北京工业大学 Unsupervised intravascular ultrasound image registration method based on neural network


Non-Patent Citations (1)

Title
车统统 (Che Tongtong): "Research and Application of Image Matching Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant