WO2022143314A1

WO2022143314A1 - Object registration method and apparatus

Info

Publication number: WO2022143314A1
Application number: PCT/CN2021/140241
Authority: WO
Inventors: 李尔; 杨威; 郑波; 刘建滨
Original assignee: 华为技术有限公司
Priority date: 2020-12-29
Filing date: 2021-12-21
Publication date: 2022-07-07
Also published as: CN114758334A

Abstract

An object registration method and apparatus, which relate to the field of computer vision and by means of which the problem of how to improve the accuracy of a pose of a terminal device are solved. The solution comprises: acquiring a plurality of first input images that contain a first object, wherein the plurality of first input images comprise a real image of the first object and/or a plurality of first composite images of the first object, the plurality of first composite images are obtained by means of performing differentiable rendering by using a three-dimensional model of the first object under a plurality of first poses, and the plurality of first poses are different; respectively extracting feature information of the plurality of first input images, wherein the feature information is used for indicating features of the first object in the first input images where the first object is located; and making the feature information extracted from each of the first input images correspond to an identifier of the first object, and registering the first object.

Description

Object registration method and device

This application claims the priority of the Chinese patent application with the application number of 202011607387.9 and the invention titled "An object registration method and device", which was submitted to the State Intellectual Property Office on December 29, 2020, the entire contents of which are incorporated herein by reference. middle.

technical field

The embodiments of the present application relate to the field of computer vision, and in particular, to an object registration method and apparatus.

Background technique

Computer vision is an integral part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military. What we need is the knowledge of the data and information of the subject being photographed. To put it figuratively, it is to install eyes (cameras/camcorders) and brains (algorithms) on the computer to identify, track and measure the target instead of the human eye, so that the computer can perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision is to use various imaging systems to replace the visual organ to obtain input information, and then use the computer to replace the brain to complete the processing and interpretation of these input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.

The pose detection and tracking of objects (people or objects) is a key technology in the field of computer vision, which can endow machines with the ability to perceive the three-dimensional spatial position and semantics of objects in the real environment. use.

In practical applications, a multi-object pose estimation network is usually constructed first to identify the poses of recognizable objects from the input images; (3 dimensions, 3D) network model, train the multi-object pose estimation network, register the object to be recognized in the multi-object pose estimation network, and realize that the multi-object pose estimation network can recognize the registered objects. When applied online, the image is input to the multi-object pose estimation network, and the multi-object pose estimation network recognizes the pose of the object in the picture.

When a new object to be recognized needs to be registered in the multi-object pose estimation network, the user provides the 3D model of the newly added object to be recognized, and the new 3D model and the original 3D model are used to estimate the pose of the multi-object. The retraining of the network leads to a linear increase in the training time, and affects the recognition effect of the multi-object pose estimation network on the recognizable objects that have been trained, resulting in a decrease in the detection accuracy and success rate.

SUMMARY OF THE INVENTION

The object registration method and device provided by the present application solve the problem of how to improve the accuracy of the pose of the terminal device.

To achieve the above object, the application adopts the following technical solutions:

In a first aspect, an object registration method is provided, the method may include: acquiring a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or an image of the first object a plurality of first composite images; the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses; the plurality of first poses are different; a plurality of first inputs are extracted respectively The feature information of the image, the feature information is used to indicate the features of the first object in the first input image where it is located; the feature information extracted from each first input image corresponds to the identification of the first object, and the first Registration of an object.

Through the object registration method provided by this application, the feature information of the object in the image is extracted for object registration, the registration time is short, and the recognition performance of other registered objects is not affected, even if many new objects are registered, the registration will not increase too much. time, also guarantees the detection performance of registered objects.

In a possible implementation manner, the above feature information may include descriptors of local feature points and descriptors of global features.

In another possible implementation manner, the feature information of the plurality of first input images is extracted respectively, which can be specifically implemented as: inputting the plurality of first input images into the first network respectively for pose recognition, and obtaining each first input image The pose of the first object in the image; according to the obtained pose of the first object, project the three-dimensional model of the first object to each of the first input images, respectively, to obtain the projection area in each of the first input images; For each projected region in the first input image, feature information is extracted. Wherein, the first network is used to identify the pose of the first object in the image. Through the pose of the first object in the image, the region of the first object in the image is determined, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.

In another possible implementation manner, the feature information of the plurality of first input images is extracted respectively, which can be specifically implemented as: inputting the plurality of first input images into the first network respectively to extract black and white images, and obtaining each first input image A black-and-white image of the first object in the image; extract feature information from the black-and-white image of the first object in each first input image respectively. Wherein, the first network is used to extract the black and white image of the first object in the image. The black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.

In another possible implementation manner, the object registration method provided by the present application may further include a process of optimizing the differentiable renderer, and the process may include: inputting N real images of the first object into the first network respectively, to obtain the first The second pose of the first object in each real image output by a network; N is greater than or equal to 1; the first network is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, a differentiable Under each second pose, the renderer can obtain N second composite images by micro-rendering; obtain a real image of one second pose corresponding to the second composite image rendered by the one second pose; intercept them respectively In each real image, the area at the same position of the first object in the second composite image corresponding to the real object is taken as the foreground image of each real image; a difference information to construct a first loss function; wherein, the first difference information is used to indicate the difference between the foreground image and its corresponding second composite image; the differentiable renderer is updated according to the first loss function, so that the output of the differentiable renderer has The synthetic image approximates the real image of the object. By optimizing the differentiable renderer, the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.

In another possible implementation manner, the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.

In another possible implementation manner, the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.

In another possible implementation manner, the object registration method provided by the present application may further include a method for training an object pose detection network, which may specifically include: acquiring multiple second input images including the first object, the multiple first The second input image includes a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; The multiple third poses are different; the multiple second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network It is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second image of the fourth pose The input image corresponds to a fourth composite image rendered by a fourth pose; a second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used for Indicate the difference between the fourth composite image and its corresponding second input image; update the second network according to the second loss function to obtain the first network; the pose of the first object in the image recognized by the first network is different from the first object in the image. The difference of the real pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image. By training the object pose detection network, the recognition accuracy of the pose recognition network is improved, and the difference between the output of the pose recognition network and the real pose of the object in the image is reduced.

In another possible implementation, the second loss function Loss ₂ can satisfy the following expression:

X is greater than or equal to 1, λ _i is a weight value, and L _i is used to represent a calculated value of the second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, so that the object pose detection network is trained, and the recognition accuracy of the pose detection network is improved.

In another possible implementation manner, the second difference information may include one or more of the following contents: the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image The difference between the intersection of union (IOU) of the black and white images, the difference between the fourth pose obtained from the fourth composite image and the pose obtained by the fourth composite image through the first network, the fourth composite image and its corresponding 2. The similarity of the region images in the input image at the same position as the first object in the fourth composite image. This implementation provides a possible realization of the second difference information and enriches the content of the second difference information.

In a second aspect, a display method is provided, the method may include: acquiring a first image; if the first image includes one or more identifiable objects, outputting first information, where the first information is used to prompt detection of the first image The image includes recognizable objects; the pose detection network corresponding to each recognizable object included in the first image is used to obtain the pose of each recognizable object in the first image; according to the pose of each recognizable object, display Virtual content corresponding to each recognizable object; if the first image does not include any recognizable object, output the second information, the second information is used to prompt that no recognizable object is detected, adjust the viewing angle to obtain the second image, and the first The second image is different from the first image.

With the display method provided in the present application, a prompt is output to the user whether the image includes a identifiable object, so that the user can intuitively obtain whether the image includes a identifiable object, which improves user experience.

In a possible implementation manner, the display method provided by the present application may further include: extracting feature information in the first image, where the feature information is used to indicate recognizable features in the first image; judging whether the feature library exists Feature information whose matching distance with the extracted feature information satisfies a preset condition; wherein, one or more feature information of different objects is stored in the feature library; if there is a feature whose matching distance with the feature information satisfies the preset condition in the feature library information, it is determined that the first image includes one or more identifiable objects; if there is no feature information whose matching distance with the feature information satisfies the preset condition in the feature library, it is determined that the first image does not include any identifiable objects. By comparing the feature information of the image with the feature library, it can be easily and quickly judged whether the image contains identifiable objects.

In a possible implementation manner, the preset condition may include less than or equal to a preset threshold.

In another possible implementation manner, the display method provided by the present application may further include: acquiring one or more first local feature points in the first image, descriptors of the first local feature points and local features in the feature library The matching distance of the descriptors of the points is less than or equal to the first threshold; the descriptors of the local feature points of different objects are stored in the feature library; according to the first local feature points, one or more regions of interest (regions) in the first image are determined. of interest, ROI); an object is included in an ROI; global features in each ROI are extracted; if there are one or more first global features in the global features in each ROI, it is determined that the first image includes the first global feature The identifiable object corresponding to the feature; wherein, the matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold; the descriptor of the global feature of different objects is also stored in the feature library ; If the first global feature does not exist in the global features in each ROI, it is determined that the first image does not include any identifiable object. First, the ROI area is determined by comparing the local feature points of the image with the feature library, and then the global features are extracted in the ROI area and compared with the feature library, which can improve the efficiency of judging whether the image contains identifiable objects, and also improves the judgment of whether the image contains identifiable objects. Identifiable object accuracy.

In another possible implementation manner, the display method provided by the present application may further include using in combination with the object registration method provided in the foregoing first aspect, and the display method provided by the present application may further include: acquiring a plurality of first objects including the first object. an input image, the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are composed of a three-dimensional model of the first object in a plurality of first Differentiable rendering is obtained under the pose; multiple first poses are different; feature information of multiple first input images is extracted respectively, and the feature information is used to indicate the feature of the first object in the first input image where it is located; The feature information extracted from each first input image is stored in the feature library corresponding to the identifier of the first object, and the first object is registered. Object registration is performed by extracting the feature information of objects in the image. The registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time will not be increased too much, and the registered objects are also guaranteed. detection performance.

In another possible implementation manner, the above feature information may include descriptors of local feature points and descriptors of global features.

It should be noted that, for the specific implementation of the object registration method, reference may be made to the foregoing first aspect, which will not be repeated here.

In another possible implementation manner, the display method provided by the present application may further include a process of optimizing the differentiable renderer, and the process may include: inputting N real images of the first object into the first network respectively, to obtain the first network The second pose of the first object in each output real image; N is greater than or equal to 1; the first network is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, a differentiable renderer is used Under each second pose, N second composite images are obtained by differentiable rendering; a real image of a second pose is obtained corresponding to the second composite image rendered by the one second pose; each image is intercepted separately In the real image, the area at the same position of the first object in the second synthetic image corresponding to the real object is used as the foreground image of each real image; according to the first difference between the foreground images of the N real images and their corresponding second synthetic images information to construct a first loss function; wherein, the first difference information is used to indicate the difference between the foreground image and its corresponding second composite image; the differentiable renderer is updated according to the first loss function, so that the output composite image of the differentiable renderer is Approach the real image of the object. By optimizing the differentiable renderer, the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.

In another possible implementation manner, the object registration method provided by the present application may further include a method for training an object pose detection network, which may specifically include: acquiring multiple second input images including the first object, the multiple first The second input image includes a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; The multiple third poses are different; the multiple second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network It is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second image of the fourth pose The input image corresponds to a fourth composite image rendered by a fourth pose; a second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used for Indicate the difference between the fourth composite image and its corresponding second input image; update the second network according to the second loss function to obtain the first network; the pose of the first object in the image recognized by the first network is different from the first object in the image. The difference of the real pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image. By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.

X is greater than or equal to 1, λ _i is a weight value, and L _i is used to represent a calculated value of the second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, optimizes the pose detection network, and improves the recognition accuracy of the pose detection network.

In another possible implementation manner, the second difference information may include one or more of the following contents: the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image The difference between the IOUs of the black and white images, the difference between the fourth pose obtained from the fourth composite image and the pose obtained by the fourth composite image through the first network, the fourth composite image and the corresponding second input image and the fourth composite image The similarity of the region images in the same position of the first object. This implementation provides a possible realization of the second difference information and enriches the content of the second difference information.

In a third aspect, the present application provides a method for training an object pose detection network, the method may specifically include: acquiring a plurality of second input images including a first object, the plurality of second input images including a real image of the first object image and/or a plurality of third composite images of the first object; the plurality of third composite images are rendered by the three-dimensional model of the first object in a plurality of third poses; the plurality of third poses are different; the A plurality of second input images are respectively input to the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network is used to identify the pose of the first object in the image. pose; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second input image of the fourth pose and a fourth pose rendering The obtained fourth composite image corresponds to; according to the second difference information of each fourth composite image and its corresponding second input image, a second loss function is constructed; the second difference information is used to indicate the fourth composite image and its corresponding No. The difference between the two input images; the second network is updated according to the second loss function to obtain the first network; the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the second The pose of the first object in the image recognized by the network is different from the real pose of the first object in the image.

By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.

It should be noted that, for the specific implementation of the third aspect, reference may be made to the specific implementation of the training object pose detection network described in the foregoing first aspect, which can achieve the same beneficial effects, which will not be repeated here.

In a fourth aspect, an object registration device is provided, the device includes: a first acquisition unit, an extraction unit, and a registration unit. in:

a first acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; a plurality of The first composite image is obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses; the multiple first poses are different.

The extraction unit is configured to extract feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.

The registration unit is used for registering the first object by corresponding the feature information extracted by the extraction unit in each first input image to the identifier of the first object.

With the object registration device provided in this application, the feature information of the object in the image is extracted to register the object, and the registration time is short, and the recognition performance of other registered objects is not affected. time, also guarantees the detection performance of registered objects.

In a possible implementation manner, the feature information includes descriptors of local feature points and descriptors of global features.

In another possible implementation manner, the extraction unit may be specifically configured to: input a plurality of first input images into the first network respectively for pose recognition, and obtain the pose of the first object in each first input image; Obtaining the pose of the first object, project the three-dimensional model of the first object to each first input image, respectively, to obtain the projection area in each first input image; respectively in the projection area in each first input image , extract feature information. Wherein, the first network is used to identify the pose of the first object in the image. Through the pose of the first object in the image, the region of the first object in the image is determined, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.

In another possible implementation manner, the extraction unit may be specifically configured to: input a plurality of first input images into a first network respectively to extract black and white images, and obtain a black and white image of the first object in each of the first input images; Within the black and white image of the first object in each of the first input images, feature information is extracted. Wherein, the first network is used to extract the black and white image of the first object in the image. The black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.

In another possible implementation manner, the apparatus may further include: a processing unit, a differentiable renderer, a screenshot unit, a construction unit, and an update unit. in:

The processing unit is configured to input the N real images of the first object into the first network respectively, and obtain the second pose of the first object in each real image output by the first network. N is greater than or equal to 1. The first network is used to identify the pose of the first object in the image.

The differentiable renderer is used to obtain N second composite images by differentiable rendering under each second pose according to the three-dimensional model of the first object; obtain a real image of the second pose and a second pose The rendered second composite image corresponds to.

The intercepting unit is used for intercepting, in each real image, an area at the same position of the first object in the second composite image corresponding to the real object, as the foreground image of each real image.

The construction unit is configured to construct a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second synthetic images. The first difference information is used to indicate the difference between the foreground image and its corresponding second composite image.

The updating unit is configured to update the differentiable renderer according to the first loss function, so that the synthetic image output by the differentiable renderer approximates the real image of the object.

By optimizing the differentiable renderer, the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.

In another possible implementation manner, the apparatus may further include: a second acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit. in:

a second acquiring unit, configured to acquire a plurality of second input images including the first object, the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; The three composite images are rendered by the three-dimensional model of the first object in multiple third poses. The multiple third poses are different.

The processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.

A differentiable renderer, configured to obtain a fourth composite image of the first object in each fourth pose by differentiable rendering according to the three-dimensional model of the first object. Obtaining a second input image in a fourth pose corresponds to a fourth composite image rendered in a fourth pose.

The construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image. The second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.

The updating unit is configured to update the second network according to the second loss function to obtain the first network. The difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the pose of the first object in the image recognized by the second network and the real pose of the first object in the image difference.

It should be noted that the object registration apparatus provided in the fourth aspect is used to implement the object registration method provided in the first aspect, and the specific implementation can refer to the specific implementation of the foregoing first aspect, which will not be repeated here.

In a fifth aspect, a display device is provided, the device includes a first acquisition unit, an output unit and a processing unit; wherein:

The first acquisition unit is used to acquire the first image.

The output unit is configured to output first information if the first image includes one or more identifiable objects, where the first information is used for prompting detection that the first image includes identifiable objects. If the first image does not include any recognizable object, output second information, the second information is used to prompt that no recognizable object is detected, adjust the viewing angle so that the first acquisition unit acquires the second image, the second image and the The first image is different.

The processing unit is configured to obtain the pose of each recognizable object in the first image through the pose detection network corresponding to each recognizable object included in the first image if the first image includes one or more recognizable objects ; According to the pose of each recognizable object, display the virtual content corresponding to each recognizable object.

With the display device provided in the present application, a prompt is output to the user whether the image includes a identifiable object, so that the user can intuitively obtain whether the image includes an identifiable object, which improves user experience.

In a possible implementation manner, the apparatus may further include: an extraction unit, a judgment unit, and a first determination unit. in:

An extraction unit, configured to extract feature information in the first image, where the feature information is used to indicate recognizable features in the first image.

The judgment unit is used for judging whether there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library. Among them, the feature library stores one or more feature information of different objects.

The first determination unit is configured to determine that one or more identifiable objects are included in the first image if there is feature information whose matching distance with the feature information meets a preset condition in the feature library; The feature information whose distance satisfies the preset condition is matched, and it is determined that the first image does not include any identifiable object.

By comparing the feature information of the image with the feature library, it can be easily and quickly judged whether the image contains identifiable objects.

In another possible implementation manner, the preset condition may include less than or equal to a preset threshold.

In another possible implementation manner, the apparatus may further include: a second acquiring unit, a second determining unit, and a first determining unit. in:

The second acquiring unit is configured to acquire one or more first local feature points in the first image, and the matching distance between the descriptors of the first local feature points and the descriptors of the local feature points in the feature library is less than or equal to the first Threshold; descriptors of local feature points of different objects are stored in the feature library.

The second determining unit is configured to determine one or more ROIs in the first image according to the first local feature points; one ROI includes one object.

Correspondingly, the extraction unit can also be used to extract global features in each ROI.

a first determining unit, configured to determine that the first image includes an identifiable object corresponding to the first global feature if there are one or more first global features in the global features in each ROI; if the global features in each ROI If the first global feature does not exist in the first image, it is determined that the first image does not include any identifiable object. The matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold. Descriptors of global features of different objects are also stored in the feature library.

First, the ROI area is determined by comparing the local feature points of the image with the feature library, and then the global features are extracted in the ROI area and compared with the feature library, which can improve the efficiency of judging whether the image contains identifiable objects, and also improves the judgment of whether the image contains identifiable objects. Identifiable object accuracy.

In another possible implementation manner, the apparatus may further include: a third acquisition unit, an extraction unit, and a registration unit. in:

a third acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; a plurality of first input images A composite image is obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses; the multiple first poses are different.

The extraction unit extracts feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.

The registration unit is configured to store the feature information extracted from each first input image corresponding to the identifier of the first object in the feature database, and register the first object.

Object registration is performed by extracting the feature information of objects in the image. The registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time will not be increased too much, and the registered objects are also guaranteed. detection performance.

In another possible implementation manner, the apparatus may further include: a fourth acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit. in:

a fourth acquisition unit, configured to acquire a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; a plurality of The third composite image is obtained by rendering the three-dimensional model of the first object in multiple third poses; the multiple third poses are different.

The differentiable renderer is used for obtaining a fourth composite image of the first object in each fourth pose according to the three-dimensional model of the first object, and obtaining a second input image of a fourth pose and a first The fourth composite image obtained by the four pose rendering corresponds to.

The construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image difference.

The updating unit is used to update the second network according to the second loss function to obtain the first network; the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than that of the second network The recognized pose of the first object in the image is different from the real pose of the first object in the image.

X is greater than or equal to 1, λ _i is a weight value, and L _i is used to represent a calculated value of the second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, trains the pose detection network, and improves the recognition accuracy of the pose detection network.

It should be noted that the display device provided in the fifth aspect is used to implement the display method provided in the above-mentioned second aspect, and the specific implementation thereof can refer to the specific implementation of the foregoing second aspect, which will not be repeated here.

In a sixth aspect, an apparatus for training an object pose detection network is provided, the apparatus may include: an acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit. in:

an acquiring unit, configured to acquire multiple second input images including the first object, the multiple second input images including the real image of the first object and/or multiple third composite images of the first object; multiple third The composite image is rendered by the three-dimensional model of the first object in multiple third poses; the multiple third poses are different.

It should be noted that the device for optimizing the pose recognition network provided in the sixth aspect is used to implement the method for optimizing the pose recognition network provided in the third aspect. The specific implementation can refer to the specific implementation of the third aspect. Here No longer.

In a seventh aspect, the present application provides an electronic device, which can implement the functions in the method examples described in the first aspect or the second aspect or the third aspect, and the functions can be implemented by hardware or by hardware. Execute the corresponding software implementation. The hardware or software includes one or more modules corresponding to the above functions. The electronic device may exist in the form of a chip product.

In one possible implementation, the electronic device may include a processor and a transmission interface. Among them, the transmission interface is used to receive and send data. The processor is configured to invoke program instructions stored in the memory to cause the electronic device to perform the functions in the method examples described in the first aspect or the second aspect or the third aspect above.

In an eighth aspect, a computer-readable storage medium is provided, comprising instructions that, when run on a computer, cause the computer to execute the object registration method or display method described in any one of the above aspects or any possible implementation manner, Or a method for training a network for object pose detection.

A ninth aspect provides a computer program product that, when running on a computer, enables the computer to execute the object registration method described in any of the above aspects or any possible implementation manner, or the display method, or the training object pose A method to detect the network.

In a tenth aspect, a chip system is provided, the chip system includes a processor, and may also include a memory, for implementing the functions in the above method. The chip system can be composed of chips, and can also include chips and other discrete devices.

The solutions provided in the fourth aspect to the tenth aspect are used to realize the method provided in the first aspect or the second aspect or the third aspect, so the same beneficial effects as the first aspect or the second aspect or the third aspect can be achieved. , and will not be repeated here.

It should be noted that, various possible implementation manners of any one of the above aspects can be combined on the premise that the solutions are not contradictory.

Description of drawings

FIG. 1 is a schematic structural diagram of a terminal device according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a software structure of a terminal device according to an embodiment of the present application;

3 is a schematic diagram of a virtual-real fusion effect of explanation information and a real object provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a pose detection process provided by an embodiment of the present application;

5a is a schematic flowchart of a method for incremental learning provided by an embodiment of the present application;

FIG. 5b is a schematic diagram of a system architecture provided by an embodiment of the present application;

6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

7a is a schematic diagram of a chip hardware structure provided by an embodiment of the application;

FIG. 7b is a schematic diagram of the overall system framework of the solution provided by the embodiment of the present application;

8 is a schematic flowchart of an object registration method provided in Embodiment 1 of the present application;

9 is a schematic diagram of a spherical surface provided by an embodiment of the present application;

10 is a schematic flowchart of another object registration method provided in Embodiment 1 of the present application;

11 is a schematic diagram of an image area provided by an embodiment of the present application;

12 is a schematic flowchart of an optimization method provided in Embodiment 2 of the present application;

13 is a schematic flowchart of another optimization method provided in Embodiment 2 of the present application;

14 is a schematic flowchart of still another object registration method provided by an embodiment of the present application;

15 is a schematic flowchart of a method for training an object pose detection network according to Embodiment 3 of the present application;

16 is a schematic flowchart of another method for training an object pose detection network provided in Embodiment 3 of the present application;

17 is a schematic flowchart of a display method provided in Embodiment 4 of the present application;

18 is a schematic flowchart of determining whether a first image contains an identifiable object according to Embodiment 4 of the present application;

19 is a schematic diagram of a mobile phone interface provided by an embodiment of the application;

FIG. 20a is a schematic diagram of another mobile phone interface provided by an embodiment of the present application;

20b is a schematic diagram of still another mobile phone interface provided by the embodiment of the application;

FIG. 21 is a schematic structural diagram of an object registration apparatus provided by an embodiment of the present application;

22 is a schematic structural diagram of an optimization device provided by an embodiment of the application;

23 is a schematic structural diagram of an apparatus for training an object pose detection network provided by an embodiment of the application;

FIG. 24 is a schematic structural diagram of a display device according to an embodiment of the present application;

FIG. 25 is a schematic structural diagram of an apparatus provided by an embodiment of the present application.

Detailed ways

In the embodiments of the present application, in order to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used to distinguish the same or similar items with basically the same functions and functions. Those skilled in the art can understand that the words "first", "second" and the like do not limit the quantity and execution order, and the words "first", "second" and the like are not necessarily different. The technical features described in the "first" and second" have no sequence or order of magnitude.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to represent examples, illustrations or illustrations. Any embodiments or designs described in the embodiments of the present application as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present the related concepts in a specific manner to facilitate understanding.

In the embodiments of the present application, at least one may also be described as one or more, and the multiple may be two, three, four or more, which is not limited in this application.

In addition, the network architecture and scenarios described in the embodiments of the present application are for the purpose of illustrating the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. The evolution of the network architecture and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

Before describing the embodiments of the present application, the terms involved in the present application will be uniformly explained and explained here, and will not be explained one by one later.

An image, also known as a picture, is a picture with visual effects. The image referred to in this application may be a still image, a video frame in a video stream, or other, which is not limited.

Object, a person or thing that can exist. For example, things can be buildings, commodities, plants, animals, etc., which are not listed here.

The pose refers to the pose of the object in the camera coordinate system. The poses can include 6DoF poses, that is, the translation pose and rotation pose of the object relative to the camera.

Pose detection refers to detecting and recognizing the pose of an object in an image.

The real image of an object refers to a static image with visual effects and a drawing image of the background area containing the object. The real image of the object can be in red, green, blue, RGB format or RGBD (red, green, blue, depth) format.

Rendering is the process of converting a 3D model of an object into a 2D image through a renderer. Usually, scenes and entities are represented in three-dimensional form, which can be closer to the real world and facilitate manipulation and transformation, while graphics display devices are mostly two-dimensional rasterized displays and dot-matrix printers. A raster display can be regarded as a pixel matrix, and any graphic displayed on a raster display is actually a collection of pixels with one or more colors and grayscales. The representation of a three-dimensional solid scene through rasterization and lattice is image rendering -- that is, rasterization.

Conventional rendering refers to a rendering method in which rasterization is not differentiable.

Differentiable rendering refers to a rendering method in which rasterization is differentiable. Since the rendering process is differentiable, a loss function can be constructed according to the difference between the rendered image and the real image, the parameters of the differentiable rendering can be updated, and the authenticity of the differentiable rendering results can be improved.

The composite image of an object refers to an image containing only the object obtained by rendering the 3D model of the object in a desired pose. The composite image of an object in a certain pose is equivalent to the image obtained by taking pictures of the object in this pose.

The image corresponding to the composite image refers to the pose of rendering the composite image, which is obtained by inputting the corresponding image to the neural network. It should be understood that the composite image and the source image for rendering the pose of the composite image correspond to each other, which will not be described in detail below. The image corresponding to the composite image may be a real image of the object, or may be other composite images of the object.

A black-and-white image of an object is a black-and-white pixel image that contains only the object and no background. Specifically, the black-and-white image of the object can be represented by a binarized image, the pixel value of the region including the object is 1, and the pixel value of other regions is 0.

Local feature points are local expressions of image features, which reflect the local characteristics of the image. Local feature points are points on the image that are clearly distinguishable from other pixels, including but not limited to corner points, key points, and the like. In image processing, local feature points mainly refer to scale-invariant points or blocks. Scale invariance means that the same object or scene is collected from different angles, and the same place can be identified as the same. The local feature points may include SIFT feature points, SURF feature points, DAISY feature points, and the like. Usually, the local feature points of the image can be extracted by methods such as FAST and DOG. The descriptor of a local feature point is a high-dimensional vector representing the local image information of the feature point.

The global feature refers to the feature that can represent the entire image. The global feature is relative to the local features of the image and is used to describe the overall features such as the color and shape of the image or target. For example, global features may include color features, texture features, shape features, and the like. Usually, the bag-of-words method can be used to extract the global features of the image. The descriptor of the global feature is a high-dimensional vector that represents the image information of the entire image or a larger area.

The calculated value may refer to a mathematical calculation value of multiple data, and the mathematical calculation may be an average, a maximum, or a minimum, or others.

In order to describe the following embodiments clearly and concisely, a brief introduction of the related technology is given first:

In recent years, the functions of terminal devices have become more and more abundant, which brings better user experience to users. For example, a terminal device can implement a virtual reality (virtual reality, VR) function, so that the user is in the virtual world and experiences the virtual world. For another example, a terminal device can implement an augmented reality (AR) function, combine virtual objects with real scenes, and enable users to interact with virtual objects.

The terminal device may be a smartphone, a tablet computer, a wearable device, an AR/VR device, or the like. This application does not limit the specific form of the terminal device. Wearable devices can also be called wearable smart devices, which are the general term for the intelligent design of daily wear and the development of wearable devices using wearable technology, such as glasses, gloves, watches, clothing and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. Wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include full-featured, large-scale, complete or partial functions without relying on smart phones, such as smart watches or smart glasses, and only focus on a certain type of application function, which needs to cooperate with other devices such as smart phones. Use, such as all kinds of smart bracelets, smart jewelry, etc. for physical sign monitoring.

In this application, the structure of the terminal device may be as shown in FIG. 1 . As shown in FIG. 1 , the terminal device 100 may include a processor 110 , an external memory interface 120 , an internal memory 121 , a universal serial bus (USB) interface 130 , a charging management module 140 , a power management module 141 , and a battery 142 , Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193 , a display screen 194, and a subscriber identification module (subscriber identification module, SIM) card interface 195 and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.

It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the terminal device 100 . In other embodiments, the terminal device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors. For example, in the present application, the processor 110 may control to turn on other cameras when it is determined that the first image satisfies the abnormal condition.

The controller may be the nerve center and command center of the terminal device 100 . The controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interface may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous transceiver (universal asynchronous transmitter) receiver/transmitter, UART) interface, mobile industry processor interface (MIPI), general-purpose input/output (GPIO) interface, subscriber identity module (SIM) interface, and / or universal serial bus (universal serial bus, USB) interface, etc.

The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 . MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc. In some embodiments, the processor 110 communicates with the camera 193 through a CSI interface to implement the shooting function of the terminal device 100. The processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the terminal device 100 .

The GPIO interface can be configured by software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.

The USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like. The USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones. This interface can also be used to connect other terminal devices, such as AR devices.

It can be understood that the interface connection relationship between the modules illustrated in this embodiment is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 . In other embodiments of the present application, the terminal device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.

The power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 . The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110 . In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.

The wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.

The terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (active-matrix organic light). emitting diode, AMOLED), flexible light-emitting diode (flex light-emitting diode, FLED), Miniled, MicroLed, Micro-oled, quantum dot light-emitting diode (quantum dot light emitting diodes, QLED) and so on. In some embodiments, the terminal device 100 may include one or N display screens 194 , where N is a positive integer greater than one.

A series of graphical user interfaces (graphical user interfaces, GUIs) may be displayed on the display screen 194 of the terminal device 100 , and these GUIs are the main screens of the terminal device 100 . Generally speaking, the size of the display screen 194 of the terminal device 100 is fixed, and only limited controls can be displayed in the display screen 194 of the terminal device 100 . A control is a GUI element, which is a software component that is included in an application and controls all the data processed by the application and the interaction with this data. The user can interact with the control through direct manipulation (direct manipulation). , so as to read or edit the relevant information of the application. In general, controls may include icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, and other visual interface elements.

The terminal device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.

The ISP is used to process the data fed back by the camera 193 . For example, when taking a photo, the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin tone. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193 .

Camera 193 is used to capture still images or video. The object is projected through the lens to generate an optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the terminal device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.

A digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, and the like.

Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in various encoding formats, for example, moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, such as the transfer mode between neurons in the human brain, it can quickly process the input information, and can continuously learn by itself. Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.

The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100 . The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example to save files like music, video etc in external memory card.

Internal memory 121 may be used to store computer executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the terminal device 100 by executing the instructions stored in the internal memory 121 . For example, in this embodiment, the processor 110 may acquire the pose of the terminal device 100 by executing the instructions stored in the internal memory 121 . The internal memory 121 may include a storage program area and a storage data area. The storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like. The storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal device 100 and the like. In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

The terminal device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playback, recording, etc.

The audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .

Speaker 170A, also referred to as a "speaker", is used to convert audio electrical signals into sound signals. The terminal device 100 can listen to music through the speaker 170A, or listen to a hands-free call.

The receiver 170B, also referred to as "earpiece", is used to convert audio electrical signals into sound signals. When the terminal device 100 answers a call or a voice message, the voice can be answered by placing the receiver 170B close to the human ear.

The microphone 170C, also called "microphone" or "microphone", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can make a sound by approaching the microphone 170C through a human mouth, and input the sound signal into the microphone 170C. The terminal device 100 may be provided with at least one microphone 170C. In other embodiments, the terminal device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.

The earphone jack 170D is used to connect wired earphones. The earphone interface 170D can be the USB interface 130, or can be a 3.5mm open mobile terminal platform (OMTP) standard interface, a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194 . There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, and the like. The capacitive pressure sensor may be comprised of at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes. The terminal device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the terminal device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The terminal device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.

The gyro sensor 180B may be used to determine the motion attitude of the terminal device 100 . In some embodiments, the angular velocity of the end device 100 about three axes (ie, the x, y and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shaking angle of the terminal device 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the terminal device 100 through reverse motion to achieve anti-shake. The gyro sensor 180B can also be used for navigation and somatosensory game scenarios.

The air pressure sensor 180C is used to measure air pressure. In some embodiments, the terminal device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.

The magnetic sensor 180D includes a Hall sensor. The terminal device 100 can detect the opening and closing of the flip holster using the magnetic sensor 180D. In some embodiments, when the terminal device 100 is a flip machine, the terminal device 100 can detect the opening and closing of the flip according to the magnetic sensor 180D. Further, according to the detected opening and closing state of the leather case or the opening and closing state of the flip cover, characteristics such as automatic unlocking of the flip cover are set.

The acceleration sensor 180E can detect the magnitude of the acceleration of the terminal device 100 in various directions (generally three axes). The magnitude and direction of gravity can be detected when the terminal device 100 is stationary. It can also be used to identify the posture of terminal devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.

Distance sensor 180F for measuring distance. The terminal device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the terminal device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.

Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The light emitting diodes may be infrared light emitting diodes. The terminal device 100 emits infrared light to the outside through the light emitting diode. The terminal device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device 100 . When insufficient reflected light is detected, the terminal device 100 may determine that there is no object near the terminal device 100 . The terminal device 100 can use the proximity light sensor 180G to detect that the user holds the terminal device 100 close to the ear to talk, so as to automatically turn off the screen to save power. Proximity light sensor 180G can also be used in holster mode, pocket mode automatically unlocks and locks the screen.

The ambient light sensor 180L is used to sense ambient light brightness. The terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal device 100 is in a pocket, so as to prevent accidental touch.

The fingerprint sensor 180H is used to collect fingerprints. The terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.

The temperature sensor 180J is used to detect the temperature. In some embodiments, the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute the temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal device 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is lower than another threshold, the terminal device 100 heats the battery 142 to avoid abnormal shutdown of the terminal device 100 caused by the low temperature. In some other embodiments, when the temperature is lower than another threshold, the terminal device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.

Touch sensor 180K, also called "touch device". The touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”. The touch sensor 180K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to touch operations may be provided through display screen 194 . In other embodiments, the touch sensor 180K may also be disposed on the surface of the terminal device 100 , which is different from the position where the display screen 194 is located.

The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice. The bone conduction sensor 180M can also contact the pulse of the human body and receive the blood pressure beating signal. In some embodiments, the bone conduction sensor 180M can also be disposed in the earphone, combined with the bone conduction earphone. The audio module 170 can analyze the voice signal based on the vibration signal of the voice vibration bone block obtained by the bone conduction sensor 180M, and realize the voice function. The application processor can analyze the heart rate information based on the blood pressure beat signal obtained by the bone conduction sensor 180M, and realize the function of heart rate detection.

The keys 190 include a power-on key, a volume key, and the like. Keys 190 may be mechanical keys. It can also be a touch key. The terminal device 100 may receive key input and generate key signal input related to user settings and function control of the terminal device 100 .

Motor 191 can generate vibrating cues. The motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, playing audio, etc.) can correspond to different vibration feedback effects. The motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 . Different application scenarios (for example: time reminder, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.

The indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.

In addition, an operating system runs on the above-mentioned components. For example, the iOS operating system developed by Apple, the Android open source operating system developed by Google, and the Windows operating system developed by Microsoft. Applications can be installed and run on this operating system.

The operating system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .

FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.

The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.

The application layer can include a series of application packages. As shown in Figure 2, the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on. For example, when taking pictures, the camera application can access the camera interface management service provided by the application framework layer.

The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions. As shown in Figure 2, the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like. For example, in this embodiment of the present application, when taking pictures, the application framework layer may provide the application layer with APIs related to the photographing function, and provide the application layer with a camera interface management service to realize the photographing function.

A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.

Content providers are used to store and retrieve data and make these data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.

The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications. A display interface can consist of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.

The telephony manager is used to provide the communication function of the terminal device 100 . For example, the management of call status (including connecting, hanging up, etc.).

The resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.

The notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc. The notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the terminal device vibrates, and the indicator light flashes.

Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one is the function functions that the java language needs to call, and the other is the core library of Android.

The application layer and the application framework layer run in virtual machines. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.

A system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.

The Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: moving picture experts group (MPEG) 4, H.264, MP3, advanced audio coding (AAC), adaptive multi-rate (adaptive) multi rate, AMR), joint photographic experts group (joint photo graphic experts group, JPEG), portable network graphic format (portable network graphic format, PNG) and so on.

The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.

Two-dimensional (2 dimensions, 2D) graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.

It should be noted that although the embodiments of the present application are based on

The system is used as an example to illustrate, but its basic principles are also applicable to

or

Terminal devices such as operating systems.

The technical solutions in the present application will be described below with reference to the accompanying drawings.

The three-dimensional object registration method provided by the embodiments of the present application can be applied to AR, VR, and scenarios where virtual content of an object needs to be displayed. Specifically, the three-dimensional object registration method in this embodiment of the present application can be applied in an AR scene. The following describes the workflow of the software and hardware of the terminal device 100 with reference to FIG. 1 and the AR scene.

The touch sensor 180K receives the touch operation and reports it to the processor 110 , so that the processor starts the AR application in response to the above-mentioned touch operation, and displays the user interface of the AR application on the display screen 194 . For example, after receiving the touch operation on the AR icon, the touch sensor 180K reports the touch operation on the AR icon to the processor 110, so that the processor 110 responds to the above touch operation, starts the AR application corresponding to the AR icon, and displays it on the display screen. 194 shows the user interface for AR. In addition, in this embodiment of the present application, the terminal may also enable AR in other manners, and display the AR user interface on the display screen 194. For example, when the terminal is on a black screen, displays a lock screen interface, or displays a certain user interface after being unlocked, the terminal may start AR in response to a user's voice command or a shortcut operation, and display the AR user interface on the display screen 194 .

In the APP for detecting and tracking the pose of the object in the terminal device 100, a network for detecting the pose of the object is configured. When the user starts the APP for detecting and tracking the pose of the object in the terminal device 100, the terminal device 100 captures an image in the field of view through the camera 193, and the network that detects the pose of the object recognizes the identifiable objects included in the image. and acquire the pose of the recognizable object, and then superimpose and display the virtual content corresponding to the recognizable object on the captured image through the display screen 194 according to the acquired pose.

Exemplarily, in shopping malls, museums, exhibition hall visits, etc., after capturing the scene image, the terminal device 100 recognizes the identifiable objects in the image, and uses the explanation information of the identifiable objects according to the pose of the identifiable object, according to the position and position of the identifiable object. Gesture presents the corresponding explanation content to the user by superimposing it on different positions of the three-dimensional object accurately in the three-dimensional space. Figure 3 shows the virtual-real fusion effect of the explanation information and the real object, which can intuitively and vividly present the information of the real object in the image to the user.

In the current scene where the virtual content of the object is displayed, the network configured in the terminal device 100 for detecting the pose of the object is usually trained offline according to the three-dimensional model of the object, so that the network supports the Pose recognition, that is, the registration of the object to be recognized in the network is completed. In the actual application process of 3D object pose detection, it is necessary to add and delete identifiable objects efficiently and quickly. Currently, recognizable objects are usually added based on machine learning methods. For each new object, all recognizable objects need to be retrained, which will lead to a linear increase in training time and affect the trained objects. Identify the effect.

Figure 4 illustrates an existing pose detection process. As shown in Figure 4, a picture containing an object is input to a multi-object pose estimation network, and the multi-object pose estimation network outputs the pose of the recognizable objects contained in the picture. and categories, and perform pose optimization on the output pose. The multi-object pose estimation network used in this process is generated by offline training. When the user needs to add a new identifiable object, it needs to be retrained together with all the identifiable objects supported by the original network to obtain a new Multi-object pose estimation network, enabling support for pose detection of newly identifiable objects. In this way, re-training of new objects will lead to a sharp increase in training time, and new objects will affect the pose detection effect of the trained objects, resulting in a decrease in detection accuracy and success rate.

In order to solve the problem of increased training time caused by retraining all identifiable objects for new objects, the industry has proposed an incremental learning method. The process of this method can be shown in Figure 5a. The user submits a 3D model of the object to be recognized. , train a multi-object pose estimation network M0 according to the submitted 3D model; when adding a new identifiable object, the user submits a new 3D model of the object to be recognized, based on the trained network M0, using the trained object's 3D model A small amount of data is incrementally trained to obtain a new network M1; when recognizable objects are added, the user continues to submit the 3D model of the object expected to be recognized, and based on the trained network M1, a small amount of data of the trained object is used Perform incremental training to get a new network M2, and so on. When a new object to be recognized is added to the training set, this scheme only uses a small amount of data of the trained object, and performs incremental learning on the basis of the existing model, which can greatly reduce the retraining time. However, since incremental learning faces the problem of catastrophic forgetting, that is, the training of new models only refers to a small amount of data of recognized objects, and as the number of new objects increases, the performance of already trained objects will drop sharply.

Based on this, the present application provides a three-dimensional object registration method, which specifically includes: configuring a single-object pose detection network, and constructing a loss function using the difference between a real image of a three-dimensional object and a composite image obtained by differentiable rendering under multiple poses, to train a single-object pose detection network to obtain a pose detection network for extracting the three-dimensional object, and then use the trained pose detection network to extract the real image of the three-dimensional object and the differentiable rendering composite image in multiple poses In the feature of the three-dimensional object, record the feature and the identification of the three-dimensional object, and complete the registration of the three-dimensional object. Because the three-dimensional object registration method provided in this application adopts a single-object pose detection network, even if a new recognizable object is added, the training time is short, and the recognition effect of other recognizable objects will not be affected; Differentiable rendering of synthetic images in multiple poses builds a loss function that improves the accuracy of single-object pose detection networks.

The method provided by this application is described below from the model training side and the model application side:

The method for obtaining the pose of an object provided by the embodiments of the present application involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning. Symbolized and formalized intelligent information modeling, extraction, preprocessing, training, etc., finally obtain a trained single-object pose detection network; and, the object registration method provided in the embodiment of the present application can use the above-trained single-object Pose detection network, input the input data (such as the image including the object to be recognized in this application) into the trained single object pose detection network corresponding to the object to be recognized, and obtain output data (such as in this application) pose of recognizable objects in the image). It should be noted that the training method and the object registration method of the single-object pose detection network provided by the embodiments of the present application are inventions based on the same concept, and can also be understood as two parts in a system, or as part of an overall process. Two stages: such as model training stage and model application stage.

Since the embodiments of the present application involve a large number of neural network applications, for ease of understanding, related terms and neural networks and other related concepts involved in the embodiments of the present application are first introduced below.

(1) Neural Network (NN)

Neural network is a machine learning model, which is a kind of machine learning technology that simulates the neural network of the human brain to realize artificial intelligence. The input and output of the neural network can be configured according to actual needs, and the neural network can be trained through sample data, so that the error between its output and the real output corresponding to the sample data is minimized. A neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x _s and an intercept 1 as input, and the output of the operation unit can be:

Among them, s=1, 2, ... n, n is a natural number greater than 1, W _s is the weight of x _s , and b is the bias of the neural unit. f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.

(2) Deep neural network

Deep neural network (deep neural network, DNN), also known as multi-layer neural network, can be understood as a neural network with many hidden layers, and there is no special metric for "many" here. From the division of DNN according to the position of different layers, the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer. Although DNN looks complicated, in terms of the work of each layer, it is not complicated. In short, it is the following linear relationship expression:

in,

is the input vector,

is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer is just an input vector

After such a simple operation to get the output vector

Due to the large number of DNN layers, the number of coefficients W and offset vector b is also large. These parameters are defined in the DNN as follows: Take the coefficient W as an example: Suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as

The superscript 3 represents the number of layers where the coefficient W is located, and the subscript corresponds to the output third layer index 2 and the input second layer index 4. The summary is: the coefficient from the kth neuron in the L-1 layer to the jth neuron in the Lth layer is defined as

It should be noted that the input layer does not have a W parameter. In a deep neural network, more hidden layers allow the network to better capture the complexities of the real world. In theory, a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).

(3) Convolutional Neural Network

Convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In a convolutional layer of a convolutional neural network, a neuron can only be connected to some of its neighbors. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.

The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network. In addition, the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

(4) Recurrent neural networks (RNN) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, and each node in each layer is unconnected. Although this ordinary neural network solves many problems, it is still powerless for many problems. For example, if you want to predict the next word of a sentence, you generally need to use the previous words, because the front and rear words in a sentence are not independent. The reason why RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous output. The specific manifestation is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layer and this layer are no longer unconnected but connected, and the input of the hidden layer not only includes The output of the input layer also includes the output of the hidden layer at the previous moment. In theory, RNNs can process sequence data of any length. The training of RNN is the same as the training of traditional CNN or DNN. The error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is expanded, the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case. And in the gradient descent algorithm, the output of each step depends not only on the network of the current step, but also on the state of the network in the previous steps. This learning algorithm is called the Back propagation Through Time (BPTT) algorithm based on time.

Why use a recurrent neural network when you already have a convolutional neural network? The reason is very simple. In the convolutional neural network, there is a premise that the elements are independent of each other, and the input and output are also independent, such as cats and dogs. But in the real world, many elements are interconnected, such as the change of stocks over time, and another example of a person who said: I like to travel, and my favorite place is Yunnan. I must go there in the future. Fill in the blanks here. Humans should all know that it is "Yunnan". Because humans make inferences based on the content of the context, but how do you get machines to do this? RNN came into being. RNNs are designed to give machines the ability to memorize like humans do. Therefore, the output of RNN needs to rely on current input information and historical memory information.

(6) Loss function

In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then based on the difference between the two to update the weight vector of each layer of neural network (of course, there is usually an initialization process before the first update, that is, to pre-configure parameters for each layer in the deep neural network), for example, if the predicted value of the network If it is high, adjust the weight vector to make its prediction lower, and keep adjusting until the deep neural network can predict the real desired target value or a value that is very close to the real desired target value. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function (loss function) or objective function (objective function), which are used to measure the difference between the predicted value and the target value. important equation. Among them, taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference, then the training of the deep neural network becomes the process of reducing the loss as much as possible.

(7) Pixel value

The pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color. For example, the pixel value is 256*Red+100*Green+76Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness. For grayscale images, the pixel values can be grayscale values.

The following describes the system architecture provided by the embodiments of the present application.

Referring to FIG. 5b, an embodiment of the present invention provides a system architecture 500. As shown in the system architecture 500, the data collection device 560 is used to collect training data. In this embodiment of the present application, the training data includes: real images and/or synthetic images of objects to be recognized; The device 520 obtains the target model/rule 501 by training based on the training data maintained in the database 530 . The following will describe in more detail how the training device 520 obtains the target model/rule 501 based on the training data with the second embodiment and the third embodiment. The target model/rule 501 may be the single-object pose detection network ( The first network for extracting the pose of the object in the image), that is, inputting the image into the target model/rule 501 to obtain the pose of the identifiable object included in the image; or, the target model/rule 501 may be an embodiment of the application The differentiable renderer described in , inputting the 3D model and the preset pose of the object into the target model/rule 501 can obtain the composite image of the object in the preset pose. The target model/rule 501 in the embodiment of the present application may specifically be a single-object pose detection network or a differentiable renderer. In the embodiment provided by the present application, the single-object pose detection network is performed by training a single-object basic pose Detected by the network. It should be noted that, in practical applications, the training data maintained in the database 530 may not necessarily all come from the collection of the data collection device 560, and may also be received from other devices. In addition, it should be noted that the training device 520 may not necessarily train the target model/rule 501 entirely based on the training data maintained by the database 530, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a reference to this application Limitations of Examples.

The target model/rule 501 trained according to the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5b, the execution device 510 can be a terminal, such as a mobile phone terminal, a tablet Notebook computer, AR/VR, vehicle terminal, etc., it can also be a server or cloud, etc. In FIG. 5b, the execution device 510 is configured with an I/O interface 512, which is used for data interaction with external devices. The user can input data to the I/O interface 512 through the client device 540, and the input data is described in the embodiments of the present application. may include: a three-dimensional model of the object to be recognized, a real image of the object to be recognized, and a composite image rendered by the three-dimensional model of the object to be recognized in different poses.

When the computing module 511 of the execution device 510 performs calculations and other related processing processes, the execution device 510 may call the data, codes, etc. in the data storage system 550 for corresponding processing, or may store the data, instructions, etc. obtained from the corresponding processing. Stored in data storage system 150 .

Finally, the I/O interface 512 returns the processing result, such as the virtual content and pose of the recognizable object in the obtained image, to the client device 140, so as to provide the user with the virtual content displayed according to the pose, so as to realize the experience of combining virtual and real .

It is worth noting that the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.

In the case shown in FIG. 5 b , the user can manually specify input data, which can be operated through the interface provided by the I/O interface 512 . In another case, the client device 540 can automatically send the input data to the I/O interface 512 . If the user's authorization is required to require the client device 540 to automatically send the input data, the user can set the corresponding permission in the client device 140 . The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form can be a specific manner such as display, sound, and action. The client device 540 can also be used as a data collection terminal to collect the input data of the input I/O interface 512 and the output result of the output I/O interface 512 as new sample data as shown in FIG. 5b and store them in the database 530. Of course, it is also possible not to collect through the client device 540, but the I/O interface 512 directly uses the input data input into the I/O interface 512 and the output result of the output I/O interface 512 as shown in the figure as a new sample The data is stored in database 530.

It is worth noting that FIG. 5b is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 5b, the data The storage system 550 is an external memory relative to the execution device 110 , and in other cases, the data storage system 550 may also be placed in the execution device 510 .

The methods and apparatuses provided in the embodiments of the present application can also be used to expand the training database. As shown in FIG. 5b, the I/O interface 512 of the execution device 510 can convert the images processed by the execution device (such as the objects to be recognized in different poses) The rendered synthetic image) and the real image of the object to be recognized input by the user are sent to the database 530 as a pair of training data, so that the training data maintained by the database 530 is more abundant, thereby providing more abundant training for the training work of the training device 520 data.

As shown in FIG. 5b, a target model/rule 501 is obtained by training according to the training device 520, and the target model/rule 501 may be a single-object pose recognition network (the first network, second network). The single-object pose recognition networks provided in the embodiments of the present application may all be convolutional neural networks, recurrent neural networks, or others.

As mentioned in the introduction to the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture. learning at multiple levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.

As shown in FIG. 6 , a convolutional neural network (CNN) 600 may include an input layer 610 , a convolutional/pooling layer 620 (where the pooling layer is optional), and a neural network layer 630 .

Convolutional layer/pooling layer 620:

Convolutional layer:

As shown in Figure 6, the convolutional/pooling layer 620 may include layers 621-626 as examples, for example: in one implementation, layer 621 is a convolutional layer, layer 622 is a pooling layer, and layer 623 is a convolutional layer Layer 624 is a pooling layer, 625 is a convolutional layer, and 626 is a pooling layer; in another implementation, 621 and 622 are convolutional layers, 623 are pooling layers, and 624 and 625 are convolutional layers. layer, 626 is the pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.

The following will take the convolutional layer 621 as an example to introduce the inner working principle of a convolutional layer.

The convolution layer 621 may include many convolution operators. The convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator is essentially Can be a weight matrix, which is usually pre-defined, usually one pixel by one pixel (or two pixels by two pixels) along the horizontal direction on the input image during the convolution operation on the image. ...It depends on the value of the stride step) to process, so as to complete the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix will be extended to Enter the entire depth of the image. Therefore, convolution with a single weight matrix will result in a single depth dimension of the convolutional output, but in most cases a single weight matrix is not used, but multiple weight matrices of the same size (row × column) are applied, That is, multiple isotype matrices. The output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image. Blur, etc. The multiple weight matrices have the same size (row×column), and the size of the feature maps extracted from the multiple weight matrices with the same size is also the same, and then the multiple extracted feature maps with the same size are combined to form a convolution operation. output.

The weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions .

When the convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (eg 621 ) often extracts more general features, which can also be called low-level features; with the convolutional neural network As the depth of the network 600 deepens, the features extracted by the later convolutional layers (eg 626 ) become more and more complex, such as features such as high-level semantics. Features with higher semantics are more suitable for the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolutional layer. In the layers 621-626 exemplified by 620 in Figure 6, it can be a convolutional layer followed by a layer. The pooling layer can also be a multi-layer convolutional layer followed by one or more pooling layers. During image processing, the only purpose of pooling layers is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling. The max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.

Neural network layer 630:

After being processed by the convolutional layer/pooling layer 620, the convolutional neural network 600 is not sufficient to output the required output information. Because as mentioned before, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network 600 needs to utilize the neural network layer 630 to generate one or a set of outputs of the desired number of classes. Therefore, the neural network layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and the output layer 640, and the parameters contained in the multiple hidden layers may be based on specific task types The relevant training data is pre-trained, for example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc...

After the multi-layer hidden layers in the neural network layer 630, that is, the last layer of the entire convolutional neural network 600 is the output layer 640, the output layer 640 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error, Once the forward propagation of the entire convolutional neural network 600 (as shown in Fig. 6, the propagation from the direction 610 to 640 is the forward propagation) is completed, the back propagation (as shown in Fig. 6, the propagation from the 640 to 610 direction is the back propagation) will Begin to update the weight values and biases of the aforementioned layers to reduce the loss of the convolutional neural network 600 and the error between the result output by the convolutional neural network 600 through the output layer and the ideal result.

It should be noted that the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.

The following describes a chip hardware structure provided by an embodiment of the present application.

FIG. 7a is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor (NPU) 70 . The chip can be set in the execution device 510 as shown in FIG. 5 b to complete the calculation work of the calculation module 511 . The chip can also be set in the training device 520 as shown in FIG. 5 b to complete the training work of the training device 520 and output the target model/rule 501 . The algorithms of each layer in the convolutional neural network shown in Figure 6 can be implemented in the chip shown in Figure 7a.

As shown in Figure 7a, the NPU 70 is mounted on the main central processing unit (CPU) (Host CPU) as a co-processor, and tasks are allocated by the Host CPU. The core part of the NPU is the operation circuit 70, and the controller 704 controls the operation circuit 703 to extract the data in the memory (weight memory or input memory) and perform operations.

In some implementations, the arithmetic circuit 703 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 703 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 703 fetches the data corresponding to the matrix B from the weight memory 702 and buffers it on each PE in the operation circuit. The operation circuit 703 takes the data of the matrix A from the input memory 701 and performs the matrix operation on the matrix B, and stores the partial result or the final result of the matrix in the accumulator 708 (accumulator).

The vector calculation unit 707 can further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. For example, the vector computing unit 707 can be used for network computation of non-convolutional/non-fully connected layers (FC) layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response Normalization (Local Response Normalization), etc.

In some implementations, the vector computation unit 707 can store the processed output vectors to the unified buffer 706 . For example, the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, eg, a vector of accumulated values, to generate activation values. In some implementations, vector computation unit 707 generates normalized values, merged values, or both. In some implementations, the vector of processed outputs can be used as activation input to the arithmetic circuit 703, eg, for use in subsequent layers in a neural network.

For example, the algorithms of each layer in the convolutional neural network shown in FIG. 6 can be executed by 703 or 707 . The algorithms of the calculation module 511 and the training device 520 in FIG. 5b can all be executed by 703 or 707 .

Unified memory 706 is used to store input data and output data.

The weight data directly transfers the input data in the external memory to the input memory 701 and/or the unified memory 706 through the storage unit access controller (direct memory access controller, DMAC) 705, and stores the weight data in the external memory into the weight memory 702, and storing the data in the unified memory 706 into the external memory.

A bus interface unit (BIU) 710 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 709 through the bus.

An instruction fetch buffer 709 connected to the controller 704 is used to store the instructions used by the controller 704 .

The controller 704 is used for invoking the instructions cached in the memory 709 to control the working process of the operation accelerator.

Exemplarily, the data here may be descriptive data, may be the input or output data of each layer in the convolutional neural network shown in FIG. Output Data.

Generally, the unified memory 706 , the input memory 701 , the weight memory 702 and the instruction fetch memory 709 are all on-chip memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access. Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.

Optionally, the program algorithms in FIG. 5b and FIG. 6 are jointly completed by the main CPU and the NPU.

Exemplarily, FIG. 7b illustrates the overall system framework of the solution provided by this application. As shown in Figure 7b, the framework includes two parts: offline registration and online detection.

Among them, in the offline registration part, the user inputs the three-dimensional model of the object and the real picture, and then trains the basic single-object pose detection network according to the three-dimensional model and the real picture to obtain the single-object pose detection network of the object, and the single-object pose detection network of the object. The pose detection network is used to detect the pose of the object contained in the image, and the detection accuracy is better than the basic single-object pose detection network. Further, according to the three-dimensional model and the real picture, the single-object pose detection network of the object that has been obtained can be used to extract the characteristics of the object for incremental object registration, and the characteristics of the object and the category of the object (which can be represented by identification) are carried out. Register to get a multi-object class classifier. A multi-object class classifier can be used to identify classes of recognizable objects included in an image. For the specific operation of the offline registration part, reference may be made to the specific implementation of the object registration method (for example, the object registration method illustrated in FIG. 8 ) provided in the following embodiments of the present application, which will not be repeated here.

In the online detection part, after the input image is acquired, the multi-feature fusion classification is performed using the multi-object category classifier obtained in the offline registration part to obtain the classification results (categories) of the identifiable objects included in the input image, and the objects obtained in the offline registration part are used. The single-object pose detection network is used to obtain the pose of the recognizable object included in the input image; according to the pose of the recognizable object in the input image, the virtual content corresponding to the category of the recognizable object is presented. For the specific operation of the online detection part, reference may be made to the specific implementation of the display method (eg, the display method illustrated in FIG. 17 ) provided in the following embodiments of the present application, which will not be repeated here.

Application Embodiment 1 provides an object registration method for registering a first object, where the first object is any object to be identified. The registration process of each object is the same. Embodiment 1 of the present application is described by taking the registration of the first object as an example, and the others will not be described in detail.

The object registration method provided in the first embodiment of the present application may be executed by the execution device 510 as shown in FIG. 5b, and the real image in the object registration method may be the input data given by the client device 540 as shown in FIG. 5b, so The computing module 511 in the executing device 510 may be used to execute the S801 to S803.

Optionally, the object registration method provided in Embodiment 1 of the present application may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used without the use of the GPU. limit.

As shown in FIG. 8 , the object registration method provided in Embodiment 1 of the present application may include:

S801. Acquire a plurality of first input images including a first object.

Wherein, the plurality of first input images include real images of the first object and/or a plurality of first composite images of the first object.

In a possible implementation manner, the first input image may include multiple real images of the first object.

Since the number of real images of the object input by the user is limited, when performing S802 to extract feature information using the real image of the first object, there is a deficiency that the visual features of the object to be identified under different illumination, different angles and distances cannot be fully characterized. On the basis of real images, synthetic images are added to further increase the effectiveness of feature extraction. Therefore, the first input image may include the real image of the first object as well as the first composite image. The first composite image is obtained by differentiable rendering, which can reduce the difference between the first composite image and the real image.

In another possible implementation manner, the first input image may include multiple first composite images of the first object. The multiple first composite images may be obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses. The plurality of first poses are different.

The multiple first poses being different means that the multiple first poses correspond to different shooting angles of the camera.

Exemplarily, the plurality of first poses can be taken on a spherical surface. The multiple first poses may be multiple poses obtained by uniform sampling on the spherical surface shown in FIG. 9 , and the density of the sampling poses on the spherical surface is not limited in this embodiment of the present application, and may be selected according to actual needs. Of course, the first pose may also be multiple different poses input by the user, which is not limited.

Specifically, in the case where the first input image includes the first composite image in S801, a differentiable renderer (also referred to as a differentiable rendering engine may be used according to the three-dimensional model of the first object with texture information input by the user) or a differentiable rendering network), synthesizing the 2D images of the 3D model (first composite images) obtained by the camera in multiple first poses.

Exemplarily, in the case where the first input image includes the first composite image in S801, as shown in the specific process of the object registration method shown in FIG. 10, S801 may specifically include S801a and S801b.

S801a. Perform differentiable rendering on the three-dimensional model of the first object to obtain a plurality of first composite images.

S801b, obtaining a first input image according to the first composite image.

The first input image obtained in S801b may include a real image of the first object and a plurality of first composite images, or the first input image obtained in S801b may only include a plurality of first composite images.

It should be noted that the process of differentiable rendering has been described in detail in the foregoing content, and will not be repeated here.

S802. Extract feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.

The feature information may include descriptors of local feature points and descriptors of global features.

Descriptors are used to describe features and are a data structure that describes features. The dimension of a descriptor can be multi-dimensional. Descriptors may be of multiple types, such as SIFT, SURF, MSER, etc. The embodiments of the present application do not limit the types of descriptors. Descriptors can be in the form of multidimensional vectors.

For example, the descriptor of a local feature point may be a high-dimensional vector representing the local image information of the feature point; the descriptor of the global feature may be a high-dimensional vector representing the image information of the entire image or a larger area.

It should be noted that the definitions and extraction methods of local feature points and global features have been described in the foregoing content and will not be repeated here.

Further optionally, the feature information may also include the location of the local feature point.

In a possible implementation manner, in S802, feature information may be extracted from all regions in each first input image.

In another possible implementation manner, in S802, the region of the first object may be determined in each first input image, and then feature information is extracted in the region of the first object.

The area of the first object may be an area of the first input image that only includes the first object, or the area of the first object may be an area of the first input image that includes the first object, and the area is included in the first input image. within the image.

Exemplarily, each first input image may be input into the first network in sections, and the region of the first object in each first input image may be determined according to the output of the first network. The first network is used for recognizing the pose of the first object in the image, or the first network is used for extracting a black and white image of the first object in the image.

Optionally, determining the region of the first object in the first input image, and then extracting feature information in the region of the first object, may include but are not limited to the following two possible implementations:

In the first implementation manner, the first network is used to recognize the pose of the first object in the image. In S802, a plurality of first input images can be input into the first network respectively for pose recognition, and each first input image can be obtained. According to the obtained pose of the first object, project the three-dimensional model of the first object to each first input image respectively, and obtain the projection area in each first input image (as the first area of the object); feature information is extracted from the projection area in each first input image respectively.

In the second implementation manner, the first network is used to extract the black-and-white image of the first object in the image. In S802, a plurality of first input images can be input into the first network respectively for pose recognition, and each first input image can be obtained. The black-and-white image of the first object (as the area of the first object) in each of the first input images; the feature information is extracted from the black-and-white image of the first object in each of the first input images.

Exemplarily, in the region of the first object in each first input image obtained in S802, the pixel positions of the visually significant feature points and the corresponding descriptors can be extracted as the feature information of the local feature points. The sub is a multi-dimensional vector; in the area of the first object in each second input image obtained in S802, the feature information of all visual feature points is extracted, and a multi-dimensional vector is output as the feature information of the global feature.

It should be noted that the algorithm for extracting features is not limited in the embodiments of the present application, and can be selected according to actual needs.

Exemplarily, in the image area shown in FIG. 11 , according to the result obtained by inputting the first input image to the first network, it can be determined that the area where the first object is located in the first input image is shown by the dashed bounding box in FIG. 11 . Area. Feature information of visually significant feature points can be extracted in the area shown by the dashed bounding box shown in FIG. 11 , and feature information of all visual feature points can be extracted from the area shown by the dashed bounding box shown in FIG. 11 . .

S803 . The feature information extracted from each first input image is corresponding to the identifier of the first object, and the registration of the first object is performed.

Specifically, in S803, the feature information of the first object and the identifier of the first object are recorded correspondingly to complete the registration of the first object, and the registration content of multiple objects may be referred to as a multi-object category classifier. The multi-object category classifier records the correspondence between the features and identifiers of different objects. In practical applications, by extracting the feature information of the objects in the image and comparing with the features recorded in the multi-object category classifier to determine the The identification of the object completes the identification of the object in the image.

Wherein, the identifier of the first object may be used to indicate the category of the first object. The embodiment of the present application does not limit the form of the identification.

Specifically, in S803, the feature information of the first object in each of the first input images extracted in S802, and the identifier of the first object may be stored according to a certain structure to complete the registration of the first object.

The feature information and identification of multiple objects stored in the structure are called multi-object category classifiers for efficient search.

Through the object registration method provided in this application, the features of objects in the image are extracted for object registration, and the registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time will not be increased too much. , which also guarantees the detection performance of registered objects.

It should be noted that the above-mentioned method for object registration can obtain the first network offline and complete the registration of the object. When applied online, the pose of the identifiable object can be obtained through the first network, and determined according to the characteristics of the identifiable object. The registered logo of the object is then displayed according to the pose of the identifiable object, and the virtual content corresponding to the logo of the identifiable object is displayed, thereby completing the presentation effect of the virtual and real results.

The second embodiment of the present application provides an optimization method, which can be used in combination with the object registration method shown in FIG. 8 and the method for training an object pose detection network provided in the third embodiment of the present application, or can be used independently. The usage scenarios thereof are not limited in the embodiments of the present application. The optimization method provided in the second embodiment of the present application is shown in Figure 12, including:

S1201. Input N real images of the first object into the first network respectively, and obtain the second pose of the first object in each real image output by the first network.

where N is greater than or equal to 1. The first network is used to identify the pose of the first object in the image.

S1202. According to the three-dimensional model of the first object, using a differentiable renderer under each second pose, obtain N second composite images by differentiable rendering.

Wherein, a differentiable renderer is used in S1202, and under each second pose obtained in S1201, N second composite images are obtained by differentiable rendering.

Specifically, acquiring a real image of a second pose corresponds to a second composite image rendered by a second pose.

S1203 , respectively intercepting a region at the same position of the first object in the second composite image corresponding to the real object in each real image, as the foreground image of each real image.

Specifically, in S1203, the foreground image of each real image input to the first network in S1201 is intercepted.

It should be understood that the area images at the same position may refer to area images with the same coordinates based on a certain point of the first object in the second composite image. In other words, the area image at the same position may refer to a projection area image in which the black and white image of the second synthetic image is projected to the real image based on a certain point of the first object in the second synthetic image.

In a possible implementation manner, the black-and-white image of the first object in the second composite image may be projected into the real image corresponding to the second composite image, and the projection area is used as the foreground image of the real image.

In another possible implementation manner, a black-and-white image of the first object in the second composite image may be obtained first, and the black-and-white image is a binarized image, and the black-and-white image of the first object in the second composite image is combined with the first object in the second composite image. The real images corresponding to the two synthetic images are multiplied, and the reserved area obtained after the multiplication is used as the foreground image of the real image.

S1204: Construct a first loss function according to the first difference information between the foreground images of the N real images and their corresponding second composite images.

The first difference information is used to indicate the difference between the foreground image and its corresponding second composite image.

Optionally, the first difference information may include one or more items of feature map difference, pixel color difference, and difference between extracted feature descriptors.

Among them, the difference between the feature maps of the two images can be the perceptual loss, that is, the difference between the feature maps encoded by the deep learning network. Specifically, the deep learning network can use a pre-trained visual geometry group (VGG)16 network. For a given image, the feature map encoded by the VGG16 network is a C*H*W tensor, and C is The number of channels of the feature map, H and W are the length and width of the feature map. The difference between the feature maps is the distance between the two tensors, and specifically, it can be the L1 norm, the L2 norm or other norm of the difference between the tensors.

The pixel color difference may be: the numerical value of the pixel color calculates the difference, and specifically may be the L2 norm of the difference between the pixels of the two images.

The difference between the extracted feature descriptors: refers to the distance between the vectors representing the descriptors. For example, the distance may include, but is not limited to, Euclidean distance, Hamming distance, and the like. Specifically, the descriptor can be an N-dimensional floating-point vector, the distance is the Euclidean distance of two N-dimensional vectors, and the L2 norm of the difference between the two N-dimensional vectors; or the descriptor can be an M-dimensional binary vector, then the distance is the L1 norm of the difference between two vectors.

The L2 norm referred to above is the sum of the squares of each element of the pointing quantity and then the square root, and the L1 norm is the sum of the absolute values of each element in the pointing quantity.

It should be noted that, if N is greater than 1, the first difference information when constructing the first loss function in S1204 may be the calculated value of the difference information between the multiple foreground images and their corresponding second composite images.

The second composite image corresponding to the foreground image refers to the real image to which the foreground image belongs, and the second composite image is obtained by differentiable rendering through the second pose obtained by the first network.

Exemplarily, for the real image I (one or more) of the first object input by the user, the first network is used to detect the 6DOF pose of the first object in the real image. Then, according to the initial three-dimensional model input by the user and the corresponding texture and illumination information, the differentiable renderer obtains the second composite image R (with the real image I) based on the 6DOF pose of the first object in the detected real image by differentiable rendering. the same number). According to each second composite image R, obtain the mask of the first object in each R (the same number as the real image I), and intercept the foreground image F of the first object in the real image I based on the mask (one mask is used to intercept The corresponding foreground image of the real image), the first loss function Loss ₁ of F and R is constructed as follows: Loss ₁ =L _p +L _i +L _f .

Among them, L _p is the calculated value of the difference between the feature maps of each foreground image and its corresponding second composite image, L _i is the calculated value of the pixel color difference between each foreground image and its corresponding second composite image, L _f is the calculated value of the difference between the feature descriptors of each foreground image and its corresponding second composite image.

It should be noted that there can be one or more feature descriptors extracted from an image. When there are multiple feature descriptors extracted from an image, the difference between the feature descriptors of the two images can be the difference of the feature descriptors of the same position. poor calculated value.

S1205. Update the differentiable renderer according to the first loss function, so that the synthesized image output by the differentiable renderer approximates the real image of the object.

Exemplarily, in S1205, the texture, lighting or other parameters in the differentiable renderer can be updated according to the first loss function constructed in S1204, so as to achieve the smallest difference between the synthetic image output by the differentiable renderer and the real image. change.

Specifically, S1205 can adjust and update textures, lighting or other parameters in the differentiable renderer according to preset rules, and then repeat the optimization process shown in FIG. 12 until the synthetic image output by the differentiable renderer is the same as the real image. The difference is minimal.

The preset rule may be that when the preconfigured first loss function satisfies different conditions, different parameters and adjustment values are correspondingly adjusted. After the first loss function is determined in S1204, according to the satisfied conditions, the corresponding adjustment parameters and adjustment values are adjusted according to the conditions. make adjustments.

Exemplarily, the optimization process shown in FIG. 12 can also be shown in FIG. 13 , the real image of the object is input into the first network, the second pose output by the first network is input into the differentiable renderer, and the differentiable renderer is based on the object's The three-dimensional model, the second composite image and the black-and-white image of the object in the second composite image are output by differentiable rendering, a first loss function is constructed according to the second composite image and the real image, and the differentiable renderer is updated by using the first loss function.

With the optimization method provided in this application, a differentiable renderer (also called a differentiable rendering engine or a differentiable rendering network) can be optimized, the rendering authenticity of the differentiable renderer can be improved, and the composite image obtained by differentiable rendering can be reduced. difference from the real image.

The optimization method provided in the second embodiment of the present application may be specifically performed by the training device 520 as shown in FIG. 5b, and the second composite image in the optimization method may be the training data maintained in the database 530 as shown in FIG. 5b. Optionally, some or all of S1201 to S1203 in the optimization method provided in the second embodiment may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520, that is, the database The training data received or acquired in 530 is preprocessed, as described in the process of S1201 to S1203, to obtain the foreground image and the second composite image, as the input of the training device 520, and the training device 520 executes S1204 to S1204 to S1205.

Optionally, the optimization method provided in the second embodiment of the present application may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used without the use of the GPU, which is not limited in this application. .

Exemplarily, before the object registration method provided by the present application obtains the composite image, the optimization method shown in FIG. 12 may be used to optimize the differentiable renderer to reduce the difference between the composite image obtained by differentiable rendering and the real image. Figure 14 illustrates another object registration method, which may include:

S1401. Optimize the differentiable renderer.

The specific operation of S1401 may refer to the process illustrated in FIG. 12 , which will not be repeated here.

S1402. Input the three-dimensional model of the first object into the optimized differentiable renderer to perform differentiable rendering to obtain a first composite image.

S1403. Obtain a first input image according to the first composite image.

The specific implementation of S1403 may refer to S801b, which will not be repeated here.

S1404. Extract feature information of each first input image respectively.

The specific implementation of S1404 refers to the aforementioned S802, which will not be repeated here.

S1405. Register the feature information extracted from each first input image.

The specific implementation of S1405 refers to the aforementioned S803, which will not be repeated here.

The third embodiment of the present application provides a method for training an object pose detection network, which can train the aforementioned first network to improve the accuracy of the pose of the first object output by the first network. Exemplarily, the method for training the object pose detection network can be used in combination with the object registration method shown in FIG. 8 and the optimization method provided in the second embodiment of the present application, or can be used independently, and the usage scenarios of the method are not different from the embodiments of the present application. be limited.

Specifically, the third embodiment of the present application provides a method for training an object pose detection network, which further optimizes the prediction ability of the object pose detection network for the object pose in the real image by using the real image and the synthesized image of the first object (pan chemical).

The method for training an object pose detection network provided in Embodiment 3 of the present application is shown in FIG. 15 , including:

S1501. Acquire a plurality of second input images including the first object.

Wherein, the plurality of second input images include real images of the first object and/or a plurality of third composite images of the first object.

In a possible implementation manner, the second input image may include multiple real images of the first object.

In another possible implementation manner, the second input image may include multiple real images of the first object and multiple third composite images of the first object. The multiple third composite images are obtained by rendering the three-dimensional model of the first object in multiple third poses. The multiple third poses are different.

Exemplarily, the plurality of third poses can be taken on a spherical surface. The multiple third poses may be multiple poses obtained by uniform sampling on the spherical surface as shown in FIG. 9 , and the density of the sampling poses on the spherical surface is not limited in this embodiment of the present application, and may be selected according to actual needs.

In another possible implementation manner, the second input image may include multiple third composite images of the first object.

Specifically, in this S1501, the first object can be synthesized in multiple third poses by conventional rendering, or differentiable rendering, or other rendering methods according to the three-dimensional model with texture information of the first object input by the user 2D images at different angles, recorded as the third composite image.

S1502 , inputting the plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network.

Wherein, the second network can be used to identify the pose of the first object in the image.

In a possible implementation manner, the second network may be a basic single-object pose detection network. The basic single-object pose detection network is a general-purpose neural network that only recognizes a single object in an image, and is the initial model of the configuration.

Specifically, the input of the basic single-object pose detection network is an image, and the output is the pose of a single object recognizable by the network in the image. Further optionally, the output of the basic single-object pose detection network may also include a black-and-white image (mask) of a single object recognizable by the network.

Among them, the mask of an object refers to the black and white version of the area in the image that contains only that object.

In another possible implementation manner, the second network may be a neural network after training the basic single-object pose detection network according to the image of the first object.

Exemplarily, since the third composite image is generated by rendering under the known third pose, the pose of the first object relative to the camera in each third composite image is known, and each third composite image can be The three composite images are input to the basic single-object pose detection network, and the predicted pose of the first object in each third composite image is obtained; then, according to the predicted pose, the actual position of the first object in each third composite image is compared pose, calculate the loss in the iterative process of the basic single-object pose detection network, construct a loss function, and then according to the loss function, train the basic single-object pose detection network until convergence, as the second network. It should be noted that the training process of the neural network will not be described in detail in this application.

In another possible implementation manner, the second network may be the currently used first network.

S1503. According to the three-dimensional model of the first object, microrendering obtains a fourth composite image of the first object in each fourth pose.

Wherein, the second input image obtained from a fourth pose corresponds to a fourth composite image rendered by the fourth pose.

S1504. Construct a second loss function according to the second difference information between each fourth composite image and its corresponding second input image.

The second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.

Exemplarily, the second difference information includes one or more of the following: the intersection ratio ( The difference between the intersection over union, IOU), the difference between the fourth pose obtained from the fourth composite image and the pose obtained by the fourth composite image through the first network, the difference between the fourth composite image and its corresponding second input image and the fourth composite image The similarity of the region images in the same position of the first object in the composite image.

Among them, the difference between the IOUs of the two black and white images: refers to the ratio of the area of the intersection of the two masks to the area of the union.

The difference between the two poses: refers to the difference between the mathematical expressions of the two poses. Specifically, the difference is the sum of the difference between translation and rotation. Given two poses R ₁ , T ₁ and R ₂ , T ₂ , the difference in translation is the Euclidean distance of the two vectors T ₁ and T ₂ ; the difference in rotation is

where Tr represents the trace of the matrix, arccos is the inverse cosine function, and R ₁ ^T is the transpose of R ₁ .

The similarity of the two images may be the difference in pixel color or the difference in feature maps of the two images.

The region image of the first object in the second input image can be obtained by cropping the second input image according to the black and white image of the first object in the corresponding fourth composite image. For example, the black-and-white image of the first object in the fourth composite image can be a binarized image, and the binarized image is multiplied by the corresponding second input image, and then the region of the first object in the second input image can be obtained by cropping image.

Optionally, when the number of the second input images is M (M is greater than or equal to 2), then there are also M fourth composite images, and the second difference information between the fourth composite image and its corresponding second input image is also There are M, and _Li can be the calculated value of the second difference information between the fourth composite image and its corresponding second input image.

Specifically, in S1504, the difference between the poses of the second network on different second input images is calculated and constructed as a second loss function, so that the generalization of the second network on real images can be improved by optimizing these differences .

In a possible implementation, the second loss function Loss ₂ satisfies the following expression:

Wherein, X is greater than or equal to 1; λ _i is a weight value, and L _i is a calculated value used to represent the second difference information between the fourth composite image and its corresponding second input image. X is less than or equal to the number of kinds of the second difference information.

In a possible implementation manner, the first loss function Loss ₂ may be Loss ₂ =λ ₁ L ₁ +λ ₂ L ₂ +λ ₃ L ₃ .

Wherein, L ₁ is the calculated value of the difference between the black and white image of the first object in the second input image output by S1502 and the IoU of the black and white image of the first object in the fourth composite image obtained in S1503; L ₂ is the output value of S1502 Calculated value of the difference between the pose of the first object in the second input image and the pose of the first object obtained by using the second network to detect the fourth composite image _; L3 is the fourth composite image and its corresponding second input The calculated value of the visual perception similarity of the part of the image in the same area as the first object in the fourth composite image.

It should be understood that the preset weight λ _i when constructing the second loss function may be configured according to actual requirements, which is not limited in this embodiment of the present application.

S1505. Update the second network according to the second loss function to obtain the first network.

The difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the pose of the first object in the image recognized by the second network and the pose of the first object in the image. Describe the real pose difference of the first object.

It should be noted that, for the process of updating the second network according to the second loss function in S1505 to obtain the first network, reference may be made to the training process of the neural network, which is not repeated in this embodiment of the present application.

Exemplarily, the process of the method for training the object pose detection network shown in FIG. 15 can also be shown in FIG. 16 , the second input image composed of the unlabeled real image of the object and the third synthetic image is input to the second network, and the output is Fourth pose. The fourth pose and the 3D model of the object are input to the differentiable renderer, and the fourth composite image and the corresponding mask are output. According to the result of passing the fourth synthetic image, the corresponding mask, and the second input image through the second network, a second loss function is constructed, and the second loss function is used to train the second network to obtain the first network.

Through the method for training an object pose detection network provided by this application, a single object pose detection network is used, even if a new recognizable object is added, the training time is short, and the recognition effect of other recognizable objects will not be affected; The difference between real images and differentiable-renderable synthetic images in multiple poses constructs a loss function, which improves the accuracy of the single-object pose detection network.

The method for training the object pose detection network provided in the third embodiment of the present application may be specifically performed by the training device 520 as shown in FIG. 5b, and the second composite image in the optimization method may be maintained in the database 530 as shown in FIG. 5b. training data. Optionally, some or all of S1501 to S1503 in the method for training an object pose detection network provided in Embodiment 3 may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520, that is, First, preprocess the training data received or acquired from the database 530, as described in the process of S1501 to S1503, to obtain the second input image and the fourth composite image, which are used as the input of the training device 520, and are sent by The training device 520 performs S1504 to S1505.

The fourth embodiment of the present application further provides a display method, which is applied to a terminal device. The display method may be used in combination with the aforementioned object registration method, optimization method, and method for training an object pose detection network, or may be used alone, which is not specifically limited in this embodiment of the present application.

As shown in FIG. 17 , the display method provided by the embodiment of the present application may include:

S1701. A terminal device acquires a first image.

The first image refers to any image acquired by the terminal device.

In a possible implementation manner, the terminal device may acquire an image in the viewfinder by using a camera to capture the image as the first image.

Exemplarily, the terminal device may be a mobile phone, the user may start an APP in the mobile phone, and input an image acquisition instruction on the APP interface, and the mobile phone starts a camera to capture the first image.

Exemplarily, the terminal device may be smart glasses. After wearing the smart glasses, the user captures an image in the field of view through a viewfinder as the first image.

In another possible implementation manner, the terminal device may load a locally stored image as the first image.

Of course, in this embodiment of the present application, the manner in which the terminal acquires the first image in S1701 is not limited.

S1702. The terminal device determines whether a recognizable object is included in the first image.

Specifically, the terminal determines whether the first image contains an identifiable object according to a network for identifying objects configured offline.

Optionally, the network for identifying objects may be the current pose detection network, or may be the first network or the second network described in the foregoing embodiments.

In a possible implementation manner, in S1702, the terminal device may determine whether the first image contains an identifiable object according to the pose and the network currently used to identify the object.

Exemplarily, the terminal device may input the first image into a network for recognizing poses and identifiers of objects, and if the network outputs poses and identifiers of one or more objects, it is determined that the first image includes recognizable objects. , otherwise, no identifiable objects are included in the first image.

In another possible implementation manner, the terminal device in S1702 may determine whether the first image contains a recognizable object according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by registering the object.

Exemplarily, the terminal device may extract the features of the first image, and perform matching with the features in the multi-object category classifier described in the foregoing embodiments of the present application. If there are matching features, it is determined that the first image includes recognizable objects. , otherwise, no identifiable objects are included in the first image.

Exemplarily, in S1702, feature information in the first image can be extracted, and the feature information is used to indicate the identifiable features in the first image; Set the feature information of the condition; if there is feature information in the feature library whose matching distance with the feature information satisfies the preset condition, it is determined that one or more identifiable objects are included in the first image; if there is no match with the feature information in the feature library The distance satisfies the feature information of the preset condition, it is determined that the first image does not include any identifiable object. Among them, the feature library stores one or more feature information of different objects.

In a possible implementation manner, the preset condition may include less than or equal to a preset threshold. The value of the preset threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.

Optionally, if the terminal device determines in S1702 that the first image includes one or more identifiable objects, S1703 and S1704 are performed. If the terminal device determines in S1702 that no identifiable object is included in the first image, S1705 is executed.

S1703. Output first information, where the first information is used to prompt detection that the first image includes an identifiable object.

Optionally, the first information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.

Exemplarily, the content of the first information may be "a recognizable object has been detected, please keep the current angle", and the content of the first information may be superimposed and displayed on the first image in the form of text, or the content of the first information may be displayed on the first image. The content can be played through the speakers of the terminal device.

S1704: Obtain the pose of each recognizable object in the first image through the first network corresponding to each recognizable object included in the first image; and display the corresponding pose of each recognizable object according to the pose of each recognizable object virtual content.

The virtual content corresponding to the object may be configured according to actual requirements, which is not limited in this embodiment of the present application. For example, it can be the introduction information of the exhibition item, or it can also be the attribute information of the product, or others.

S1705. Output second information, where the second information is used to prompt that no identifiable object is detected, and adjust the viewing angle to acquire the second image.

Wherein, the second image is different from the first image.

Optionally, the second information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.

Exemplarily, the content of the second information may be "No identifiable object is detected, please adjust the angle of the device", the content of the second information may be superimposed and displayed on the first image in the form of text, or, the content of the second information may be displayed on the first image. The content can be played through the speakers of the terminal device.

Through the display method provided by the present application, a prompt is output to the user whether the image includes a identifiable object, so that the user can intuitively obtain whether the image includes an identifiable object, and the user experience is improved.

Exemplarily, in S1702, the terminal device determines whether the first image contains recognizable objects according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by registering the object, as shown in FIG. 18 . As shown, the process can include:

S1801. The terminal device extracts local feature points in the first image.

It should be noted that the embodiments of the present application do not limit the specific scheme for extracting local feature points in S1801, but the method for extracting local feature points by the terminal device in S1801 needs to be consistent with that in S803 to ensure the accuracy of the scheme.

S1802. The terminal device acquires one or more first local feature points in the first image.

The matching distance between the descriptor of the first local feature point and the descriptor of the local feature point in the feature library is less than or equal to the first threshold. The value of the first threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.

The matching distance between descriptors refers to the distance between vectors representing the descriptors. For example, Euclidean distance, Hamming distance, etc.

Specifically, descriptors of local feature points of different objects are stored in the feature library. For example, the feature library may be the multi-object category classifier described in the foregoing embodiments.

S1803. The terminal device determines one or more ROIs in the first image according to the first local feature point.

Among them, one ROI includes one object.

Specifically, the terminal device can combine the color information and depth information of the image to classify the first local feature points into different objects, and determine the area of the first local feature points of the same object that are centrally distributed as an ROI to obtain a or multiple ROIs.

S1804, the terminal device extracts global features in each ROI.

It should be noted that the embodiment of the present application does not limit the specific scheme for extracting global features in S1804, but the method for extracting global features by the terminal device in S1804 needs to be consistent with that in S803 to ensure the accuracy of the scheme.

S1805. The terminal device determines whether the first image contains an identifiable object according to the global feature in each ROI.

Exemplarily, if there are one or more first global features in the global features in each ROI, it is determined that the first image includes an identifiable object corresponding to the first global features. The matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold. Descriptors of global features of different objects are also stored in the feature library. If the first global feature does not exist in the global features in each ROI, it is determined that the first image does not include any identifiable object.

The value of the second threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.

The display method provided by the present application will be described below through specific examples.

Suppose a merchant provides an APP, which is used for users to capture images through mobile phone framing, identify the products contained in the images, and display the product introduction corresponding to the products. The three-dimensional objects of each commodity are registered offline according to the solutions provided in the embodiments of the present application. When a user is shopping at the merchant, the scene captured by the APP in the mobile phone is shown in Figure 1 as shown in the mobile phone interface shown in Figure 19. The APP uses the method shown in Figure 18 to extract the features in the scene Figure 1, identify The scene shown in Figure 1 contains identifiable objects (smart speakers), and outputs the mobile phone interface as shown in Figure 20a, "A identifiable object has been detected, please keep the current angle." Next, the APP calls the second network corresponding to the smart speaker to obtain the 6DOF pose of the smart speaker in Figure 1 of the scene, and then displays the virtual content corresponding to the smart speaker according to the 6DOF pose "Hello everyone, I'm Xiaoyi, can I A Huawei smart speaker that can listen to songs, tell stories, tell jokes, and understand countless encyclopedic knowledge."

The foregoing mainly introduces the solutions provided by the embodiments of the present invention from the perspective of the working principle of the device. It can be understood that, in order to realize the above-mentioned functions, the electronic device or the like includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should easily realize that the present invention can be implemented in hardware or a combination of hardware and computer software in conjunction with the units and algorithm steps of each example described in the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.

In this embodiment of the present invention, functional modules may be divided into apparatuses that execute the method of the present application according to the foregoing method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. middle. The above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiment of the present invention is schematic, and is only a logical function division, and there may be other division manners in actual implementation.

In the case where each functional module is divided according to each function, FIG. 21 illustrates an object registration apparatus 210 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned first embodiment. As shown in FIG. 21 , the object registration apparatus 210 may include: a first acquisition unit 2101 , an extraction unit 2102 and a registration unit 2103 . The first acquiring unit 2101 is configured to execute the process S801 in FIG. 8 , or S801a and S801b in FIG. 10 , or S1402 and S1403 in FIG. 14 ; the extracting unit 2102 is configured to execute the process S802 in FIG. 8 or FIG. 10 , or S1404 in FIG. 14 ; the registration unit 2103 is configured to execute the process S803 in FIG. 8 or FIG. 10 , or S1405 in FIG. 14 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.

In the case where each functional module is divided according to each function, FIG. 22 illustrates an optimization apparatus provided by an embodiment of the present application, which is used to implement the functions in the foregoing second embodiment. As shown in FIG. 22 , the optimization apparatus 220 may include: a processing unit 2201 , a differentiable renderer 2202 , a screenshot unit 2203 , a construction unit 2204 and an update unit 2205 . The processing unit 2201 is used to execute the process S1201 in FIG. 12 ; the differentiable renderer 2202 is used to execute the process S1202 in FIG. 12 ; the screenshot unit 2203 is used to execute the process S1203 in FIG. 12 ; process S1204; the updating unit 2205 is used to execute the process S1205 in FIG. 12 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.

In the case where each functional module is divided according to each function, FIG. 23 illustrates an apparatus 230 for training an object pose detection network provided by an embodiment of the present application, which is used to implement the functions in the third embodiment. As shown in FIG. 23 , the apparatus 230 for training an object pose detection network may include: a second acquisition unit 2301 , a processing unit 2302 , a differentiable renderer 2303 , a construction unit 2304 and an update unit 2305 . The second acquisition unit 2301 is used to execute the process S1501 in FIG. 15; the processing unit 2302 is used to execute the process S1502 in FIG. 15; the differentiable renderer 2303 is used to execute the process S1503 in FIG. 15; the construction unit 2304 is used to execute the process S1503 of FIG. The process S1504 in 15; the updating unit 2305 is used to execute the process S1505 in FIG. 15 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.

In the case where each functional module is divided corresponding to each function, FIG. 24 illustrates a display device 240 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned fourth embodiment. As shown in FIG. 24 , the display device 240 may include: a first acquisition unit 2401 , an output unit 1402 and a processing unit 1403 . The first acquisition unit 2401 is used for executing the process S1701 in FIG. 17 ; the output unit 1402 is used for executing the process S1703 or S1705 in FIG. 17 ; the processing unit 1403 is used for executing the process S1704 in FIG. 17 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.

FIG. 25 provides a schematic diagram of the hardware structure of an apparatus 250 . The apparatus 250 may be an object registration apparatus provided in this embodiment of the present application, and is configured to execute the object registration method provided in the first embodiment of this application. Alternatively, the apparatus 250 may be the optimization apparatus provided in this embodiment of the present application, and is configured to execute the optimization method provided in the second embodiment of the present application. Alternatively, the apparatus 250 may be an apparatus for training an object pose detection network provided in this embodiment of the present application, and configured to execute the method for training an object pose detection network provided in the third embodiment of the present application. Alternatively, the device 250 may be the display device provided in this embodiment of the present application, and configured to execute the display method provided by the fourth embodiment of the present application.

As shown in FIG. 25 , the apparatus 250 (the apparatus 250 may specifically be a computer device) may include a memory 2501 , a processor 2502 , a communication interface 2503 and a bus 2504 . The memory 2501 , the processor 2502 , and the communication interface 2503 are connected to each other through the bus 2504 for communication.

The memory 2501 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 2501 can store programs, and when the programs stored in the memory 2501 are executed by the processor 2502, the processor 2502 and the communication interface 2503 are used to execute various steps of the methods provided in Embodiments 1 to 4 of this application.

The processor 2502 can be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processor (graphics processing unit, GPU), or one or more An integrated circuit for executing a relevant program to implement the functions required to be performed by the object registration device, the optimization device, the device for training the object pose detection network, and the unit in the display device according to the embodiments of the present application, or to execute the method embodiments of the present application. The method provided in any one of Embodiments 1 to 4.

The processor 2502 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the method for training an object pose detection network of the present application may be completed by an integrated logic circuit of hardware in the processor 2502 or instructions in the form of software. The above-mentioned processor 2502 can also be a general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices. , discrete gate or transistor logic devices, discrete hardware components. The methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art. The storage medium is located in the memory 2501, and the processor 2502 reads the information in the memory 2501, and combines its hardware to complete the object registration device, the optimization device, the device for training the object pose detection network, and the unit included in the display device of the embodiment of the present application. function to be executed, or execute any one of the first to fourth embodiments of the method of the present application.

The communication interface 2503 uses a transceiver device such as, but not limited to, a transceiver to implement communication between the device 250 and other devices or a communication network. For example, training data (such as real images and synthetic images described in the method embodiments of the present application) can be acquired through the communication interface 2503 .

Bus 2504 may include a pathway for communicating information between various components of device 250 (eg, memory 2501, processor 2502, communication interface 2503).

It should be understood that the first acquisition unit 2101 , the extraction unit 2102 and the registration unit 2103 in the object registration apparatus 210 are equivalent to the processor 2502 in the apparatus 250 . The processing unit 2201 , the differentiable renderer 2202 , the screenshot unit 2203 , the constructing unit 2204 and the updating unit 2205 in the optimization device 220 are equivalent to the processor 2502 in the device 250 . The second acquiring unit 2301 , the processing unit 2302 , the differentiable renderer 2303 , the constructing unit 2304 and the updating unit 2305 in the apparatus 230 for training the object pose detection network are equivalent to the processor 2502 in the apparatus 250 . The first acquiring unit 2401 and the processing unit 1403 in the display device 240 are equivalent to the processor 2502 in the device 250 , and the output unit 1402 is equivalent to the communication interface 2503 in the device 250 .

It should be noted that although the apparatus 250 shown in FIG. 25 only shows a memory, a processor, and a communication interface, in the specific implementation process, those skilled in the art should understand that the apparatus 250 also includes other devices necessary for normal operation . Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 250 may further include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 250 may only include the necessary devices for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 25 .

It can be understood that the apparatus 250 is equivalent to the training device 520 or the execution device 510 in FIG. 5b. Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are executed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer program or instructions may be downloaded from a website site, computer, A server or data center transmits by wire or wireless to another website site, computer, server or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server, data center, or the like that integrates one or more available media. The usable medium can be a magnetic medium, such as a floppy disk, a hard disk, and a magnetic tape; it can also be an optical medium, such as a digital video disc (DVD); it can also be a semiconductor medium, such as a solid state drive (solid state drive). , SSD).

In the various embodiments of the present application, if there is no special description or logical conflict, the terms and/or descriptions between different embodiments are consistent and can be referred to each other, and the technical features in different embodiments are based on their inherent Logical relationships can be combined to form new embodiments.

In this application, "at least one" means one or more, and "plurality" means two or more. "And/or", which describes the relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, it can indicate that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural. In the text description of this application, the character "/" generally indicates that the related objects are a kind of "or" relationship; in the formula of this application, the character "/" indicates that the related objects are a kind of "division" Relationship.

It can be understood that, the various numbers and numbers involved in the embodiments of the present application are only for the convenience of description, and are not used to limit the scope of the embodiments of the present application. The size of the sequence numbers of the above processes does not imply the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic.

Claims

An object registration method, characterized in that the method comprises:

acquiring a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first input images The first composite image is obtained by differentiable rendering of the three-dimensional model of the first object under multiple first poses; the multiple first poses are different;

Extracting feature information of a plurality of the first input images respectively, the feature information is used to indicate the features of the first object in the first input image where it is located;

The feature information extracted from each of the first input images is corresponding to the identifier of the first object, and the registration of the first object is performed.
The method according to claim 1, wherein the feature information includes descriptors of local feature points and descriptors of global features.
The method according to claim 1 or 2, wherein extracting feature information of a plurality of the first input images respectively comprises:

A plurality of the first input images are respectively input into the first network for pose recognition, and the pose of the first object in each of the first input images is obtained; wherein, the first network is used to identify the the pose of the first object;

According to the acquired pose of the first object, project the three-dimensional model of the first object to each of the first input images, respectively, to obtain a projection area in each of the first input images;

The feature information is extracted in each of the projection regions in the first input image, respectively.
The method according to claim 1 or 2, wherein extracting feature information of a plurality of the first input images respectively comprises:

A plurality of the first input images are respectively input into a first network for black and white image extraction, and a black and white image of the first object in each of the first input images is obtained; wherein, the first network is used to extract the a black and white image of the first object;

The feature information is extracted from the black and white images of the first object in each of the first input images, respectively.
The method according to any one of claims 1-4, wherein the method further comprises:

The N real images of the first object are respectively input into the first network to obtain the second pose of the first object in each real image output by the first network; the N is greater than or equal to 1; The first network is used to identify the pose of the first object in the image;

According to the three-dimensional model of the first object, using a differentiable renderer under each second pose, differentiable rendering is used to obtain N second composite images; obtaining a real image of the second pose and the one The second composite image obtained by rendering the second pose corresponds to;

respectively intercepting the area of the same position of the first object in the second composite image corresponding to the real object in each of the real images, as the foreground image of each of the real images;

A first loss function is constructed according to the first difference information of the foreground images of the N real images and their corresponding second composite images; wherein the first difference information is used to indicate the foreground images and their corresponding second composite images differences in images;

The differentiable renderer is updated according to the first loss function, so that the synthesized image of the output of the differentiable renderer approximates the real image of the object.
The method according to claim 5, wherein the first difference information includes one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
The method according to any one of claims 1-6, wherein the method further comprises:

acquiring a plurality of second input images including a first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; the plurality of second input images The third composite image is obtained by rendering the three-dimensional model of the first object in multiple third poses; the multiple third poses are different;

The plurality of second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each of the second input images output by the second network is obtained; The second network is used to identify the pose of the first object in the image;

According to the three-dimensional model of the first object, microrendering obtains a fourth composite image of the first object in each of the fourth poses; obtains a second input image of the fourth pose and the one fourth pose The fourth composite image obtained by pose rendering corresponds to;

A second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate the fourth composite image and its corresponding second loss function the difference of the input image;

The second network is updated according to the second loss function to obtain the first network; the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is less than The pose of the first object in the image recognized by the second network is different from the real pose of the first object in the image.
The method of claim 7, wherein:

The second loss function Loss 2 satisfies the following expression:
The X is greater than or equal to 1, the λ i is a weight, and the Li is used to represent the calculated value of the second difference information between the fourth composite image and the corresponding second input image.
The method according to claim 7 or 8, wherein the second difference information comprises one or more of the following: the black and white image of the first object in the fourth composite image and its corresponding black and white image In the second input image, the difference between the intersection ratio IOU of the black and white image of the first object, the fourth pose obtained from the fourth composite image, and the position obtained by the fourth composite image through the first network The difference in pose, and the similarity between the fourth composite image and its corresponding second input image and the region image at the same position as the first object in the fourth composite image.
A display method, characterized in that the method comprises:

get the first image;

If the first image includes one or more identifiable objects, output first information, where the first information is used to prompt detection that the first image includes identifiable objects; The pose detection network corresponding to each recognizable object obtains the pose of each recognizable object in the first image; and displays the corresponding pose of each recognizable object according to the pose of each recognizable object. virtual content;

If the first image does not include any identifiable object, output second information, the second information is used to prompt that no identifiable object is detected, adjust the viewing angle to obtain a second image, the second image and the The first image is different.
The method of claim 10, wherein the method further comprises:

extracting feature information in the first image, the feature information being used to indicate recognizable features in the first image;

Judging whether there is feature information whose matching distance with the feature information meets a preset condition in the feature library; wherein, the feature library stores one or more feature information of different objects;

If there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library, determine that the first image includes one or more identifiable objects;

If there is no feature information whose matching distance with the feature information satisfies a preset condition in the feature library, it is determined that the first image does not include any recognizable object.
The method of claim 10, wherein the method further comprises:

Obtaining one or more first local feature points in the first image, and the matching distance between the descriptors of the first local feature points and the descriptors of the local feature points in the feature library is less than or equal to a first threshold; The descriptors of local feature points of different objects are stored in the descriptor library;

According to the first local feature point, one or more regions of interest ROIs in the first image are determined; one ROI includes one object;

extracting global features in each of said ROIs;

If there are one or more first global features in the global features in each of the ROIs, it is determined that the first image includes identifiable objects corresponding to the first global features; wherein, the first global feature The matching distance between the descriptor and the descriptor of the global feature in the feature library is less than or equal to the second threshold; the descriptor of the global feature of different objects is also stored in the feature library;

If the first global feature does not exist in the global features in each of the ROIs, it is determined that the first image does not include any identifiable object.
The method according to any one of claims 10-12, wherein the method further comprises:

acquiring a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first input images the first composite image is obtained by differentiable rendering of the three-dimensional model of the first object under multiple first poses; the multiple first poses are different;

Extracting feature information of a plurality of the first input images respectively, the feature information is used to indicate the features of the first object in the first input image where it is located;

The feature information extracted from each of the first input images is stored in a feature library corresponding to the identifier of the first object, and the registration of the first object is performed.
The method according to claim 11 or 13, wherein the feature information includes descriptors of local feature points and descriptors of global features.
The method according to any one of claims 10-14, wherein the method further comprises:

acquiring a plurality of second input images including a first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; the plurality of second input images The third composite image is obtained by rendering the three-dimensional model of the first object in multiple third poses; the multiple third poses are different;

The plurality of second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each of the second input images output by the second network is obtained; The second network is used to identify the pose of the first object in the image;

According to the three-dimensional model of the first object, microrendering obtains a fourth composite image of the first object in each of the fourth poses; obtains a second input image of the fourth pose and the one fourth pose The fourth composite image obtained by pose rendering corresponds to;

A second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate the fourth composite image and its corresponding second loss function the difference of the input image;

The second network is updated according to the second loss function to obtain the first network; the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is less than The pose of the first object in the image recognized by the second network is different from the real pose of the first object in the image.
The method of claim 15, wherein:

The second loss function Loss 2 satisfies the following expression:
The X is greater than or equal to 1, the λ i is a weight, and the Li is used to represent the calculated value of the second difference information between the fourth composite image and the corresponding second input image.
The method according to claim 15 or 16, wherein the second difference information comprises one or more of the following: the black and white image of the first object in the fourth composite image and its corresponding black and white image In the second input image, the difference between the intersection ratio IOU of the black and white image of the first object, the fourth pose obtained from the fourth composite image, and the position obtained by the fourth composite image through the first network The difference in pose, and the similarity between the fourth composite image and its corresponding second input image in the region image at the same position as the first object in the fourth composite image.
An object registration device, characterized in that the device comprises:

A first acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first input images of the first object a composite image; the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses; the plurality of first poses are different;

an extraction unit, configured to separately extract a plurality of feature information of the first input image, where the feature information is used to indicate the feature of the first object in the first input image where it is located;

A registration unit, configured to perform the registration of the first object by corresponding the feature information extracted by the extraction unit in each of the first input images to the identifier of the first object.
The apparatus according to claim 18, wherein the feature information includes descriptors of local feature points and descriptors of global features.
The device according to claim 18 or 19, wherein the extraction unit is specifically used for:

A plurality of the first input images are respectively input into the first network for pose recognition, and the pose of the first object in each of the first input images is obtained; wherein, the first network is used to identify the the pose of the first object;

According to the acquired pose of the first object, project the three-dimensional model of the first object to each of the first input images, respectively, to obtain a projection area in each of the first input images;

The feature information is extracted in each of the projection regions in the first input image, respectively.
The device according to claim 18 or 19, wherein the extraction unit is specifically used for:

A plurality of the first input images are respectively input into a first network for black and white image extraction, and a black and white image of the first object in each of the first input images is obtained; wherein, the first network is used to extract the a black and white image of the first object;

The feature information is extracted from the black and white images of the first object in each of the first input images, respectively.
The device according to any one of claims 18-21, wherein the device further comprises:

a processing unit, configured to input N real images of the first object into the first network respectively, to obtain the second pose of the first object in each real image output by the first network; the N is greater than or equal to 1; the first network is used to identify the pose of the first object in the image;

The differentiable renderer is used to obtain N second composite images through differentiable rendering according to the three-dimensional model of the first object under each of the second poses; obtain a real image of the second pose and all the second poses. The second composite image obtained by rendering a second pose corresponds to;

an intercepting unit, configured to intercept the area of the same position of the first object in the second composite image corresponding to the real object in each of the real images, as the foreground image of each of the real images;

a construction unit, configured to construct a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second synthetic images; wherein the first difference information is used to indicate that the foreground images are different from the corresponding second synthetic images. the difference of the corresponding second composite image;

An update unit, configured to update the differentiable renderer according to the first loss function, so that the synthesized image output by the differentiable renderer approximates the real image of the object.
The apparatus according to claim 22, wherein the first difference information includes one or more of the following information: difference in feature map, difference in pixel color, and difference between extracted feature descriptors.
The device according to any one of claims 18-23, wherein the device further comprises:

a second acquiring unit, configured to acquire a plurality of second input images including a first object, the plurality of second input images including a real image of the first object and/or a plurality of third images of the first object a composite image; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; the plurality of third poses are different;

A processing unit, configured to input the plurality of second input images into the second network respectively for pose recognition, and obtain the fourth position of the first object in each of the second input images output by the second network pose; the second network is used to identify the pose of the first object in the image;

A differentiable renderer, configured to obtain a fourth composite image of the first object in each of the fourth poses through differentiable rendering according to the three-dimensional model of the first object; obtain a second input of the fourth pose The image corresponds to the fourth composite image obtained by rendering the one fourth pose;

a construction unit, configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate that the fourth composite image corresponds to it the difference of the second input image;

an update unit, configured to update the second network according to the second loss function to obtain a first network; the pose of the first object in the image identified by the first network and the real position of the first object in the image The difference in pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
The apparatus of claim 24, wherein:

The second loss function Loss 2 satisfies the following expression:
The X is greater than or equal to 1, the λ i is a weight, and the Li is used to represent the calculated value of the second difference information between the fourth composite image and the corresponding second input image.
The apparatus according to claim 24 or 25, wherein the second difference information includes one or more of the following: the black and white image of the first object in the fourth composite image and its corresponding black and white image In the second input image, the difference between the intersection ratio IOU of the black and white image of the first object, the fourth pose obtained from the fourth composite image, and the position obtained by the fourth composite image through the first network The difference in pose, and the similarity between the fourth composite image and its corresponding second input image in the region image at the same position as the first object in the fourth composite image.
A display device, characterized in that the device comprises a first acquisition unit, an output unit and a processing unit; wherein:

the first acquisition unit, configured to acquire a first image;

The output unit is configured to output first information if the first image includes one or more recognizable objects, where the first information is used to prompt detection that the first image includes recognizable objects; if The first image does not include any identifiable object, output second information, the second information is used to prompt that no identifiable object is detected, adjust the viewing angle so that the first acquisition unit acquires the second image, so the second image is different from the first image;

The processing unit is configured to, if the first image includes one or more identifiable objects, obtain the information in the first image through the pose detection network corresponding to each identifiable object included in the first image. The pose of each recognizable object; according to the pose of each recognizable object, the virtual content corresponding to each recognizable object is displayed.
The apparatus of claim 27, wherein the apparatus further comprises:

an extraction unit, configured to extract feature information in the first image, where the feature information is used to indicate recognizable features in the first image;

a judging unit for judging whether there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library; wherein, the feature library stores one or more feature information of different objects;

a first determining unit, configured to determine that the first image includes one or more identifiable objects if there is feature information whose matching distance with the feature information meets a preset condition in the feature library; If there is no feature information whose matching distance with the feature information satisfies a preset condition in the library, it is determined that the first image does not include any identifiable object.
The apparatus of claim 27, wherein the apparatus further comprises:

a second acquiring unit, configured to acquire one or more first local feature points in the first image, where the matching distance between the descriptors of the first local feature points and the descriptors of the local feature points in the feature library is less than or equal to the first threshold; descriptors of local feature points of different objects are stored in the feature library;

a second determining unit, configured to determine one or more regions of interest ROIs in the first image according to the first local feature points; an ROI includes an object;

an extraction unit for extracting global features in each of the ROIs;

A first determining unit, configured to determine that the first image includes a recognizable object corresponding to the first global feature if there are one or more first global features in the global features in each of the ROIs; The first global feature does not exist in the global features in the ROI, and it is determined that the first image does not include any identifiable object; wherein, the descriptor of the first global feature is the same as the one in the feature library. The matching distance of the descriptors of the global features is less than or equal to the second threshold; the descriptors of the global features of different objects are also stored in the feature library.
The device according to any one of claims 27-29, wherein the device further comprises:

a third acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first input images of the first object a composite image; the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses; the plurality of first poses are different;

an extraction unit, configured to separately extract a plurality of feature information of the first input image, where the feature information is used to indicate the feature of the first object in the first input image where it is located;

The registration unit is configured to store the feature information extracted from each of the first input images and the identifier of the first object in a feature database, and register the first object.
The apparatus according to claim 28 or 30, wherein the feature information includes descriptors of local feature points and descriptors of global features.
The device according to any one of claims 27-31, wherein the device further comprises:

a fourth acquisition unit, configured to acquire a plurality of second input images including a first object, the plurality of second input images including a real image of the first object and/or a plurality of third images of the first object a composite image; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; the plurality of third poses are different;

A processing unit, configured to input the plurality of second input images into the second network respectively for pose recognition, and obtain the fourth position of the first object in each of the second input images output by the second network pose; the second network is used to identify the pose of the first object in the image;

A differentiable renderer, configured to obtain a fourth composite image of the first object in each of the fourth poses through differentiable rendering according to the three-dimensional model of the first object; obtain a second input of the fourth pose The image corresponds to the fourth composite image obtained by rendering the one fourth pose;

a construction unit, configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate that the fourth composite image corresponds to it the difference of the second input image;

an update unit, configured to update the second network according to the second loss function to obtain a first network; the pose of the first object in the image identified by the first network and the real position of the first object in the image The difference in pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
The apparatus of claim 32, wherein

The second loss function Loss 2 satisfies the following expression:
The X is greater than or equal to 1, the λ i is a weight, and the Li is used to represent the calculated value of the second difference information between the fourth composite image and the corresponding second input image.
The apparatus according to claim 32 or 33, wherein the second difference information includes one or more of the following: the black and white image of the first object in the fourth composite image and its corresponding black and white image In the second input image, the difference between the intersection ratio IOU of the black and white image of the first object, the fourth pose obtained from the fourth composite image, and the position obtained by the fourth composite image through the first network The difference in pose, and the similarity between the fourth composite image and its corresponding second input image in the region image at the same position as the first object in the fourth composite image.
An electronic device, characterized in that the electronic device comprises: a processor and a memory;

The memory is connected to the processor; the memory is used for storing computer instructions, and when the processor executes the computer instructions, the electronic device is made to perform the method according to any one of claims 1-9. The object registration method, or, causing the electronic device to execute the display method according to any one of claims 10-17.
A computer-readable storage medium, characterized in that it includes an instruction that, when running on a computer, causes the computer to execute the object registration method described in any one of claims 1-9, or causes the computer to execute claim 10 The display method described in any one of -17.
A computer program product, characterized in that, when it is run on a computer, it causes the computer to execute the object registration method described in any one of claims 1-9, or causes the computer to execute any one of claims 10-17 The display method described in item.