WO2022143314A1 - Object registration method and apparatus - Google Patents


Info

Publication number
WO2022143314A1
Authority
WO
WIPO (PCT)
Prior art keywords: image, pose, input, network, images
Prior art date
Application number
PCT/CN2021/140241
Other languages
French (fr)
Chinese (zh)
Inventor
李尔
杨威
郑波
刘建滨
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2022143314A1

Classifications

    • G06F18/00 Pattern recognition
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/64 Three-dimensional objects
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • the embodiments of the present application relate to the field of computer vision, and in particular, to an object registration method and apparatus.
  • Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military. It concerns how to use cameras and computers to obtain the data and information of a photographed subject that are needed. Figuratively speaking, it means installing eyes (cameras/camcorders) and a brain (algorithms) on a computer so that the computer, instead of the human eye, can identify, track and measure targets and thus perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organ to obtain input information, and then uses the computer in place of the brain to process and interpret that input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • Pose detection and tracking of objects is a key technology in the field of computer vision, which can endow machines with the ability to perceive the three-dimensional spatial position and semantics of objects in a real environment.
  • In the related art, a multi-object pose estimation network is usually constructed first to identify the poses of recognizable objects from input images; then, using the three-dimensional (3 dimensions, 3D) models of the objects to be recognized, the multi-object pose estimation network is trained, so that the objects to be recognized are registered in the multi-object pose estimation network and the network can recognize the registered objects.
  • In use, an image is input to the multi-object pose estimation network, and the multi-object pose estimation network recognizes the poses of the objects in the image.
  • When a new object to be recognized needs to be registered in the multi-object pose estimation network, the user provides the 3D model of the newly added object, and the new 3D model together with the original 3D models is used to retrain the multi-object pose estimation network.
  • This retraining of the network leads to a linear increase in training time, and it affects the recognition effect of the multi-object pose estimation network on the recognizable objects that have already been trained, resulting in a decrease in detection accuracy and success rate.
  • The object registration method and apparatus provided by the present application solve the problem of how to improve the accuracy of the poses detected by the terminal device.
  • In a first aspect, an object registration method may include: acquiring a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object, where the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses and the plurality of first poses are different; extracting feature information of the plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located; and registering the first object by associating the feature information extracted from each first input image with the identifier of the first object.
  • In this way, the feature information of the object in the image is extracted for object registration; the registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time does not increase too much, and the detection performance of the registered objects is also guaranteed.
  • the above feature information may include descriptors of local feature points and descriptors of global features.
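  • As an illustration of this registration step, the following is a minimal Python sketch (the names and the record layout are hypothetical, not the patent's data structure): the descriptors extracted from each first input image are stored in a feature library keyed by the identifier of the first object.

        from collections import defaultdict

        # object identifier -> list of per-image feature records
        feature_library = defaultdict(list)

        def register_object(object_id, per_image_features):
            """per_image_features: iterable of dicts, each holding the local feature
            point descriptors and the global feature descriptor of one input image."""
            for feats in per_image_features:
                feature_library[object_id].append({
                    "local_descriptors": feats["local_descriptors"],  # e.g. N x 128 array
                    "global_descriptor": feats["global_descriptor"],  # e.g. 1 x D vector
                })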
  • The extracting of the feature information of the plurality of first input images respectively may be specifically implemented as: inputting the plurality of first input images into the first network respectively for pose recognition, and obtaining the pose of the first object in each first input image; projecting, according to the obtained poses of the first object, the three-dimensional model of the first object to each first input image respectively, to obtain a projection area in each first input image; and extracting feature information in the projection area in each first input image respectively.
  • the first network is used to identify the pose of the first object in the image. Through the pose of the first object in the image, the region of the first object in the image is determined, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
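  • Below is a minimal sketch of this projection step, under the assumption of a pinhole camera model and OpenCV; the function name and the use of a bounding box as the "projection area" are illustrative, not the patent's exact procedure. Feature information would then be extracted only inside the returned region rather than over the whole image.

        import cv2
        import numpy as np

        def projection_region(model_points, rvec, tvec, camera_matrix):
            # model_points: Mx3 vertices of the object's 3D model; rvec/tvec: the
            # recognized pose as a Rodrigues rotation vector and a translation vector;
            # camera_matrix: 3x3 pinhole intrinsics. Returns the bounding box
            # (x, y, w, h) of the projected object region in the input image.
            pts2d, _ = cv2.projectPoints(np.asarray(model_points, dtype=np.float32),
                                         rvec, tvec, camera_matrix, None)
            return cv2.boundingRect(pts2d.reshape(-1, 2).astype(np.float32))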
  • The extracting of the feature information of the plurality of first input images respectively may alternatively be implemented as: inputting the plurality of first input images into the first network respectively to extract black-and-white images, and obtaining a black-and-white image of the first object in each first input image; and extracting feature information from within the black-and-white image of the first object in each first input image respectively.
  • the first network is used to extract the black and white image of the first object in the image.
  • the black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
  • The object registration method provided by the present application may further include a process of optimizing the differentiable renderer. The process may include: inputting N real images of the first object into the first network respectively, to obtain the second pose of the first object in each real image output by the first network, where N is greater than or equal to 1 and the first network is used to identify the pose of the first object in the image; obtaining, according to the three-dimensional model of the first object, N second composite images by differentiable rendering with the differentiable renderer under each second pose, where one real image of a second pose corresponds to the second composite image rendered under that second pose; cropping, in each real image, the area at the same position as the first object in its corresponding second composite image, as the foreground image of that real image; constructing a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second composite images, where the first difference information is used to indicate the difference between a foreground image and its corresponding second composite image; and updating the differentiable renderer according to the first loss function, so that the composite image output by the differentiable renderer approximates the real image of the object.
  • the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
  • the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
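  • A hedged PyTorch sketch of such a first loss function follows (the particular combination of a pixel colour term and a feature-map term is illustrative; feat_net stands for an optional, hypothetical feature extractor): the loss sums, over the N real images, a calculated value of the first difference information between each foreground image and its corresponding second composite image, and its gradient can flow back into the differentiable renderer.

        import torch
        import torch.nn.functional as F

        def first_loss(foregrounds, composites, feat_net=None):
            """foregrounds, composites: lists of N image tensors of shape (C, H, W)."""
            loss = foregrounds[0].new_zeros(())
            for fg, syn in zip(foregrounds, composites):
                pixel_diff = F.l1_loss(syn, fg)              # pixel colour difference
                if feat_net is not None:                     # feature map difference
                    feat_diff = F.mse_loss(feat_net(syn.unsqueeze(0)),
                                           feat_net(fg.unsqueeze(0)))
                else:
                    feat_diff = syn.new_zeros(())
                loss = loss + pixel_diff + feat_diff
            return loss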
  • The object registration method provided by the present application may further include a method for training the object pose detection network, which may specifically include: acquiring a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object, where the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses and the plurality of third poses are different; inputting the plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network, where the second network is used to identify the pose of the first object in the image; obtaining, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object under each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose; constructing a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; and updating the second network according to the second loss function to obtain the first network, where the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  • The second loss function Loss 2 may satisfy the following expression: Loss 2 = Σ_{i=1}^{X} α_i · L_i, where X is greater than or equal to 1, α_i is a weight value, and L_i is used to represent a calculated value of the second difference information between a fourth composite image and its corresponding second input image.
  • This implementation provides a specific expression of the second loss function, so that the object pose detection network is trained, and the recognition accuracy of the pose detection network is improved.
  • The second difference information may include one or more of the following: the IOU difference between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image; the difference between the fourth pose used to render the fourth composite image and the pose obtained by inputting the fourth composite image into the first network; and the similarity between the fourth composite image and the area of the corresponding second input image at the same position as the first object in the fourth composite image.
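  • A small sketch of this loss in the form Loss 2 = Σ α_i · L_i follows (numpy assumed; the mask-IoU helper and the way each L_i would be computed are illustrative assumptions, not the patent's exact formulas).

        import numpy as np

        def mask_iou(mask_a, mask_b):
            # Intersection over union of two binary black-and-white images (masks).
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            return inter / union if union else 0.0

        def second_loss(terms, weights):
            """terms: the X calculated difference values L_i (e.g. a mask-IoU difference,
            a pose difference, a region-similarity term); weights: the weights alpha_i."""
            assert len(terms) == len(weights) and len(terms) >= 1   # X >= 1
            return sum(a * l for a, l in zip(weights, terms))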
  • In a second aspect, a display method may include: acquiring a first image; if the first image includes one or more recognizable objects, outputting first information, where the first information is used to prompt that recognizable objects are detected in the first image, obtaining, through the pose detection network corresponding to each recognizable object included in the first image, the pose of each recognizable object in the first image, and displaying, according to the pose of each recognizable object, the virtual content corresponding to each recognizable object; and if the first image does not include any recognizable object, outputting second information, where the second information is used to prompt that no recognizable object is detected, so that the viewing angle is adjusted to obtain a second image, the second image being different from the first image.
  • In this way, a prompt is output to the user indicating whether the image includes a recognizable object, so that the user can intuitively learn whether the image includes a recognizable object, which improves user experience.
  • The display method provided by the present application may further include: extracting feature information in the first image, where the feature information is used to indicate recognizable features in the first image; judging whether the feature library contains feature information whose matching distance to the extracted feature information satisfies a preset condition, where the feature library stores feature information of one or more different objects; if the feature library contains feature information whose matching distance to the extracted feature information satisfies the preset condition, determining that the first image includes one or more recognizable objects; and if the feature library contains no feature information whose matching distance to the extracted feature information satisfies the preset condition, determining that the first image does not include any recognizable object.
  • the preset condition may include less than or equal to a preset threshold.
  • The display method provided by the present application may further include: acquiring one or more first local feature points in the first image, where the matching distance between the descriptors of the first local feature points and the descriptors of local feature points in the feature library is less than or equal to a first threshold, the feature library storing descriptors of local feature points of different objects; determining, according to the first local feature points, one or more regions of interest (ROI) in the first image, where one ROI includes one object; extracting global features in each ROI; if one or more first global features exist among the global features in the ROIs, determining that the first image includes the recognizable objects corresponding to the first global features, where the matching distance between the descriptor of a first global feature and the descriptor of a global feature in the feature library is less than or equal to a second threshold, the feature library also storing descriptors of global features of different objects; and if no first global feature exists among the global features in the ROIs, determining that the first image does not include any recognizable object.
  • In this way, the ROIs are determined by comparing the local feature points of the image with the feature library, and then global features are extracted within the ROIs and compared with the feature library, which improves both the efficiency and the accuracy of judging whether the image contains recognizable objects.
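  • A hedged end-to-end sketch of this two-stage check follows (OpenCV and numpy assumed; thresholds are illustrative; global_descriptor is a caller-supplied, hypothetical global-feature extractor; library_local maps object identifiers to float32 local-descriptor matrices and library_global maps them to global descriptor vectors).

        import cv2
        import numpy as np

        def find_recognizable_objects(image, library_local, library_global,
                                      global_descriptor, t_local=250.0, t_global=0.5):
            # First stage: local feature points whose descriptor distance to the
            # feature library is small enough define one candidate ROI per object.
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
            sift = cv2.SIFT_create()
            kps, desc = sift.detectAndCompute(gray, None)
            if desc is None:
                return []                                  # no local features at all
            matcher = cv2.BFMatcher(cv2.NORM_L2)
            found = []
            for obj_id, lib_desc in library_local.items():
                matches = matcher.match(desc, lib_desc)
                close = [m for m in matches if m.distance <= t_local]
                if not close:
                    continue
                pts = np.float32([kps[m.queryIdx].pt for m in close])
                x, y, w, h = cv2.boundingRect(pts)         # candidate ROI for this object
                # Second stage: the global descriptor of the ROI must also match.
                g = global_descriptor(gray[y:y + h, x:x + w])
                if np.linalg.norm(g - library_global[obj_id]) <= t_global:
                    found.append(obj_id)
            return found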
  • The display method provided by the present application may further be used in combination with the object registration method provided in the foregoing first aspect. In this case, the display method may further include: acquiring a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object, where the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses and the plurality of first poses are different; extracting feature information of the plurality of first input images respectively, the feature information being used to indicate features of the first object in the first input image where it is located; and storing the feature information extracted from each first input image in the feature library in correspondence with the identifier of the first object, thereby registering the first object.
  • Object registration is performed by extracting the feature information of the objects in the images; the registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time does not increase too much, and the detection performance of the registered objects is also guaranteed.
  • the above feature information may include descriptors of local feature points and descriptors of global features.
  • The display method provided by the present application may further include a process of optimizing the differentiable renderer. The process may include: inputting N real images of the first object into the first network respectively, to obtain the second pose of the first object in each real image output by the first network, where N is greater than or equal to 1 and the first network is used to identify the pose of the first object in the image; obtaining, according to the three-dimensional model of the first object, N second composite images by differentiable rendering with the differentiable renderer under each second pose, where one real image of a second pose corresponds to the second composite image rendered under that second pose; cropping, in each real image, the area at the same position as the first object in its corresponding second composite image, as the foreground image of that real image; constructing a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second composite images, where the first difference information is used to indicate the difference between a foreground image and its corresponding second composite image; and updating the differentiable renderer according to the first loss function, so that the composite image output by the differentiable renderer approximates the real image of the object.
  • the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
  • the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
  • The object registration method provided by the present application may further include a method for training the object pose detection network, which may specifically include: acquiring a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object, where the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses and the plurality of third poses are different; inputting the plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network, where the second network is used to identify the pose of the first object in the image; obtaining, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object under each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose; constructing a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; and updating the second network according to the second loss function to obtain the first network, where the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  • The second loss function Loss 2 may satisfy the following expression: Loss 2 = Σ_{i=1}^{X} α_i · L_i, where X is greater than or equal to 1, α_i is a weight value, and L_i is used to represent a calculated value of the second difference information between a fourth composite image and its corresponding second input image.
  • This implementation provides a specific expression of the second loss function, optimizes the pose detection network, and improves the recognition accuracy of the pose detection network.
  • The second difference information may include one or more of the following: the IOU difference between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image; the difference between the fourth pose used to render the fourth composite image and the pose obtained by inputting the fourth composite image into the first network; and the similarity between the fourth composite image and the area of the corresponding second input image at the same position as the first object in the fourth composite image.
  • In a third aspect, the present application provides a method for training an object pose detection network. The method may specifically include: acquiring a plurality of second input images including a first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object, where the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses and the plurality of third poses are different; inputting the plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network, where the second network is used to identify the pose of the first object in the image; obtaining, by differentiable rendering, a fourth composite image of the first object under each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose; constructing a second loss function according to the second difference information of each fourth composite image and its corresponding second input image, the second difference information being used to indicate the difference between the fourth composite image and its corresponding second input image; and updating the second network according to the second loss function to obtain the first network, where the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  • the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
  • In a fourth aspect, an object registration apparatus is provided, including: a first acquisition unit, an extraction unit, and a registration unit, wherein:
  • a first acquiring unit configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; a plurality of The first composite image is obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses; the multiple first poses are different.
  • the extraction unit is configured to extract feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.
  • the registration unit is used for registering the first object by corresponding the feature information extracted by the extraction unit in each first input image to the identifier of the first object.
  • In this way, the feature information of the object in the image is extracted to register the object; the registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time does not increase too much, and the detection performance of the registered objects is also guaranteed.
  • the feature information includes descriptors of local feature points and descriptors of global features.
  • The extraction unit may be specifically configured to: input the plurality of first input images into the first network respectively for pose recognition, and obtain the pose of the first object in each first input image; project, according to the obtained poses of the first object, the three-dimensional model of the first object to each first input image respectively, to obtain a projection area in each first input image; and extract feature information in the projection area in each first input image respectively.
  • the first network is used to identify the pose of the first object in the image. Through the pose of the first object in the image, the region of the first object in the image is determined, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
  • the extraction unit may be specifically configured to: input a plurality of first input images into a first network respectively to extract black and white images, and obtain a black and white image of the first object in each of the first input images; Within the black and white image of the first object in each of the first input images, feature information is extracted.
  • the first network is used to extract the black and white image of the first object in the image.
  • the black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
  • The apparatus may further include: a processing unit, a differentiable renderer, a cropping unit, a construction unit, and an update unit, wherein:
  • the processing unit is configured to input the N real images of the first object into the first network respectively, and obtain the second pose of the first object in each real image output by the first network.
  • N is greater than or equal to 1.
  • the first network is used to identify the pose of the first object in the image.
  • The differentiable renderer is configured to obtain, according to the three-dimensional model of the first object, N second composite images by differentiable rendering under each second pose, where one real image of a second pose corresponds to the second composite image rendered under that second pose.
  • The cropping unit is configured to crop, in each real image, the area at the same position as the first object in its corresponding second composite image, as the foreground image of that real image.
  • the construction unit is configured to construct a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second synthetic images.
  • the first difference information is used to indicate the difference between the foreground image and its corresponding second composite image.
  • the updating unit is configured to update the differentiable renderer according to the first loss function, so that the synthetic image output by the differentiable renderer approximates the real image of the object.
  • the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.
  • the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
  • the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
  • The apparatus may further include: a second acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit, wherein:
  • The second acquiring unit is configured to acquire a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses.
  • the multiple third poses are different.
  • the processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.
  • The differentiable renderer is configured to obtain, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object in each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose.
  • the construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image.
  • the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
  • the updating unit is configured to update the second network according to the second loss function to obtain the first network.
  • the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the pose of the first object in the image recognized by the second network and the real pose of the first object in the image difference.
  • the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
  • The second loss function Loss 2 may satisfy the following expression: Loss 2 = Σ_{i=1}^{X} α_i · L_i, where X is greater than or equal to 1, α_i is a weight value, and L_i is used to represent a calculated value of the second difference information between a fourth composite image and its corresponding second input image.
  • This implementation provides a specific expression of the second loss function, optimizes the pose detection network, and improves the recognition accuracy of the pose detection network.
  • The second difference information may include one or more of the following: the IOU difference between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image; the difference between the fourth pose used to render the fourth composite image and the pose obtained by inputting the fourth composite image into the first network; and the similarity between the fourth composite image and the area of the corresponding second input image at the same position as the first object in the fourth composite image.
  • object registration apparatus provided in the fourth aspect is used to implement the object registration method provided in the first aspect, and the specific implementation can refer to the specific implementation of the foregoing first aspect, which will not be repeated here.
  • In a fifth aspect, a display device is provided, including a first acquisition unit, an output unit, and a processing unit, wherein:
  • the first acquisition unit is used to acquire the first image.
  • The output unit is configured to: output first information if the first image includes one or more recognizable objects, where the first information is used to prompt that recognizable objects are detected in the first image; and output second information if the first image does not include any recognizable object, where the second information is used to prompt that no recognizable object is detected, so that the viewing angle is adjusted and the first acquisition unit acquires a second image, the second image being different from the first image.
  • the processing unit is configured to obtain the pose of each recognizable object in the first image through the pose detection network corresponding to each recognizable object included in the first image if the first image includes one or more recognizable objects ; According to the pose of each recognizable object, display the virtual content corresponding to each recognizable object.
  • In this way, a prompt is output to the user indicating whether the image includes a recognizable object, so that the user can intuitively learn whether the image includes a recognizable object, which improves user experience.
  • The apparatus may further include: an extraction unit, a judgment unit, and a first determination unit, wherein:
  • An extraction unit configured to extract feature information in the first image, where the feature information is used to indicate recognizable features in the first image.
  • the judgment unit is used for judging whether there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library.
  • the feature library stores one or more feature information of different objects.
  • The first determination unit is configured to: if the feature library contains feature information whose matching distance to the extracted feature information satisfies the preset condition, determine that the first image includes one or more recognizable objects; and if the feature library contains no feature information whose matching distance to the extracted feature information satisfies the preset condition, determine that the first image does not include any recognizable object.
  • the preset condition may include less than or equal to a preset threshold.
  • The apparatus may further include: a second acquiring unit, a second determining unit, and a first determining unit, wherein:
  • the second acquiring unit is configured to acquire one or more first local feature points in the first image, and the matching distance between the descriptors of the first local feature points and the descriptors of the local feature points in the feature library is less than or equal to the first Threshold; descriptors of local feature points of different objects are stored in the feature library.
  • the second determining unit is configured to determine one or more ROIs in the first image according to the first local feature points; one ROI includes one object.
  • the extraction unit can also be used to extract global features in each ROI.
  • The first determining unit is configured to: if one or more first global features exist among the global features in the ROIs, determine that the first image includes the recognizable objects corresponding to the first global features; and if no first global feature exists among the global features in the ROIs, determine that the first image does not include any recognizable object.
  • the matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold. Descriptors of global features of different objects are also stored in the feature library.
  • In this way, the ROIs are determined by comparing the local feature points of the image with the feature library, and then global features are extracted within the ROIs and compared with the feature library, which improves both the efficiency and the accuracy of judging whether the image contains recognizable objects.
  • The apparatus may further include: a third acquisition unit, an extraction unit, and a registration unit, wherein:
  • The third acquiring unit is configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses; the plurality of first poses are different.
  • the registration unit is configured to store the feature information extracted from each first input image corresponding to the identifier of the first object in the feature database, and register the first object.
  • the above feature information may include descriptors of local feature points and descriptors of global features.
  • The apparatus may further include: a fourth acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit, wherein:
  • The fourth acquisition unit is configured to acquire a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; the plurality of third poses are different.
  • the processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.
  • The differentiable renderer is configured to obtain, according to the three-dimensional model of the first object, a fourth composite image of the first object in each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose.
  • the construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image difference.
  • The updating unit is configured to update the second network according to the second loss function to obtain the first network; the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  • The second loss function Loss 2 may satisfy the following expression: Loss 2 = Σ_{i=1}^{X} α_i · L_i, where X is greater than or equal to 1, α_i is a weight value, and L_i is used to represent a calculated value of the second difference information between a fourth composite image and its corresponding second input image.
  • This implementation provides a specific expression of the second loss function, trains the pose detection network, and improves the recognition accuracy of the pose detection network.
  • The second difference information may include one or more of the following: the IOU difference between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image; the difference between the fourth pose used to render the fourth composite image and the pose obtained by inputting the fourth composite image into the first network; and the similarity between the fourth composite image and the area of the corresponding second input image at the same position as the first object in the fourth composite image.
  • the display device provided in the fifth aspect is used to implement the display method provided in the above-mentioned second aspect, and the specific implementation thereof can refer to the specific implementation of the foregoing second aspect, which will not be repeated here.
  • In a sixth aspect, an apparatus for training an object pose detection network is provided, which may include: an acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit, wherein:
  • an acquiring unit configured to acquire multiple second input images including the first object, the multiple second input images including the real image of the first object and/or multiple third composite images of the first object; multiple third The composite image is rendered by the three-dimensional model of the first object in multiple third poses; the multiple third poses are different.
  • the processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.
  • The differentiable renderer is configured to obtain, according to the three-dimensional model of the first object, a fourth composite image of the first object in each fourth pose, where one second input image of a fourth pose corresponds to the fourth composite image rendered under that fourth pose.
  • the construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image difference.
  • The updating unit is configured to update the second network according to the second loss function to obtain the first network; the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  • the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
  • The device for optimizing the pose recognition network provided in the sixth aspect is used to implement the method for optimizing the pose recognition network provided in the third aspect; for its specific implementation, reference may be made to the specific implementation of the foregoing third aspect, which will not be repeated here.
  • In another aspect, the present application provides an electronic device, which can implement the functions in the method examples described in the first aspect, the second aspect, or the third aspect. The functions can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the electronic device may exist in the form of a chip product.
  • In another aspect, a computer-readable storage medium is provided, comprising instructions that, when run on a computer, cause the computer to execute the object registration method, the display method, or the method for training an object pose detection network described in any one of the above aspects or any possible implementation manner thereof.
  • A ninth aspect provides a computer program product that, when running on a computer, enables the computer to execute the object registration method, the display method, or the method for training an object pose detection network described in any one of the above aspects or any possible implementation manner thereof.
  • In a tenth aspect, a chip system is provided, which includes a processor and may further include a memory, configured to implement the functions in the above methods.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a software structure of a terminal device according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a virtual-real fusion effect of explanation information and a real object provided by an embodiment of the present application;
  • FIG. 5a is a schematic flowchart of a method for incremental learning provided by an embodiment of the present application.
  • FIG. 5b is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 7a is a schematic diagram of a chip hardware structure provided by an embodiment of the application.
  • FIG. 7b is a schematic diagram of the overall system framework of the solution provided by the embodiment of the present application.
  • FIG. 8 is a schematic flowchart of an object registration method provided in Embodiment 1 of the present application.
  • FIG. 9 is a schematic diagram of a spherical surface provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of another object registration method provided in Embodiment 1 of the present application.
  • FIG. 11 is a schematic diagram of an image area provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of another optimization method provided in Embodiment 2 of the present application.
  • FIG. 15 is a schematic flowchart of a method for training an object pose detection network according to Embodiment 3 of the present application.
  • FIG. 16 is a schematic flowchart of another method for training an object pose detection network provided in Embodiment 3 of the present application.
  • FIG. 17 is a schematic flowchart of a display method provided in Embodiment 4 of the present application.
  • FIG. 19 is a schematic diagram of a mobile phone interface provided by an embodiment of the application.
  • FIG. 20a is a schematic diagram of another mobile phone interface provided by an embodiment of the present application.
  • FIG. 20b is a schematic diagram of still another mobile phone interface provided by an embodiment of the application.
  • FIG. 21 is a schematic structural diagram of an object registration apparatus provided by an embodiment of the present application.
  • FIG. 22 is a schematic structural diagram of an optimization device provided by an embodiment of the application.
  • FIG. 24 is a schematic structural diagram of a display device according to an embodiment of the present application.
  • FIG. 25 is a schematic structural diagram of an apparatus provided by an embodiment of the present application.
  • words such as “exemplary” or “for example” are used to represent examples, illustrations or illustrations. Any embodiments or designs described in the embodiments of the present application as “exemplary” or “such as” should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present the related concepts in a specific manner to facilitate understanding.
  • At least one may also be described as one or more, and the multiple may be two, three, four or more, which is not limited in this application.
  • the network architecture and scenarios described in the embodiments of the present application are for the purpose of illustrating the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application.
  • the evolution of the network architecture and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
  • An image, also known as a picture, is a picture that has a visual effect.
  • the image referred to in this application may be a still image, a video frame in a video stream, or other, which is not limited.
  • An object is a person or thing that can exist.
  • things can be buildings, commodities, plants, animals, etc., which are not listed here.
  • the pose refers to the pose of the object in the camera coordinate system.
  • the poses can include 6DoF poses, that is, the translation pose and rotation pose of the object relative to the camera.
  • Pose detection refers to detecting and recognizing the pose of an object in an image.
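  • As a small numeric illustration (numpy assumed; the pose values are made up), a 6DoF pose combines the rotation and translation of the object relative to the camera and is often written as a 4x4 homogeneous transform from object coordinates to camera coordinates:

        import numpy as np

        def pose_matrix(rotation_3x3, translation_3):
            # Assemble a 4x4 object-to-camera transform from a rotation and a translation.
            T = np.eye(4)
            T[:3, :3] = rotation_3x3
            T[:3, 3] = translation_3
            return T

        # Example: object 0.5 m in front of the camera, rotated 90 degrees about Z.
        Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
        T_obj_to_cam = pose_matrix(Rz, np.array([0.0, 0.0, 0.5]))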
  • The real image of an object refers to a static image with a visual effect that contains the object and a background area.
  • The real image of the object can be in RGB (red, green, blue) format or RGBD (red, green, blue, depth) format.
  • Rendering is the process of converting a 3D model of an object into a 2D image through a renderer.
  • scenes and entities are represented in three-dimensional form, which can be closer to the real world and facilitate manipulation and transformation, while graphics display devices are mostly two-dimensional rasterized displays and dot-matrix printers.
  • a raster display can be regarded as a pixel matrix, and any graphic displayed on a raster display is actually a collection of pixels with one or more colors and grayscales.
  • the representation of a three-dimensional solid scene through rasterization and lattice is image rendering -- that is, rasterization.
  • Conventional rendering refers to a rendering method in which rasterization is not differentiable.
  • Differentiable rendering refers to a rendering method in which rasterization is differentiable. Since the rendering process is differentiable, a loss function can be constructed according to the difference between the rendered image and the real image, the parameters of the differentiable rendering can be updated, and the authenticity of the differentiable rendering results can be improved.
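  • The toy PyTorch sketch below illustrates why differentiability matters; it is only a conceptual stand-in, since a real differentiable renderer rasterizes a 3D model, whereas here a 1D Gaussian blob whose position plays the role of the pose is "rendered". Because the rendering function is differentiable with respect to its parameter, a loss between the rendered result and a target image can be backpropagated to update that parameter.

        import torch

        def toy_render(pos, width=64):
            xs = torch.arange(width, dtype=torch.float32)
            return torch.exp(-0.5 * ((xs - pos) / 8.0) ** 2)    # differentiable w.r.t. pos

        target = toy_render(torch.tensor(40.0))                  # stands in for the real image
        pos = torch.tensor(30.0, requires_grad=True)             # initial parameter guess
        opt = torch.optim.Adam([pos], lr=0.5)
        for _ in range(300):
            opt.zero_grad()
            loss = ((toy_render(pos) - target) ** 2).mean()      # photometric difference
            loss.backward()                                       # gradient flows through rendering
            opt.step()                                            # pos moves toward 40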
  • the composite image of an object refers to an image containing only the object obtained by rendering the 3D model of the object in a desired pose.
  • the composite image of an object in a certain pose is equivalent to the image obtained by taking pictures of the object in this pose.
  • The image corresponding to a composite image refers to the image that is input to the neural network to obtain the pose used to render that composite image. It should be understood that a composite image and the source image from which its rendering pose is obtained correspond to each other, which will not be described in detail below.
  • the image corresponding to the composite image may be a real image of the object, or may be other composite images of the object.
  • a black-and-white image of an object is a black-and-white pixel image that contains only the object and no background.
  • the black-and-white image of the object can be represented by a binarized image, the pixel value of the region including the object is 1, and the pixel value of other regions is 0.
  • Local feature points are local expressions of image features, which reflect the local characteristics of the image. Local feature points are points in the image that are clearly distinguishable from other pixels, including but not limited to corner points, key points, and the like. In image processing, local feature points mainly refer to scale-invariant points or blocks. Scale invariance means that when the same object or scene is captured from different angles, the same place can still be identified as the same.
  • the local feature points may include SIFT feature points, SURF feature points, DAISY feature points, and the like.
  • the local feature points of the image can be extracted by methods such as FAST and DOG.
  • the descriptor of a local feature point is a high-dimensional vector representing the local image information of the feature point.
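  • For illustration only (assuming an OpenCV build that provides SIFT and FAST; the random input image is a stand-in for a real photo), local feature points and their descriptors could be extracted as follows:
```python
import cv2
import numpy as np

img = np.random.randint(0, 256, (240, 320), dtype=np.uint8)  # stand-in for a real photo

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# each SIFT descriptor is a 128-dimensional vector describing the patch around a keypoint

fast = cv2.FastFeatureDetector_create()
corners = fast.detect(img, None)          # FAST finds corner-like local feature points
print(len(keypoints), len(corners))
```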
  • the global feature refers to the feature that can represent the entire image.
  • the global feature is relative to the local features of the image and is used to describe the overall features such as the color and shape of the image or target.
  • global features may include color features, texture features, shape features, and the like.
  • the bag-of-words method can be used to extract the global features of the image.
  • the descriptor of the global feature is a high-dimensional vector that represents the image information of the entire image or a larger area.
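  • A minimal sketch of one possible global descriptor, here a normalized color histogram computed over the whole image (the histogram parameters are illustrative assumptions, not part of the embodiments):
```python
import numpy as np

def global_color_descriptor(image, bins=8):
    # image: H x W x 3 uint8 array; returns a normalized colour histogram over the whole image
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
descriptor = global_color_descriptor(image)   # a 512-dimensional global feature vector
print(descriptor.shape)
```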
  • the calculated value may refer to a mathematical calculation value of multiple data, and the mathematical calculation may be an average, a maximum, or a minimum, or others.
  • a terminal device can implement a virtual reality (virtual reality, VR) function, so that the user is in the virtual world and experiences the virtual world.
  • a terminal device can implement an augmented reality (AR) function, combine virtual objects with real scenes, and enable users to interact with virtual objects.
  • the terminal device may be a smartphone, a tablet computer, a wearable device, an AR/VR device, or the like. This application does not limit the specific form of the terminal device.
  • Wearable devices, also called wearable smart devices, is a general term for devices that apply wearable technology to the intelligent design and development of items worn daily, such as glasses, gloves, watches, clothing, and shoes.
  • a wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. Wearable device is not only a hardware device, but also realizes powerful functions through software support, data interaction, and cloud interaction.
  • wearable smart devices include devices that are full-featured and large-sized and can implement complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that only focus on a certain type of application function and need to be used in cooperation with other devices such as smartphones.
  • the structure of the terminal device may be as shown in FIG. 1 .
  • the terminal device 100 may include a processor 110 , an external memory interface 120 , an internal memory 121 , a universal serial bus (USB) interface 130 , a charging management module 140 , a power management module 141 , and a battery 142 , Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193 , a display screen 194, and a subscriber identification module (subscriber identification module, SIM) card interface 195 and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the terminal device 100 .
  • the terminal device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors. For example, in the present application, the processor 110 may control to turn on other cameras when it is determined that the first image satisfies the abnormal condition.
  • the controller may be the nerve center and command center of the terminal device 100 .
  • the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 communicates with the camera 193 through a CSI interface to implement the shooting function of the terminal device 100.
  • the processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the terminal device 100 .
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like.
  • the GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like.
  • the USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones. This interface can also be used to connect other terminal devices, such as AR devices.
  • the interface connection relationship between the modules illustrated in this embodiment is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 .
  • the terminal device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160.
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
  • the terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display screen 194 is used to display images, videos, and the like.
  • Display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
  • the terminal device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
  • a series of graphical user interfaces may be displayed on the display screen 194 of the terminal device 100 , and these GUIs are the main screens of the terminal device 100 .
  • the size of the display screen 194 of the terminal device 100 is fixed, and only limited controls can be displayed in the display screen 194 of the terminal device 100 .
  • a control is a GUI element, which is a software component included in an application that controls all the data processed by the application and the interaction with this data. The user can interact with the control through direct manipulation, so as to read or edit the relevant information of the application.
  • controls may include icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, and other visual interface elements.
  • the terminal device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used to process the data fed back by the camera 193 .
  • When the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the photosensitive element transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • the object is projected through the lens to generate an optical image onto the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the terminal device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, and the like.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs.
  • the terminal device 100 can play or record videos in various encoding formats, for example, moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing the instructions stored in the internal memory 121 .
  • the processor 110 may acquire the pose of the terminal device 100 by executing the instructions stored in the internal memory 121 .
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
  • the storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal device 100 and the like.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • Speaker 170A also referred to as a "speaker" is used to convert audio electrical signals into sound signals.
  • the terminal device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can make a sound with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the terminal device 100 may be provided with at least one microphone 170C.
  • the terminal device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals.
  • the terminal device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
  • the earphone interface 170D can be the USB interface 130, or can be a 3.5mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 180A may be provided on the display screen 194 .
  • the capacitive pressure sensor may be comprised of at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes.
  • the terminal device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the terminal device 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the terminal device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
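  • A hedged sketch of this kind of intensity-based dispatch (the threshold value and instruction names are hypothetical):
```python
def handle_touch(intensity, first_pressure_threshold=0.5):
    # same touch position, different touch intensity -> different operation instruction
    if intensity < first_pressure_threshold:
        return "view_short_message"
    return "create_new_short_message"

print(handle_touch(0.3))   # below the threshold: view the short message
print(handle_touch(0.9))   # at or above the threshold: create a new short message
```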
  • the gyro sensor 180B may be used to determine the motion attitude of the terminal device 100 .
  • In some embodiments, the angular velocity of the terminal device 100 about three axes (i.e., the x, y, and z axes) can be determined through the gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization.
  • the gyro sensor 180B detects the shaking angle of the terminal device 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the terminal device 100 through reverse motion to achieve anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
  • the air pressure sensor 180C is used to measure air pressure.
  • the terminal device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor.
  • the terminal device 100 can detect the opening and closing of the flip holster using the magnetic sensor 180D.
  • the terminal device 100 can detect the opening and closing of the flip cover according to the magnetic sensor 180D, and further set characteristics such as automatic unlocking upon flipping open according to the detected opening and closing state of the leather case or of the flip cover.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the terminal device 100 in various directions (generally three axes).
  • the magnitude and direction of gravity can be detected when the terminal device 100 is stationary. It can also be used to identify the posture of terminal devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
  • the terminal device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the terminal device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes.
  • the light emitting diodes may be infrared light emitting diodes.
  • the terminal device 100 emits infrared light to the outside through the light emitting diode.
  • the terminal device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device 100 . When insufficient reflected light is detected, the terminal device 100 may determine that there is no object near the terminal device 100 .
  • the terminal device 100 can use the proximity light sensor 180G to detect that the user holds the terminal device 100 close to the ear to talk, so as to automatically turn off the screen to save power.
  • Proximity light sensor 180G can also be used in holster mode, pocket mode automatically unlocks and locks the screen.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal device 100 is in a pocket, so as to prevent accidental touch.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
  • the temperature sensor 180J is used to detect the temperature.
  • the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute the temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal device 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection.
  • In some other embodiments, when the temperature is lower than another threshold, the terminal device 100 heats the battery 142 to avoid abnormal shutdown of the terminal device 100 caused by the low temperature.
  • the terminal device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
  • The touch sensor 180K is also called a "touch device".
  • the touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to touch operations may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the terminal device 100 , which is different from the position where the display screen 194 is located.
  • the bone conduction sensor 180M can acquire vibration signals.
  • the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice.
  • the bone conduction sensor 180M can also contact the pulse of the human body and receive the blood pressure beating signal.
  • the bone conduction sensor 180M can also be disposed in the earphone, combined with the bone conduction earphone.
  • the audio module 170 can analyze the voice signal based on the vibration signal of the voice vibration bone block obtained by the bone conduction sensor 180M, and realize the voice function.
  • the application processor can analyze the heart rate information based on the blood pressure beat signal obtained by the bone conduction sensor 180M, and realize the function of heart rate detection.
  • the keys 190 include a power-on key, a volume key, and the like. Keys 190 may be mechanical keys. It can also be a touch key.
  • the terminal device 100 may receive key input and generate key signal input related to user settings and function control of the terminal device 100 .
  • Motor 191 can generate vibrating cues.
  • the motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback.
  • touch operations acting on different applications can correspond to different vibration feedback effects.
  • the motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 .
  • Touch operations in different application scenarios (for example, time reminders, receiving information, alarm clocks, games, etc.) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
  • an operating system runs on the above-mentioned components.
  • For example, the operating system may be the iOS operating system developed by Apple, the Android open source operating system developed by Google, or the Windows operating system developed by Microsoft.
  • Applications can be installed and run on this operating system.
  • the operating system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .
  • FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.
  • the camera application can access the camera interface management service provided by the application framework layer.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.
  • the application framework layer may provide the application layer with APIs related to the photographing function, and provide the application layer with a camera interface management service to realize the photographing function.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
  • Content providers are used to store and retrieve data and make these data accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications.
  • a display interface can consist of one or more views.
  • the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
  • the telephony manager is used to provide the communication function of the terminal device 100 .
  • For example, the management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the terminal device vibrates, and the indicator light flashes.
  • Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
  • a system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: moving picture experts group (MPEG) 4, H.264, MP3, advanced audio coding (AAC), adaptive multi-rate (AMR), joint photographic experts group (JPEG), portable network graphics (PNG), and so on.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • Two-dimensional (2 dimensions, 2D) graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
  • the three-dimensional object registration method provided by the embodiments of the present application can be applied to AR, VR, and scenarios where virtual content of an object needs to be displayed.
  • the three-dimensional object registration method in this embodiment of the present application can be applied in an AR scene.
  • the following describes the workflow of the software and hardware of the terminal device 100 with reference to FIG. 1 and the AR scene.
  • the touch sensor 180K receives the touch operation and reports it to the processor 110 , so that the processor starts the AR application in response to the above-mentioned touch operation, and displays the user interface of the AR application on the display screen 194 .
  • the touch sensor 180K reports the touch operation on the AR icon to the processor 110, so that the processor 110, in response to the above touch operation, starts the AR application corresponding to the AR icon and displays the AR user interface on the display screen 194.
  • the terminal may also enable AR in other manners, and display the AR user interface on the display screen 194.
  • the terminal when the terminal is on a black screen, displays a lock screen interface, or displays a certain user interface after being unlocked, the terminal may start AR in response to a user's voice command or a shortcut operation, and display the AR user interface on the display screen 194 .
  • In the terminal device 100, a network for detecting the pose of an object is configured.
  • the terminal device 100 captures an image in the field of view through the camera 193, the network that detects the pose of the object recognizes the identifiable objects included in the image and acquires the poses of the recognizable objects, and then the virtual content corresponding to the recognizable objects is superimposed and displayed on the captured image through the display screen 194 according to the acquired poses.
  • the terminal device 100 recognizes the identifiable objects in the image and, according to the position and pose of each identifiable object, presents the corresponding explanation information to the user by accurately superimposing it on different positions of the three-dimensional object in three-dimensional space.
  • Figure 3 shows the virtual-real fusion effect of the explanation information and the real object, which can intuitively and vividly present the information of the real object in the image to the user.
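  • As a non-limiting sketch of how virtual content can be placed according to an object pose (the intrinsics K and the pose R, t below are made-up example values), a 3D anchor point on the object can be projected into the image with a pinhole camera model:
```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],     # example pinhole camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # example rotation of the object in the camera frame
t = np.array([0.0, 0.0, 2.0])          # example translation of the object in the camera frame

def project(point_obj):
    # transform an object-frame point into the camera frame, then project to pixel coordinates
    p_cam = R @ point_obj + t
    uv = K @ p_cam
    return uv[:2] / uv[2]

anchor = np.array([0.1, 0.0, 0.0])     # a hypothetical anchor point on the object surface
u, v = project(anchor)                 # pixel position at which to draw the virtual content
print(u, v)
```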
  • the network configured in the terminal device 100 for detecting the pose of the object is usually trained offline according to the three-dimensional model of the object, so that the network supports pose recognition of the object to be recognized, that is, the registration of the object to be recognized in the network is completed.
  • recognizable objects are usually added based on machine learning methods: for each new object, all recognizable objects need to be retrained, which leads to a linear increase in training time and affects the recognition effect of the already trained objects.
  • Figure 4 illustrates an existing pose detection process.
  • a picture containing an object is input to a multi-object pose estimation network, the multi-object pose estimation network outputs the poses and categories of the recognizable objects contained in the picture, and pose optimization is performed on the output poses.
  • the multi-object pose estimation network used in this process is generated by offline training.
  • when the user needs to add a new identifiable object, it needs to be retrained together with all the identifiable objects supported by the original network to obtain a new multi-object pose estimation network, thereby supporting pose detection of the newly identifiable object.
  • re-training of new objects will lead to a sharp increase in training time, and new objects will affect the pose detection effect of the trained objects, resulting in a decrease in detection accuracy and success rate.
  • the process of this method can be shown in Figure 5a.
  • the user submits the 3D model of the object to be recognized, and a multi-object pose estimation network M0 is trained according to the submitted 3D model; when a new identifiable object is added, the user submits the 3D model of the new object to be recognized and, based on the trained network M0, incremental training is performed using a small amount of data of the already trained objects to obtain a new network M1; when more recognizable objects are added, the user continues to submit the 3D models of the objects expected to be recognized and, based on the trained network M1, incremental training is performed using a small amount of data of the trained objects to obtain a new network M2, and so on.
  • this scheme only uses a small amount of data of the trained object, and performs incremental learning on the basis of the existing model, which can greatly reduce the retraining time.
  • incremental learning faces the problem of catastrophic forgetting, that is, the training of new models only refers to a small amount of data of recognized objects, and as the number of new objects increases, the performance of already trained objects will drop sharply.
  • the present application provides a three-dimensional object registration method, which specifically includes: configuring a single-object pose detection network; constructing a loss function using the difference between the real image of a three-dimensional object and the composite images obtained by differentiable rendering under multiple poses, so as to train the single-object pose detection network and obtain a pose detection network for the three-dimensional object; and then using the trained pose detection network to extract the features of the three-dimensional object from the real image of the three-dimensional object and from the composite images differentiably rendered in multiple poses, recording the features and the identification of the three-dimensional object, and completing the registration of the three-dimensional object.
  • because the three-dimensional object registration method provided in this application adopts a single-object pose detection network, even if a new recognizable object is added, the training time is short and the recognition effect of other recognizable objects is not affected; in addition, the loss function built from the composite images differentiably rendered in multiple poses improves the accuracy of the single-object pose detection network.
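  • Purely as an illustration of what registration records (the histogram-based extract_feature below is a stand-in; in the method above the features would come from the trained single-object pose detection network), the following sketch stores one feature vector per real or composite image under the object's identification:
```python
import numpy as np

registry = {}

def extract_feature(image):
    # stand-in global descriptor: a normalized intensity histogram of the image
    hist, _ = np.histogram(image, bins=32, range=(0.0, 1.0))
    return hist / (hist.sum() + 1e-8)

def register_object(object_id, images):
    # images: real images of the object plus composite images rendered in multiple poses
    registry[object_id] = [extract_feature(img) for img in images]

images = [np.random.rand(64, 64) for _ in range(6)]
register_object("object_001", images)
print(len(registry["object_001"]))     # one recorded feature vector per input image
```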
  • the method for obtaining the pose of an object involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning, which perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training, etc. on training data to finally obtain a trained single-object pose detection network; moreover, the object registration method provided in the embodiments of the present application can use the above trained single-object pose detection network: the input data (such as the image including the object to be recognized in this application) is input into the trained single-object pose detection network corresponding to the object to be recognized to obtain the output data (such as the pose of the recognizable object in the image in this application).
  • the training method of the single-object pose detection network and the object registration method are inventions based on the same concept, and can also be understood as two parts of one system, or two stages of an overall process, such as a model training stage and a model application stage.
  • Neural network is a machine learning model, which is a kind of machine learning technology that simulates the neural network of the human brain to realize artificial intelligence.
  • the input and output of the neural network can be configured according to actual needs, and the neural network can be trained through sample data, so that the error between its output and the real output corresponding to the sample data is minimized.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept 1 as inputs; the output of the operation unit can be: h_{W,b}(x) = f(W^T x + b) = f(∑_s W_s x_s + b),
  • where W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
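  • A minimal numerical sketch of such a neural unit with a sigmoid activation (the input, weight, and bias values are arbitrary examples):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # output = f(sum_s W_s * x_s + b), with f the sigmoid activation function
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
W = np.array([0.1, 0.4, -0.2])   # weights W_s
b = 0.3                          # bias of the neural unit
print(neural_unit(x, W, b))
```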
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN) is also known as a multi-layer neural network.
  • the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L. It should be noted that the input layer does not have the W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
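  • A small illustrative forward pass through a fully connected network with one hidden layer (the layer sizes and random weights are assumptions for the sketch; training would learn the weight matrices):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # weights from the input layer to the hidden layer
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)   # weights from the hidden layer to the output layer

def forward(x):
    h = sigmoid(W1 @ x + b1)       # hidden layer: every unit connected to every input
    return sigmoid(W2 @ h + b2)    # output layer

print(forward(np.array([1.0, 0.5, -0.5])))
```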
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of the image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
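  • An illustrative sketch of a single convolution kernel (shared weights) slid over an image; the explicit loop form is for clarity only, not an efficient implementation:
```python
import numpy as np

def conv2d(image, kernel):
    # the same kernel (shared weights) is applied at every position of the image
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)      # randomly initialized; in a CNN it is learned by training
print(conv2d(image, kernel).shape) # (4, 4)
```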
  • Recurrent neural networks (RNNs) are used to process sequence data.
  • In a traditional neural network, the layers are fully connected, while the nodes within each layer are unconnected.
  • Although such an ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words generally need to be used, because the preceding and following words in a sentence are not independent. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous outputs.
  • RNNs can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • the error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is expanded, the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case.
  • the output of each step depends not only on the network of the current step, but also on the state of the network in the previous steps.
  • This learning algorithm is called the back propagation through time (BPTT) algorithm.
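  • A toy sketch of a recurrent cell in which the same parameters are shared across all time steps and the hidden state carries information from earlier steps forward (sizes and weights are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3))   # input-to-hidden weights, shared across all time steps
W_hh = rng.standard_normal((4, 4))   # hidden-to-hidden weights, shared across all time steps
b = np.zeros(4)

def rnn_step(x_t, h_prev):
    # the current hidden state depends on the current input and the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(4)
for x_t in [rng.standard_normal(3) for _ in range(5)]:   # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h)
```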
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • For example, the pixel value may be 256*Red+100*Green+76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
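  • A small numeric illustration, using the example coefficients quoted above for the packed color value and the common ITU-R BT.601 weights (an assumption, not stated in this application) for a grayscale value:
```python
red, green, blue = 200, 120, 30
packed = 256 * red + 100 * green + 76 * blue        # packed colour value, using the coefficients above
gray = 0.299 * red + 0.587 * green + 0.114 * blue   # ITU-R BT.601 grayscale value
print(packed, round(gray))
```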
  • an embodiment of the present invention provides a system architecture 500.
  • the data collection device 560 is used to collect training data.
  • the training data includes: real images and/or synthetic images of objects to be recognized;
  • the device 520 obtains the target model/rule 501 by training based on the training data maintained in the database 530 .
  • the following will describe in more detail how the training device 520 obtains the target model/rule 501 based on the training data with the second embodiment and the third embodiment.
  • the target model/rule 501 may be the single-object pose detection network described in the embodiments of the present application (the first network for extracting the pose of the object in the image), that is, inputting the image into the target model/rule 501 can obtain the pose of the identifiable object included in the image; or, the target model/rule 501 may be the differentiable renderer described in the embodiments of the present application, and inputting the 3D model and the preset pose of the object into the target model/rule 501 can obtain the composite image of the object in the preset pose.
  • the target model/rule 501 in the embodiment of the present application may specifically be a single-object pose detection network or a differentiable renderer.
  • the single-object pose detection network is obtained by training a single-object basic pose detection network.
  • the training data maintained in the database 530 may not necessarily all come from the collection of the data collection device 560, and may also be received from other devices.
  • the training device 520 may not necessarily train the target model/rule 501 entirely based on the training data maintained by the database 530, and may also obtain training data from the cloud or other places for model training; the above description should not be taken as a limitation on the embodiments of this application.
  • the target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5b. The execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR device, or a vehicle-mounted terminal, and can also be a server or a cloud, etc.
  • the execution device 510 is configured with an I/O interface 512, which is used for data interaction with external devices.
  • the user can input data to the I/O interface 512 through the client device 540; the input data is as described in the embodiments of the present application.
  • the execution device 510 may call data, codes, etc. in the data storage system 550 for corresponding processing, and may also store the data, instructions, etc. obtained from the corresponding processing in the data storage system 550.
  • the I/O interface 512 returns the processing result, such as the pose and the virtual content of the recognizable object in the obtained image, to the client device 540, so as to provide the user with the virtual content displayed according to the pose and realize the experience of combining the virtual and the real.
  • the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 512 .
  • the client device 540 can automatically send the input data to the I/O interface 512. If the user's authorization is required for the client device 540 to automatically send the input data, the user can set the corresponding permission in the client device 540.
  • the user can view the result output by the execution device 510 on the client device 540, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 540 can also be used as a data collection terminal to collect the input data of the I/O interface 512 and the output result of the I/O interface 512 shown in the figure as new sample data and store them in the database 530. Of course, the collection may also not go through the client device 540; instead, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512 shown in the figure as new sample data in the database 530.
  • FIG. 5b is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
  • the I/O interface 512 of the execution device 510 can send the images processed by the execution device (such as the composite images rendered from the object to be recognized in different poses) together with the real images of the object to be recognized input by the user to the database 530 as pairs of training data, so that the training data maintained by the database 530 is more abundant, thereby providing richer training data for the training work of the training device 520.
  • a target model/rule 501 is obtained by training with the training device 520, and the target model/rule 501 may be a single-object pose recognition network (the first network or the second network).
  • the single-object pose recognition networks provided in the embodiments of the present application may all be convolutional neural networks, recurrent neural networks, or others.
  • a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture; a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 600 may include an input layer 610 , a convolutional/pooling layer 620 (where the pooling layer is optional), and a neural network layer 630 .
  • the convolution layer 621 may include many convolution operators.
  • the convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • In essence, the convolution operator can be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions .
  • the features extracted by the initial convolutional layers (e.g., 621) tend to be general low-level features, while the features extracted by the later convolutional layers (e.g., 626) become more and more complex, such as high-level semantic features.
  • It can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
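  • A minimal sketch of non-overlapping max and average pooling over 2x2 sub-regions (the window size and input values are illustrative):
```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    # non-overlapping pooling: each output pixel summarizes one size x size sub-region
    h, w = image.shape[0] // size, image.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(image, mode="max"))   # each value is the maximum of a 2 x 2 sub-region
print(pool2d(image, mode="avg"))   # each value is the average of a 2 x 2 sub-region
```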
  • Neural network layer 630
  • After being processed by the convolutional layer/pooling layer 620, the convolutional neural network 600 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 600 needs to use the neural network layer 630 to generate one output or a set of outputs of the desired number of classes. Therefore, the neural network layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and an output layer 640, and the parameters contained in the multiple hidden layers may be pre-trained based on training data related to the specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, etc.
  • After the multiple hidden layers in the neural network layer 630, the last layer of the entire convolutional neural network 600 is the output layer 640. The output layer 640 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 600 is completed (in FIG. 6, propagation from 610 towards 640 is the forward propagation), the back propagation (in FIG. 6, propagation from 640 towards 610 is the back propagation) starts to update the weight values of the aforementioned layers so as to reduce the prediction error.
  • the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • FIG. 7a is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor (NPU) 70 .
  • the chip can be set in the execution device 510 as shown in FIG. 5 b to complete the calculation work of the calculation module 511 .
  • the chip can also be set in the training device 520 as shown in FIG. 5 b to complete the training work of the training device 520 and output the target model/rule 501 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 6 can be implemented in the chip shown in Figure 7a.
  • the arithmetic circuit 703 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 703 is a general-purpose matrix processor.
  • the operation circuit 703 fetches the data corresponding to the matrix B from the weight memory 702 and buffers it on each PE in the operation circuit.
  • the operation circuit 703 takes the data of the matrix A from the input memory 701 and performs the matrix operation on the matrix B, and stores the partial result or the final result of the matrix in the accumulator 708 (accumulator).
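As an informal illustration of the data flow just described (weight matrix B buffered on the processing elements, matrix A streamed in, partial results summed into the accumulator), the sketch below mimics accumulate-as-you-go matrix multiplication in plain NumPy; it is not the actual NPU implementation:

```python
import numpy as np

def matmul_with_accumulator(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Multiply A (m x k) by B (k x n), summing partial products into an accumulator."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))                       # plays the role of accumulator 708
    for t in range(k):                           # one "step" of streamed data
        acc += np.outer(A[:, t], B[t, :])        # partial result added to the accumulator
    return acc

A = np.random.rand(4, 8)
B = np.random.rand(8, 5)
assert np.allclose(matmul_with_accumulator(A, B), A @ B)
```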
  • the vector calculation unit 707 can further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 707 can be used for network computation of non-convolutional/non-fully connected layers (FC) layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response Normalization (Local Response Normalization), etc.
  • the vector computation unit 707 can store the processed output vectors to the unified buffer 706 .
  • the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, eg, a vector of accumulated values, to generate activation values.
  • vector computation unit 707 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 703, eg, for use in subsequent layers in a neural network.
  • the operations of each layer in the convolutional neural network shown in FIG. 6, as well as the algorithms of the calculation module 511 and the training device 520 in FIG. 5b, can be executed by the operation circuit 703 or the vector computation unit 707.
  • Unified memory 706 is used to store input data and output data.
  • a direct memory access controller (DMAC) 705 transfers input data in the external memory to the input memory 701 and/or the unified memory 706, stores the weight data in the external memory into the weight memory 702, and stores the data in the unified memory 706 into the external memory.
  • An instruction fetch buffer 709 connected to the controller 704 is used to store the instructions used by the controller 704 .
  • the controller 704 is used for invoking the instructions cached in the instruction fetch buffer 709 to control the working process of the operation accelerator.
  • illustratively, the data here may be the input or output data of each layer in the convolutional neural network shown in FIG. 6.
  • the unified memory 706, the input memory 701, the weight memory 702 and the instruction fetch buffer 709 are all on-chip memories, while the external memory is memory outside the NPU; the external memory may be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM) or other readable and writable memory.
  • the operations of each layer in the convolutional neural networks shown in FIG. 5b and FIG. 6 may be jointly completed by the main CPU and the NPU.
  • FIG. 7b illustrates the overall system framework of the solution provided by this application.
  • the framework includes two parts: offline registration and online detection.
  • In the offline registration part, the user inputs the three-dimensional model of the object and real pictures; the basic single-object pose detection network is then trained according to the three-dimensional model and the real pictures to obtain the single-object pose detection network of the object. The single-object pose detection network of the object is used to detect the pose of the object contained in an image, and its detection accuracy is better than that of the basic single-object pose detection network.
  • the obtained single-object pose detection network of the object can further be used to extract the features of the object for incremental object registration: the features of the object and the category of the object (which can be represented by an identifier) are registered to obtain a multi-object category classifier.
  • In the online detection part, multi-feature fusion classification is performed using the multi-object category classifier obtained in the offline registration part, to obtain the classification results (categories) of the identifiable objects included in the input image; the single-object pose detection network obtained in the offline registration part is used to obtain the pose of the recognizable object included in the input image; and, according to the pose of the recognizable object in the input image, the virtual content corresponding to the category of the recognizable object is presented by a display method (e.g., the display method illustrated in FIG. 17).
  • Embodiment 1 provides an object registration method for registering a first object, where the first object is any object to be identified.
  • the registration process of each object is the same.
  • Embodiment 1 of the present application is described by taking the registration of the first object as an example, and the others will not be described in detail.
  • the object registration method provided in the first embodiment of the present application may be executed by the execution device 510 shown in FIG. 5b; the real image in the object registration method may be the input data given by the client device 540 shown in FIG. 5b, and the computing module 511 in the execution device 510 may be used to execute S801 to S803.
  • the object registration method provided in Embodiment 1 of the present application may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used instead of the GPU, which is not limited in this application.
  • the object registration method provided in Embodiment 1 of the present application may include:
  • S801: Acquire a plurality of first input images containing the first object, where the plurality of first input images include real images of the first object and/or a plurality of first composite images of the first object.
  • the first input image may include multiple real images of the first object.
  • the first input image may include the real image of the first object as well as the first composite image.
  • the first composite image is obtained by differentiable rendering, which can reduce the difference between the first composite image and the real image.
  • the first input image may include multiple first composite images of the first object.
  • the multiple first composite images may be obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses.
  • the plurality of first poses are different.
  • the multiple first poses being different means that the multiple first poses correspond to different shooting angles of the camera.
  • the plurality of first poses can be taken on a spherical surface.
  • the multiple first poses may be multiple poses obtained by uniform sampling on the spherical surface shown in FIG. 9 , and the density of the sampling poses on the spherical surface is not limited in this embodiment of the present application, and may be selected according to actual needs.
  • the first pose may also be multiple different poses input by the user, which is not limited.
  • a differentiable renderer (also referred to as a differentiable rendering engine or a differentiable rendering network) may be used to synthesize, according to the three-dimensional model of the first object with texture information input by the user, the 2D images of the 3D model that a camera would obtain in the multiple first poses (the first composite images).
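A rough sketch of how the multiple first poses might be sampled on a spherical surface around the object (as in FIG. 9) is given below; the Fibonacci-sphere sampling and the `differentiable_render` call are illustrative placeholders, not the specific sampling scheme or rendering engine of the embodiments:

```python
import numpy as np

def sample_poses_on_sphere(n_views: int, radius: float = 1.0):
    """Return n_views camera positions roughly evenly spread on a sphere of given radius."""
    positions = []
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))      # Fibonacci-sphere spacing
    for i in range(n_views):
        z = 1.0 - 2.0 * (i + 0.5) / n_views
        r = np.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        positions.append(radius * np.array([r * np.cos(theta), r * np.sin(theta), z]))
    return positions

# Each camera position, looking at the object at the origin, defines one first pose.
# `differentiable_render` is a placeholder for the rendering engine actually used:
# first_composite_images = [differentiable_render(model_3d, pose)
#                           for pose in sample_poses_on_sphere(42)]
```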
  • S801 may specifically include S801a and S801b.
  • the first input image obtained in S801b may include a real image of the first object and a plurality of first composite images, or the first input image obtained in S801b may only include a plurality of first composite images.
  • the feature information may include descriptors of local feature points and descriptors of global features.
  • Descriptors are used to describe features and are a data structure that describes features.
  • the dimension of a descriptor can be multi-dimensional.
  • Descriptors may be of multiple types, such as SIFT, SURF, MSER, etc.
  • the embodiments of the present application do not limit the types of descriptors.
  • Descriptors can be in the form of multidimensional vectors.
  • the descriptor of a local feature point may be a high-dimensional vector representing the local image information of the feature point;
  • the descriptor of the global feature may be a high-dimensional vector representing the image information of the entire image or a larger area.
  • the feature information may also include the location of the local feature point.
  • feature information may be extracted from all regions in each first input image.
  • the region of the first object may be determined in each first input image, and then feature information is extracted in the region of the first object.
  • the area of the first object may be an area of the first input image that includes only the first object, or an area of the first input image that includes the first object, where the area lies within the first input image.
  • each first input image may be input into the first network respectively, and the region of the first object in each first input image may be determined according to the output of the first network.
  • the first network is used for recognizing the pose of the first object in the image, or the first network is used for extracting a black and white image of the first object in the image.
  • determining the region of the first object in the first input image, and then extracting feature information in the region of the first object may include but are not limited to the following two possible implementations:
  • the first network is used to recognize the pose of the first object in the image.
  • a plurality of first input images can be input into the first network respectively for pose recognition, so that the pose of the first object in each first input image is obtained; according to the obtained pose of the first object, the three-dimensional model of the first object is projected onto each first input image respectively to obtain the projection area in each first input image (as the area of the first object); feature information is then extracted from the projection area in each first input image.
  • the first network is used to extract the black-and-white image of the first object in the image.
  • a plurality of first input images can be input into the first network respectively, so that the black-and-white image of the first object in each first input image is obtained; the area indicated by the black-and-white image is taken as the area of the first object, and feature information is then extracted from that area.
  • the pixel positions of the visually significant feature points and the corresponding descriptors can be extracted as the feature information of the local feature points, where each descriptor is a multi-dimensional vector; in the area of the first object in each first input image, the feature information of all visual feature points is extracted, and one multi-dimensional vector is output as the feature information of the global feature.
  • For example, according to the result obtained by inputting the first input image into the first network, it can be determined that the area where the first object is located in the first input image is the area shown by the dashed bounding box in FIG. 11; feature information of visually significant feature points and feature information of all visual feature points can then be extracted in the area shown by the dashed bounding box in FIG. 11.
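For illustration only, the following sketch extracts local feature points and one global feature inside the region of the first object; SIFT (one of the descriptor types mentioned above) and mean-pooling of descriptors are assumptions of this sketch, and it requires an OpenCV build that provides SIFT:

```python
import cv2
import numpy as np

def extract_features(image_bgr: np.ndarray, box: tuple):
    """Extract local feature points and one global feature inside the object region."""
    x, y, w, h = box                                  # region of the first object
    roi = image_bgr[y:y + h, x:x + w]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        return [], None
    # local features: pixel position (shifted back to full-image coordinates) + descriptor
    local = [((kp.pt[0] + x, kp.pt[1] + y), desc)
             for kp, desc in zip(keypoints, descriptors)]
    # global feature: a single multi-dimensional vector summarising the whole region
    global_feature = descriptors.mean(axis=0)
    return local, global_feature
```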
  • the feature information extracted from each first input image is made to correspond to the identifier of the first object, thereby performing the registration of the first object.
  • the feature information of the first object and the identifier of the first object are recorded correspondingly to complete the registration of the first object, and the registration content of multiple objects may be referred to as a multi-object category classifier.
  • the multi-object category classifier records the correspondence between the features and identifiers of different objects; in practical applications, the identifier of an object in an image is determined by extracting the feature information of the object in the image and comparing it with the features recorded in the multi-object category classifier, thereby completing the identification of the object in the image.
  • the identifier of the first object may be used to indicate the category of the first object.
  • the embodiment of the present application does not limit the form of the identification.
  • the feature information of the first object in each of the first input images extracted in S802, and the identifier of the first object may be stored according to a certain structure to complete the registration of the first object.
  • the feature information and identification of multiple objects stored in the structure are called multi-object category classifiers for efficient search.
  • In this solution, the features of objects in images are extracted for object registration, so the registration time is short and the recognition performance of other registered objects is not affected; even if many new objects are registered, the registration time does not increase too much, and the detection performance of already-registered objects is still guaranteed.
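The structure that stores feature information together with object identifiers is not fixed by the embodiments; as a hypothetical illustration, a multi-object category classifier could be as simple as a dictionary keyed by identifier with brute-force nearest-descriptor search:

```python
import numpy as np

class MultiObjectClassifier:
    """Maps object identifiers to registered descriptors; answers queries by nearest match."""

    def __init__(self):
        self.registry = {}                                   # identifier -> list of descriptors

    def register(self, identifier, descriptors):
        # Incremental registration: entries of other objects are left untouched.
        self.registry.setdefault(identifier, []).extend(descriptors)

    def classify(self, descriptor, threshold=0.7):
        best_id, best_dist = None, float("inf")
        for identifier, descs in self.registry.items():
            dist = min(np.linalg.norm(descriptor - d) for d in descs)
            if dist < best_dist:
                best_id, best_dist = identifier, dist
        return best_id if best_dist <= threshold else None
```

In such a sketch, registering a new object is a single `register` call, which is consistent with the point above that adding objects does not disturb already-registered ones.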
  • the above-mentioned method for object registration can obtain the first network offline and complete the registration of the object.
  • In online detection, the pose of an identifiable object can be obtained through the first network, the registered identifier of the object can be determined according to the features of the identifiable object, and the virtual content corresponding to the identifier of the identifiable object is then displayed according to the pose of the identifiable object, thereby completing the presentation of the virtual-real fusion effect.
  • the second embodiment of the present application provides an optimization method, which can be used in combination with the object registration method shown in FIG. 8 and the method for training an object pose detection network provided in the third embodiment of the present application, or can be used independently.
  • the usage scenarios thereof are not limited in the embodiments of the present application.
  • the optimization method provided in the second embodiment of the present application is shown in Figure 12, including:
  • the first network is used to identify the pose of the first object in the image.
  • a differentiable renderer is used in S1202, and under each second pose obtained in S1201, N second composite images are obtained by differentiable rendering.
  • the real image from which a second pose is acquired corresponds to the second composite image rendered in that second pose.
  • the area images at the same position may refer to area images with the same coordinates based on a certain point of the first object in the second composite image.
  • the area image at the same position may refer to a projection area image in which the black and white image of the second synthetic image is projected to the real image based on a certain point of the first object in the second synthetic image.
  • the black-and-white image of the first object in the second composite image may be projected into the real image corresponding to the second composite image, and the projection area is used as the foreground image of the real image.
  • a black-and-white image of the first object in the second composite image may be obtained first, the black-and-white image being a binarized image; the black-and-white image of the first object in the second composite image is multiplied by the real image corresponding to that second composite image, and the area retained after the multiplication is used as the foreground image of the real image.
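A minimal sketch of the mask-multiplication step described above, assuming the black-and-white image is a {0, 1} mask aligned pixel-by-pixel with the real image:

```python
import numpy as np

def foreground_image(real_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """real_image: H x W x 3; mask: H x W with values in {0, 1}. Pixels outside the mask become 0."""
    return real_image * mask[..., None]
```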
  • S1204 Construct a first loss function according to the first difference information between the foreground images of the N real images and their corresponding second composite images.
  • the first difference information is used to indicate the difference between the foreground image and its corresponding second composite image.
  • the first difference information may include one or more items of feature map difference, pixel color difference, and difference between extracted feature descriptors.
  • the difference between the feature maps of the two images can be the perceptual loss, that is, the difference between the feature maps encoded by the deep learning network.
  • the deep learning network can use a pre-trained visual geometry group (VGG)16 network.
  • the feature map encoded by the VGG16 network is a C×H×W tensor, where C is the number of channels of the feature map and H and W are the length and width of the feature map; the difference between feature maps is the distance between the two tensors, which specifically may be the L1 norm, the L2 norm or another norm of the difference between the tensors.
  • the pixel color difference may be a difference computed over the numerical values of the pixel colors, specifically the L2 norm of the difference between the pixels of the two images.
  • the difference between the extracted feature descriptors refers to the distance between the vectors representing the descriptors.
  • the distance may include, but is not limited to, Euclidean distance, Hamming distance, and the like.
  • the descriptor can be an N-dimensional floating-point vector, in which case the distance is the Euclidean distance between the two N-dimensional vectors, that is, the L2 norm of the difference between the two N-dimensional vectors; or the descriptor can be an M-dimensional binary vector, in which case the distance is the L1 norm of the difference between the two vectors.
  • the L2 norm referred to above is the square root of the sum of the squares of the elements of a vector, and the L1 norm is the sum of the absolute values of the elements of a vector.
  • the first difference information when constructing the first loss function in S1204 may be the calculated value of the difference information between the multiple foreground images and their corresponding second composite images.
  • the second composite image corresponding to the foreground image refers to the real image to which the foreground image belongs, and the second composite image is obtained by differentiable rendering through the second pose obtained by the first network.
  • the first network is used to detect the 6DOF pose of the first object in each real image. Then, according to the initial three-dimensional model input by the user and the corresponding texture and illumination information, the differentiable renderer obtains, by differentiable rendering based on the detected 6DOF pose of the first object in each real image, the second composite images R (the same in number as the real images I).
  • L p is the calculated value of the difference between the feature maps of each foreground image and its corresponding second composite image
  • L i is the calculated value of the pixel color difference between each foreground image and its corresponding second composite image
  • L f is the calculated value of the difference between the feature descriptors of each foreground image and its corresponding second composite image.
  • the difference between the feature descriptors of the two images can be the calculated value of the differences between the feature descriptors at the same positions in the two images.
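For illustration, the sketch below combines the three kinds of first difference information into one loss under stated assumptions: the terms are weighted and summed (the weights w_p, w_i, w_f are placeholders, not values given by the patent), the perceptual term uses VGG16 feature maps as mentioned above (input normalization is omitted), the pixel term is an L2 difference, and the descriptor term is a mean descriptor distance:

```python
import torch
from torchvision.models import vgg16

vgg_features = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)            # VGG16 is only used as a fixed feature extractor

def perceptual_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: N x 3 x H x W image batches; each feature map is a C x H' x W' tensor
    return torch.norm(vgg_features(a) - vgg_features(b), p=2)

def first_loss(foregrounds, composites, desc_fg, desc_syn, w_p=1.0, w_i=1.0, w_f=1.0):
    l_p = perceptual_loss(foregrounds, composites)                # feature-map difference
    l_i = torch.norm(foregrounds - composites, p=2)               # pixel colour difference
    l_f = torch.norm(desc_fg - desc_syn, p=2, dim=-1).mean()      # descriptor distance
    return w_p * l_p + w_i * l_i + w_f * l_f
```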
  • the texture, lighting or other parameters in the differentiable renderer can be updated according to the first loss function constructed in S1204, so as to minimize the difference between the synthetic image output by the differentiable renderer and the real image.
  • In S1205, the textures, lighting or other parameters in the differentiable renderer can be adjusted and updated according to preset rules, and the optimization process shown in FIG. 12 is then repeated until the difference between the synthetic image output by the differentiable renderer and the real image is minimal.
  • the preset rule may be that, when the preconfigured first loss function satisfies different conditions, correspondingly different parameters are adjusted by different values; after the first loss function is determined in S1204, the corresponding parameters are adjusted by the corresponding values according to the condition that is satisfied.
  • the optimization process shown in FIG. 12 can also be represented as shown in FIG. 13: the real image of the object is input into the first network, the second pose output by the first network is input into the differentiable renderer, and the differentiable renderer outputs, by differentiable rendering based on the three-dimensional model of the object, the second composite image and the black-and-white image of the object in the second composite image; a first loss function is constructed according to the second composite image and the real image, and the differentiable renderer is updated using the first loss function.
  • By updating the differentiable renderer (also called a differentiable rendering engine or a differentiable rendering network) in this way, the rendering realism of the differentiable renderer can be improved, and the difference between the composite image obtained by differentiable rendering and the real image can be reduced.
  • the optimization method provided in the second embodiment of the present application may be specifically performed by the training device 520 as shown in FIG. 5b, and the second composite image in the optimization method may be the training data maintained in the database 530 as shown in FIG. 5b.
  • some or all of S1201 to S1203 in the optimization method provided in the second embodiment may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520; that is, the training data received or acquired from the database 530 is first preprocessed, as described in S1201 to S1203, to obtain the foreground images and the second composite images, which serve as the input of the training device 520, and the training device 520 then executes S1204 to S1205.
  • the optimization method provided in the second embodiment of the present application may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used instead of the GPU, which is not limited in this application.
  • the optimization method shown in FIG. 12 may be used to optimize the differentiable renderer to reduce the difference between the composite image obtained by differentiable rendering and the real image.
  • Figure 14 illustrates another object registration method, which may include:
  • S1401 may refer to the process illustrated in FIG. 12 , which will not be repeated here.
  • S1403 may refer to S801b, which will not be repeated here.
  • S1404 refers to the aforementioned S802, which will not be repeated here.
  • S1405 refers to the aforementioned S803, which will not be repeated here.
  • the third embodiment of the present application provides a method for training an object pose detection network, which can train the aforementioned first network to improve the accuracy of the pose of the first object output by the first network.
  • the method for training the object pose detection network can be used in combination with the object registration method shown in FIG. 8 and the optimization method provided in the second embodiment of the present application, or can be used independently; the usage scenarios of the method are not limited in the embodiments of the present application.
  • the third embodiment of the present application provides a method for training an object pose detection network, which further improves the prediction ability (generalization) of the object pose detection network for the object pose in real images by using real images and composite images of the first object.
  • FIG. 15 The method for training an object pose detection network provided in Embodiment 3 of the present application is shown in FIG. 15 , including:
  • the plurality of second input images include real images of the first object and/or a plurality of third composite images of the first object.
  • the second input image may include multiple real images of the first object.
  • the second input image may include multiple real images of the first object and multiple third composite images of the first object.
  • the multiple third composite images are obtained by rendering the three-dimensional model of the first object in multiple third poses. The multiple third poses are different.
  • the plurality of third poses can be taken on a spherical surface.
  • the multiple third poses may be multiple poses obtained by uniform sampling on the spherical surface as shown in FIG. 9 , and the density of the sampling poses on the spherical surface is not limited in this embodiment of the present application, and may be selected according to actual needs.
  • the second input image may include multiple third composite images of the first object.
  • According to the three-dimensional model with texture information of the first object input by the user, 2D images of the first object at multiple angles can be synthesized in the multiple third poses by conventional rendering, differentiable rendering, or other rendering methods; these images are recorded as the third composite images.
  • the second network can be used to identify the pose of the first object in the image.
  • the second network may be a basic single-object pose detection network.
  • the basic single-object pose detection network is a general-purpose neural network that only recognizes a single object in an image, and is the initial model of the configuration.
  • the input of the basic single-object pose detection network is an image
  • the output is the pose of a single object recognizable by the network in the image.
  • the output of the basic single-object pose detection network may also include a black-and-white image (mask) of a single object recognizable by the network.
  • the mask of an object refers to the black and white version of the area in the image that contains only that object.
  • the second network may be a neural network after training the basic single-object pose detection network according to the image of the first object.
  • Since each third composite image is generated by rendering under a known third pose, the pose of the first object relative to the camera in each third composite image is known. Each third composite image can be input to the basic single-object pose detection network to obtain the predicted pose of the first object in that image; then, by comparing the predicted pose with the actual pose of the first object in each third composite image, the loss of the basic single-object pose detection network in the iterative process is calculated and a loss function is constructed; according to the loss function, the basic single-object pose detection network is trained until convergence and used as the second network.
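A minimal sketch of the supervised pre-training step just described (predicted pose compared against the pose known from rendering); the flat pose representation, the MSE loss, and the optimizer settings are simplifying assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def train_second_network(network, images, gt_poses, epochs=10, lr=1e-4):
    """images: list of C x H x W tensors; gt_poses: list of pose vectors known from rendering."""
    optimiser = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for image, gt_pose in zip(images, gt_poses):
            pred_pose = network(image.unsqueeze(0))               # predicted pose
            loss = F.mse_loss(pred_pose, gt_pose.unsqueeze(0))    # predicted vs. known pose
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return network          # used as the second network once the loss has converged
```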
  • the training process of the neural network will not be described in detail in this application.
  • the second network may be the currently used first network.
  • Differentiable rendering is used to obtain a fourth composite image of the first object in each fourth pose; the second input image from which a fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose.
  • the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
  • the second difference information includes one or more of the following: the difference between the black-and-white image of the first object in the fourth composite image and that in its corresponding second input image, measured by their intersection over union (IOU); the difference between the fourth pose and the pose obtained by detecting the fourth composite image with the second network; and the similarity of the region images at the same position of the first object in the fourth composite image and its corresponding second input image.
  • the difference between two black-and-white images in terms of IOU is determined from the ratio of the area of the intersection of the two masks to the area of their union.
  • the difference between two poses refers to the difference between the mathematical expressions of the two poses; specifically, it is the sum of the translation difference and the rotation difference. Given two poses (R1, T1) and (R2, T2), the translation difference is the Euclidean distance between the two vectors T1 and T2, and the rotation difference is arccos((Tr(R1^T R2) - 1) / 2), where Tr represents the trace of the matrix, arccos is the inverse cosine function, and R1^T is the transpose of R1.
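The pose difference defined above can be transcribed directly; the clipping of the cosine value is only a numerical safeguard added in this sketch:

```python
import numpy as np

def pose_difference(R1, T1, R2, T2):
    trans_diff = np.linalg.norm(np.asarray(T1) - np.asarray(T2))   # Euclidean distance of T1, T2
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    rot_diff = np.arccos(np.clip(cos_angle, -1.0, 1.0))            # clip guards rounding noise
    return trans_diff + rot_diff                                   # sum of the two differences
```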
  • the similarity of the two images may be the difference in pixel color or the difference in feature maps of the two images.
  • the region image of the first object in the second input image can be obtained by cropping the second input image according to the black and white image of the first object in the corresponding fourth composite image.
  • the black-and-white image of the first object in the fourth composite image can be a binarized image; the binarized image is multiplied by the corresponding second input image, and the region image of the first object in the second input image is then obtained by cropping.
  • If the number of second input images is M (M is greater than or equal to 2), there are likewise M pairs of a fourth composite image and its corresponding second input image from which second difference information is computed; each Li below may be the calculated value, over these image pairs, of one item of the second difference information.
  • In other words, the differences obtained by the second network over different second input images are calculated and used to construct the second loss function, so that by optimizing these differences the generalization of the second network on real images can be improved.
  • the second loss function Loss2 satisfies the following expression: Loss2 = λ1·L1 + λ2·L2 + λ3·L3, where:
  • L1 is the calculated value of the IOU-based difference between the black-and-white image of the first object in the second input image output in S1502 and the black-and-white image of the first object in the fourth composite image obtained in S1503; L2 is the calculated value of the difference between the pose of the first object in the second input image output in S1502 and the pose of the first object obtained by using the second network to detect the fourth composite image; L3 is the calculated value of the visual perception similarity between the fourth composite image and the part of its corresponding second input image located in the same area as the first object in the fourth composite image.
  • the preset weights λi used when constructing the second loss function may be configured according to actual requirements, which is not limited in this embodiment of the present application.
  • After this training, the difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in that image is smaller than the difference between the pose of the first object recognized by the second network and the real pose of the first object.
  • the process of the method for training the object pose detection network shown in FIG. 15 can also be represented as shown in FIG. 16: the second input images, composed of unlabeled real images of the object and third composite images, are input to the second network, and the fourth poses are output.
  • the fourth pose and the 3D model of the object are input to the differentiable renderer, and the fourth composite image and the corresponding mask are output.
  • a second loss function is constructed, and the second loss function is used to train the second network to obtain the first network.
  • In this solution, a single-object pose detection network is used, so that even if a new recognizable object is added, the training time is short and the recognition effect of other recognizable objects is not affected;
  • moreover, a loss function is constructed from the differences between real images and differentiably rendered composite images in multiple poses, which improves the accuracy of the single-object pose detection network.
  • the method for training the object pose detection network provided in the third embodiment of the present application may be specifically performed by the training device 520 as shown in FIG. 5b, and the second composite image in the optimization method may be maintained in the database 530 as shown in FIG. 5b. training data.
  • S1501 to S1503 in the method for training an object pose detection network provided in Embodiment 3 may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520; that is, the training data received or acquired from the database 530 is first preprocessed, as described in S1501 to S1503, to obtain the second input images and the fourth composite images, which serve as the input of the training device 520, and the training device 520 then performs S1504 to S1505.
  • the method for training an object pose detection network provided in the third embodiment of the present application may be processed by the CPU, or may be jointly processed by the CPU and the GPU, or other processors suitable for neural network computing may be used instead of the GPU, which is not limited in this application.
  • the fourth embodiment of the present application further provides a display method, which is applied to a terminal device.
  • the display method may be used in combination with the aforementioned object registration method, optimization method, and method for training an object pose detection network, or may be used alone, which is not specifically limited in this embodiment of the present application.
  • the display method provided by the embodiment of the present application may include:
  • a terminal device acquires a first image.
  • the first image refers to any image acquired by the terminal device.
  • the terminal device may acquire an image in the viewfinder by using a camera to capture the image as the first image.
  • the terminal device may be a mobile phone
  • the user may start an APP in the mobile phone, and input an image acquisition instruction on the APP interface, and the mobile phone starts a camera to capture the first image.
  • the terminal device may be smart glasses. After wearing the smart glasses, the user captures an image in the field of view through a viewfinder as the first image.
  • the terminal device may load a locally stored image as the first image.
  • the manner in which the terminal acquires the first image in S1701 is not limited.
  • the terminal device determines whether a recognizable object is included in the first image.
  • the terminal determines whether the first image contains an identifiable object according to a network for identifying objects configured offline.
  • the network for identifying objects may be the current pose detection network, or may be the first network or the second network described in the foregoing embodiments.
  • the terminal device may determine whether the first image contains an identifiable object according to the pose and the network currently used to identify the object.
  • the terminal device may input the first image into a network for recognizing poses and identifiers of objects; if the network outputs the poses and identifiers of one or more objects, it is determined that the first image includes recognizable objects; otherwise, no identifiable object is included in the first image.
  • the terminal device in S1702 may determine whether the first image contains a recognizable object according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by registering the object.
  • the terminal device may extract the features of the first image and match them against the features in the multi-object category classifier described in the foregoing embodiments of the present application; if matching features exist, it is determined that the first image includes recognizable objects; otherwise, no identifiable object is included in the first image.
  • Specifically, feature information in the first image can be extracted, the feature information being used to indicate the identifiable features in the first image; it is then checked whether the feature library contains feature information whose matching distance to the extracted feature information satisfies a preset condition. If such feature information exists in the feature library, it is determined that one or more identifiable objects are included in the first image; if no feature information in the feature library has a matching distance to the extracted feature information that satisfies the preset condition, it is determined that the first image does not include any identifiable object.
  • the feature library stores one or more feature information of different objects.
  • the preset condition may include less than or equal to a preset threshold.
  • the value of the preset threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.
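As an illustrative sketch of this identifiability check (the descriptor arrays, distance metric, and threshold value are assumptions of the sketch, not fixed by the embodiments):

```python
import numpy as np

def contains_identifiable_object(image_descriptors, feature_library, threshold=0.7):
    """feature_library: {object_id: (M x D) array of registered descriptors}."""
    for desc in image_descriptors:
        for obj_id, stored in feature_library.items():
            if np.linalg.norm(stored - desc, axis=1).min() <= threshold:   # preset condition
                return True, obj_id
    return False, None
```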
  • If the terminal device determines in S1702 that the first image includes an identifiable object, S1703 and S1704 are performed; if the terminal device determines in S1702 that no identifiable object is included in the first image, S1705 is executed.
  • the first information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.
  • the content of the first information may be, for example, "An identifiable object has been detected, please keep the current angle"; the content of the first information may be superimposed and displayed on the first image in the form of text, or the content of the first information may be played through the speaker of the terminal device.
  • S1704 Obtain the pose of each recognizable object in the first image through the first network corresponding to each recognizable object included in the first image; and display the corresponding pose of each recognizable object according to the pose of each recognizable object virtual content.
  • the virtual content corresponding to the object may be configured according to actual requirements, which is not limited in this embodiment of the present application.
  • For example, the virtual content can be the introduction information of an exhibit, the attribute information of a product, or other content.
  • S1705 Output second information, where the second information is used to prompt that no identifiable object is detected, and adjust the viewing angle to acquire the second image.
  • the second image is different from the first image.
  • the second information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.
  • the content of the second information may be, for example, "No identifiable object is detected, please adjust the angle of the device"; the content of the second information may be superimposed and displayed on the first image in the form of text, or the content of the second information may be played through the speaker of the terminal device.
  • In this solution, a prompt about whether the image includes an identifiable object is output to the user, so that the user can intuitively learn whether the image includes an identifiable object, which improves the user experience.
  • In a possible implementation, the terminal device determines whether the first image contains recognizable objects according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by registering objects; as shown in FIG. 18, the process may include:
  • the terminal device extracts local feature points in the first image.
  • the terminal device acquires one or more first local feature points in the first image.
  • the matching distance between the descriptor of the first local feature point and the descriptor of the local feature point in the feature library is less than or equal to the first threshold.
  • the value of the first threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.
  • the matching distance between descriptors refers to the distance between vectors representing the descriptors. For example, Euclidean distance, Hamming distance, etc.
  • descriptors of local feature points of different objects are stored in the feature library.
  • the feature library may be the multi-object category classifier described in the foregoing embodiments.
  • the terminal device determines one or more ROIs in the first image according to the first local feature point.
  • one ROI includes one object.
  • the terminal device can combine the color information and depth information of the image to assign the first local feature points to different objects, and determine each area in which the first local feature points of the same object are densely distributed as an ROI, thereby obtaining one or more ROIs.
  • the terminal device extracts global features in each ROI.
  • the terminal device determines whether the first image contains an identifiable object according to the global feature in each ROI.
  • If a first global feature exists among the global features of the ROIs, the first image includes the identifiable object corresponding to the first global feature; the first global feature is a global feature whose descriptor has a matching distance to the descriptor of a global feature in the feature library that is less than or equal to the second threshold (descriptors of global features of different objects are also stored in the feature library). If no first global feature exists among the global features of the ROIs, it is determined that the first image does not include any identifiable object.
  • the value of the second threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.
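The FIG. 18 flow can be sketched end-to-end as below; the single bounding-box ROI, the mean-descriptor global feature, and the threshold values are deliberately crude placeholders standing in for the per-object clustering and global-feature extraction described above:

```python
import numpy as np

def detect_identifiable_objects(keypoints, descriptors, feature_library, t1=0.7, t2=0.7):
    """keypoints: N x 2 pixel positions; descriptors: N x D local descriptors;
    feature_library: {obj_id: {'local': M x D array, 'global': D-dim vector}}."""
    # keep the "first local feature points": descriptors within t1 of some library descriptor
    kept = [(kp, d) for kp, d in zip(keypoints, descriptors)
            if min(np.linalg.norm(lib['local'] - d, axis=1).min()
                   for lib in feature_library.values()) <= t1]
    if not kept:
        return []                                      # no identifiable object
    # one crude ROI = bounding box of the kept points (a real system would split them into
    # per-object clusters using colour and depth information)
    pts = np.array([kp for kp, _ in kept])
    roi = (pts.min(axis=0), pts.max(axis=0))
    # global feature of the ROI (here simply the mean descriptor), matched against the
    # registered global features of each object
    g = np.mean([d for _, d in kept], axis=0)
    obj_id, dist = min(((oid, np.linalg.norm(lib['global'] - g))
                        for oid, lib in feature_library.items()), key=lambda m: m[1])
    return [(roi, obj_id)] if dist <= t2 else []
```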
  • a merchant provides an APP, which is used for users to capture images through mobile phone framing, identify the products contained in the images, and display the product introduction corresponding to the products.
  • the three-dimensional objects of each commodity are registered offline according to the solutions provided in the embodiments of the present application.
  • the scene image 1 captured by the APP in the mobile phone through framing is shown in the mobile phone interface of FIG. 19.
  • the APP uses the method shown in FIG. 18 to extract the features in scene image 1, identifies that scene image 1 contains an identifiable object (a smart speaker), and outputs the mobile phone interface shown in FIG. 20a: "An identifiable object has been detected, please keep the current angle."
  • the APP then calls the second network corresponding to the smart speaker to obtain the 6DOF pose of the smart speaker in scene image 1, and displays, according to the 6DOF pose, the virtual content corresponding to the smart speaker: "Hello everyone, I'm Xiaoyi, a Huawei smart speaker that can listen to songs, tell stories, tell jokes, and knows countless encyclopedic facts."
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiment of the present invention is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • FIG. 21 illustrates an object registration apparatus 210 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned first embodiment.
  • the object registration apparatus 210 may include: a first acquisition unit 2101 , an extraction unit 2102 and a registration unit 2103 .
  • the first acquiring unit 2101 is configured to execute the process S801 in FIG. 8 , or S801a and S801b in FIG. 10 , or S1402 and S1403 in FIG. 14 ;
  • the extracting unit 2102 is configured to execute the process S802 in FIG. 8 or FIG. 10, or S1404 in FIG. 14;
  • the registration unit 2103 is configured to execute the process S803 in FIG. 8 or FIG. 10 , or S1405 in FIG. 14 .
  • FIG. 22 illustrates an optimization apparatus provided by an embodiment of the present application, which is used to implement the functions in the foregoing second embodiment.
  • the optimization apparatus 220 may include: a processing unit 2201 , a differentiable renderer 2202 , a screenshot unit 2203 , a construction unit 2204 and an update unit 2205 .
  • the processing unit 2201 is used to execute the process S1201 in FIG. 12; the differentiable renderer 2202 is used to execute the process S1202 in FIG. 12; the screenshot unit 2203 is used to execute the process S1203 in FIG. 12; the construction unit 2204 is used to execute the process S1204 in FIG. 12; and the updating unit 2205 is used to execute the process S1205 in FIG. 12.
  • all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.
  • FIG. 23 illustrates an apparatus 230 for training an object pose detection network provided by an embodiment of the present application, which is used to implement the functions in the third embodiment.
  • the apparatus 230 for training an object pose detection network may include: a second acquisition unit 2301 , a processing unit 2302 , a differentiable renderer 2303 , a construction unit 2304 and an update unit 2305 .
  • the second acquisition unit 2301 is used to execute the process S1501 in FIG. 15;
  • the processing unit 2302 is used to execute the process S1502 in FIG. 15;
  • the differentiable renderer 2303 is used to execute the process S1503 in FIG. 15; the construction unit 2304 is used to execute the process S1504 in FIG. 15; and the update unit 2305 is used to execute the process S1505 in FIG. 15.
  • FIG. 24 illustrates a display device 240 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned fourth embodiment.
  • the display device 240 may include: a first acquisition unit 2401 , an output unit 1402 and a processing unit 1403 .
  • the first acquisition unit 2401 is used for executing the process S1701 in FIG. 17 ;
  • the output unit 1402 is used for executing the process S1703 or S1705 in FIG. 17 ;
  • the processing unit 1403 is used for executing the process S1704 in FIG. 17 .
  • all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.
  • FIG. 25 provides a schematic diagram of the hardware structure of an apparatus 250 .
  • the apparatus 250 may be an object registration apparatus provided in this embodiment of the present application, and is configured to execute the object registration method provided in the first embodiment of this application.
  • the apparatus 250 may be the optimization apparatus provided in this embodiment of the present application, and is configured to execute the optimization method provided in the second embodiment of the present application.
  • the apparatus 250 may be an apparatus for training an object pose detection network provided in this embodiment of the present application, and configured to execute the method for training an object pose detection network provided in the third embodiment of the present application.
  • the device 250 may be the display device provided in this embodiment of the present application, and configured to execute the display method provided by the fourth embodiment of the present application.
  • the apparatus 250 may include a memory 2501 , a processor 2502 , a communication interface 2503 and a bus 2504 .
  • the memory 2501 , the processor 2502 , and the communication interface 2503 are connected to each other through the bus 2504 for communication.
  • the memory 2501 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 2501 can store programs, and when the programs stored in the memory 2501 are executed by the processor 2502, the processor 2502 and the communication interface 2503 are used to execute various steps of the methods provided in Embodiments 1 to 4 of this application.
  • the processor 2502 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute relevant programs so as to implement the functions required to be performed by the units in the object registration apparatus, the optimization apparatus, the apparatus for training the object pose detection network, and the display device of the embodiments of the present application, or to execute the method provided in any one of Embodiments 1 to 4 of the present application.
  • the processor 2502 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the method for training an object pose detection network of the present application may be completed by an integrated logic circuit of hardware in the processor 2502 or instructions in the form of software.
  • the above-mentioned processor 2502 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 2501; the processor 2502 reads the information in the memory 2501 and, in combination with its hardware, completes the functions required to be performed by the units included in the object registration apparatus, the optimization apparatus, the apparatus for training the object pose detection network, and the display device of the embodiments of the present application, or executes the method provided in any one of Embodiments 1 to 4 of the present application.
  • Bus 2504 may include a pathway for communicating information between various components of device 250 (eg, memory 2501, processor 2502, communication interface 2503).
  • the first acquisition unit 2101 , the extraction unit 2102 and the registration unit 2103 in the object registration apparatus 210 are equivalent to the processor 2502 in the apparatus 250 .
  • the processing unit 2201 , the differentiable renderer 2202 , the screenshot unit 2203 , the constructing unit 2204 and the updating unit 2205 in the optimization device 220 are equivalent to the processor 2502 in the device 250 .
  • the second acquiring unit 2301 , the processing unit 2302 , the differentiable renderer 2303 , the constructing unit 2304 and the updating unit 2305 in the apparatus 230 for training the object pose detection network are equivalent to the processor 2502 in the apparatus 250 .
  • the first acquiring unit 2401 and the processing unit 1403 in the display device 240 are equivalent to the processor 2502 in the device 250
  • the output unit 1402 is equivalent to the communication interface 2503 in the device 250 .
  • the apparatus 250 is equivalent to the training device 520 or the execution device 510 in FIG. 5b.
  • the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • the above-mentioned embodiments it may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • software it can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer programs or instructions.
  • the processes or functions described in the embodiments of the present application are executed in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program or instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server, data center, or the like that integrates one or more available media.
  • the usable medium can be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; it can also be an optical medium, such as a digital video disc (DVD); it can also be a semiconductor medium, such as a solid state drive (SSD).
  • “at least one” means one or more, and “plurality” means two or more.
  • "and/or" describes the relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, that A and B exist at the same time, or that B exists alone, where A and B may be singular or plural.
  • the character "/" generally indicates that the associated objects are in an "or" relationship; in the formulas of this application, the character "/" indicates that the associated objects are in a "division" relationship.

Abstract

An object registration method and apparatus, which relate to the field of computer vision and by means of which the problem of how to improve the accuracy of a pose of a terminal device is solved. The solution comprises: acquiring a plurality of first input images that contain a first object, wherein the plurality of first input images comprise a real image of the first object and/or a plurality of first composite images of the first object, the plurality of first composite images are obtained by means of performing differentiable rendering by using a three-dimensional model of the first object under a plurality of first poses, and the plurality of first poses are different; respectively extracting feature information of the plurality of first input images, wherein the feature information is used for indicating features of the first object in the first input images where the first object is located; and associating the feature information extracted from each of the first input images with an identifier of the first object, so as to register the first object.

Description

一种对象注册方法及装置Object registration method and device
本申请要求于2020年12月29日提交国家知识产权局、申请号为202011607387.9、发明名称为“一种对象注册方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application with application number 202011607387.9, entitled "An object registration method and apparatus", filed with the State Intellectual Property Office on December 29, 2020, the entire contents of which are incorporated herein by reference.
技术领域technical field
本申请实施例涉及计算机视觉领域,尤其涉及一种对象注册方法及装置。The embodiments of the present application relate to the field of computer vision, and in particular, to an object registration method and apparatus.
背景技术Background technique
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成象系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。Computer vision is an integral part of various intelligent/autonomous systems in various application fields, such as manufacturing, inspection, document analysis, medical diagnosis, and military. What we need is the knowledge of the data and information of the subject being photographed. To put it figuratively, it is to install eyes (cameras/camcorders) and brains (algorithms) on the computer to identify, track and measure the target instead of the human eye, so that the computer can perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision is to use various imaging systems to replace the visual organ to obtain input information, and then use the computer to replace the brain to complete the processing and interpretation of these input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision like humans, and have the ability to adapt to the environment autonomously.
对象(人或物)的位姿检测与跟踪是计算机视觉领域的关键技术,能够赋予机器感知真实环境中物体的三维空间位置及语义的能力,在机器人、自动驾驶、增强现实等领域有着广泛的用途。The pose detection and tracking of objects (people or objects) is a key technology in the field of computer vision, which can endow machines with the ability to perceive the three-dimensional spatial position and semantics of objects in the real environment. use.
在实际应用中,通常是先构建多对象位姿估计网络,用于从输入的图像中识别可识别的对象的位姿;然后利用机器学习的方法,采用用户提供的多个待识别对象的三维(3 dimensions,3D)网络模型,训练该多对象位姿估计网络,将待识别对象注册到多对象位姿估计网络中,实现该多对象位姿估计网络对已经注册对象可识别。在线上应用时,图片输入多对象位姿估计网络,由多对象位姿估计网络识别图片中的对象的位姿。In practical applications, a multi-object pose estimation network is usually constructed first to identify the poses of recognizable objects from input images; then, using machine learning methods and the three-dimensional (3 dimensions, 3D) network models of multiple objects to be recognized provided by the user, the multi-object pose estimation network is trained, and the objects to be recognized are registered in the multi-object pose estimation network, so that the multi-object pose estimation network can recognize the registered objects. In online application, an image is input to the multi-object pose estimation network, and the multi-object pose estimation network recognizes the pose of the object in the image.
当需在多对象位姿估计网络中新增待识别对象注册时,由用户提供新增的待识别对象的三维模型,采用新增的三维模型和原有的三维模型一起对多对象位姿估计网络重新训练,导致训练时间的线性增长,并影响多对象位姿估计网络对已经训练好的可识别的对象的识别效果,导致检测准确率、成功率下降。When a new object to be recognized needs to be registered in the multi-object pose estimation network, the user provides the three-dimensional model of the newly added object to be recognized, and the newly added three-dimensional model and the original three-dimensional models are used together to retrain the multi-object pose estimation network, which leads to a linear increase in training time, affects the recognition performance of the multi-object pose estimation network on the already trained recognizable objects, and results in a decrease in detection accuracy and success rate.
发明内容SUMMARY OF THE INVENTION
本申请提供的对象注册方法及装置,解决了如何提高终端设备的位姿的准确度的问题。The object registration method and device provided by the present application solve the problem of how to improve the accuracy of the pose of the terminal device.
为达到上述目的,本申请采用如下技术方案:To achieve the above object, the application adopts the following technical solutions:
第一方面,提供一种对象注册方法,该方法可以包括:获取包括第一对象的多个第一输入图像,该多个第一输入图像包括第一对象的真实图像和/或第一对象的多个第一合成图像;多个第一合成图像是由第一对象的三维模型在多个第一位姿下可微渲染得到的;多个第一位姿不同;分别提取多个第一输入图像的特征信息,特征信息用于指示第一对象在其所处的第一输入图像中的特征;将每个第一输入图像中提取的特征信息,与第一对象的标识相对应,进行第一对象的注册。In a first aspect, an object registration method is provided. The method may include: acquiring a plurality of first input images including a first object, where the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are obtained by differentiable rendering of a three-dimensional model of the first object under a plurality of first poses; the plurality of first poses are different; respectively extracting feature information of the plurality of first input images, where the feature information is used to indicate features of the first object in the first input image where it is located; and registering the first object by associating the feature information extracted from each first input image with an identifier of the first object.
通过本申请提供的对象注册方法,提取图像中对象的特征信息进行对象注册,注册时间短,且不影响其他已注册对象的识别性能,即使新增很多对象注册,也不会过大的增加注册时间,也保证了已注册对象的检测性能。Through the object registration method provided by this application, the feature information of the object in the image is extracted for object registration, the registration time is short, and the recognition performance of other registered objects is not affected, even if many new objects are registered, the registration will not increase too much. time, also guarantees the detection performance of registered objects.
在一种可能的实现方式中,上述特征信息可以包括局部特征点的描述子,以及全局特征的描述子。In a possible implementation manner, the above feature information may include descriptors of local feature points and descriptors of global features.
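As a non-limiting sketch of how the extracted feature information might be associated with the object identifier during registration, the Python snippet below stores, for each first input image, the local feature point descriptors and a global feature descriptor in a dictionary keyed by the identifier. The `feature_library` layout and the `extract_local_descriptors` / `extract_global_descriptor` callables are hypothetical placeholders and not part of the claimed embodiments.

```python
# Hypothetical feature library: object identifier -> list of per-image feature entries.
feature_library = {}

def register_object(object_id, input_images, extract_local_descriptors, extract_global_descriptor):
    """Register an object by storing feature information extracted from each input image.

    `input_images` may mix real images and differentiably rendered composite images.
    The two extractor callables are assumed to return arrays of descriptors.
    """
    entries = []
    for image in input_images:
        entries.append({
            "local_descriptors": extract_local_descriptors(image),  # per-keypoint descriptors
            "global_descriptor": extract_global_descriptor(image),  # one vector per image
        })
    # Associating the extracted feature information with the object identifier
    # completes the registration; no network retraining is required.
    feature_library.setdefault(object_id, []).extend(entries)
    return len(entries)
```

Because registration only appends descriptors keyed by the identifier, adding a new object in this sketch does not touch the entries of previously registered objects.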
在另一种可能的实现方式中,分别提取多个第一输入图像的特征信息,具体可以实现为:将多个第一输入图像分别输入第一网络进行位姿识别,得到每个第一输入图像中第一对象的位姿;按照获取的第一对象的位姿,将第一对象的三维模型分别投影至每个第一输入图像,得到每个第一输入图像中的投影区域;分别在每个第一输入图像中的投影区域,提取特征信息。其中,第一网络用于识别图像中第一对象的位姿。通过图像中第一对象的位姿,确定图像中第一对象的区域,在该区域中提取特征信息,提高了特征信息提取的效率以及准确度。In another possible implementation manner, respectively extracting the feature information of the plurality of first input images may be specifically implemented as: inputting the plurality of first input images into a first network respectively for pose recognition, to obtain the pose of the first object in each first input image; projecting the three-dimensional model of the first object onto each first input image according to the obtained pose of the first object, to obtain a projection area in each first input image; and extracting the feature information in the projection area of each first input image. The first network is used to identify the pose of the first object in an image. Determining the region of the first object in the image from its pose and extracting feature information in that region improves the efficiency and accuracy of feature information extraction.
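One plausible way to realize "project the 3D model with the estimated pose, then extract features only in the projected region" is sketched below with OpenCV. The sampled `model_points`, the convex-hull mask, and the use of ORB descriptors are illustrative assumptions rather than the specific first network or descriptors of the embodiments.

```python
import cv2
import numpy as np

def extract_features_in_projection(image, model_points, rvec, tvec, camera_matrix, dist_coeffs):
    """Project 3D model points into the image using the estimated pose,
    then extract local features only inside the projected region."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image

    # Project the sampled 3D model points with the pose (rvec, tvec) and camera intrinsics.
    projected, _ = cv2.projectPoints(model_points.astype(np.float32),
                                     rvec, tvec, camera_matrix, dist_coeffs)
    projected = projected.reshape(-1, 2)

    # Build a mask covering the projected region (convex hull of the projected points).
    mask = np.zeros(gray.shape, dtype=np.uint8)
    hull = cv2.convexHull(projected.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)

    # Extract local feature points and descriptors restricted to the projected region.
    orb = cv2.ORB_create()
    keypoints, descriptors = orb.detectAndCompute(gray, mask)
    return keypoints, descriptors, mask
```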
在另一种可能的实现方式中,分别提取多个第一输入图像的特征信息,具体可以实现为:将多个第一输入图像分别输入第一网络进行黑白图像提取,获取每个第一输入图像中第一对象的黑白图像;分别在每个第一输入图像中第一对象的黑白图像内,提取特征信息。其中,第一网络用于提取图像中第一对象的黑白图像。将图像中第一对象的黑白图像,确定为图像中第一对象的区域,在该区域中提取特征信息,提高了特征信息提取的效率以及准确度。In another possible implementation manner, the feature information of the plurality of first input images is extracted respectively, which can be specifically implemented as: inputting the plurality of first input images into the first network respectively to extract black and white images, and obtaining each first input image A black-and-white image of the first object in the image; extract feature information from the black-and-white image of the first object in each first input image respectively. Wherein, the first network is used to extract the black and white image of the first object in the image. The black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
在另一种可能的实现方式中,本申请提供的对象注册方法还可以包括优化可微渲染器的过程,该过程可以包括:将第一对象的N个真实图像分别输入第一网络,得到第一网络输出的每个真实图像中第一对象的第二位姿;N大于或等于1;第一网络用于识别图像中第一对象的位姿;根据第一对象的三维模型,采用可微渲染器在每个第二位姿下,可微渲染得到N个第二合成图像;获取一个第二位姿的真实图像与该一个第二位姿渲染得到的第二合成图像相对应;分别截取每个真实图像中,与真实对象对应的第二合成图像中第一对象相同位置的区域,作为每个真实图像的前景图像;根据N个真实图像的前景图像与其对应的第二合成图像的第一差异信息,构建第一损失函数;其中,第一差异信息用于指示前景图像与其对应的第二合成图像的差异;根据第一损失函数更新可微渲染器,使得可微渲染器的输出的合成图像逼近对象的真实图像。通过对可微渲染器进行优化,提高了可微渲染器的渲染真实性,降低可微渲染得到的合成图像与真实图像之间的差异。In another possible implementation manner, the object registration method provided by the present application may further include a process of optimizing the differentiable renderer. The process may include: inputting N real images of the first object into the first network respectively, to obtain the second pose of the first object in each real image output by the first network, where N is greater than or equal to 1 and the first network is used to identify the pose of the first object in an image; according to the three-dimensional model of the first object, using the differentiable renderer to obtain N second composite images by differentiable rendering under each second pose, where the real image of a second pose corresponds to the second composite image rendered under that second pose; cropping, from each real image, the region at the same position as the first object in its corresponding second composite image, as the foreground image of that real image; constructing a first loss function according to first difference information between the foreground images of the N real images and their corresponding second composite images, where the first difference information is used to indicate the difference between a foreground image and its corresponding second composite image; and updating the differentiable renderer according to the first loss function, so that the composite images output by the differentiable renderer approach the real images of the object. By optimizing the differentiable renderer, the rendering realism of the differentiable renderer is improved, and the difference between the composite images obtained by differentiable rendering and the real images is reduced.
在另一种可能的实现方式中,第一差异信息可以包括下述信息中一项或多项:特征图差、像素颜色差、提取的特征描述子之差。In another possible implementation manner, the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
在另一种可能的实现方式中,第一损失函数可以为N个真实图像的前景图像与其对应的第二合成图像的多个第一差异信息的计算值之和。In another possible implementation manner, the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
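For readers who want a concrete picture of the renderer-optimization loop described above, the following PyTorch-style sketch is a minimal illustration under several assumptions: `renderer` is a torch.nn.Module whose call `renderer(model_3d, pose)` returns a composite image together with a binary mask, and `feature_net` maps an image tensor to a feature map. These names, and the choice of L1 pixel and feature terms as the first difference information, are illustrative placeholders rather than the claimed implementation.

```python
import torch

def optimize_renderer(renderer, feature_net, model_3d, real_images, poses, steps=100, lr=1e-3):
    """Update the differentiable renderer so its composite images approach the real images."""
    optimizer = torch.optim.Adam(renderer.parameters(), lr=lr)
    for _ in range(steps):
        loss = 0.0
        for real_image, pose in zip(real_images, poses):
            composite, mask = renderer(model_3d, pose)       # second composite image + object mask
            foreground = real_image * mask                   # crop the same region from the real image
            pixel_diff = torch.nn.functional.l1_loss(composite * mask, foreground)
            feature_diff = torch.nn.functional.l1_loss(feature_net(composite),
                                                       feature_net(foreground))
            loss = loss + pixel_diff + feature_diff          # sum of first-difference-information terms
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return renderer
```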
在另一种可能的实现方式中,本申请提供的对象注册方法还可以包括训练对象位姿检测网络的方法,具体可以包括:获取包括第一对象的多个第二输入图像,该多个 第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像;多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的;多个第三位姿不同;将多个第二输入图像分别输入第二网络进行位姿识别,得到第二网络输出的每个第二输入图像中第一对象的第四位姿;第二网络用于识别图像中第一对象的位姿;根据第一对象的三维模型,可微渲染获取第一对象在每个第四位姿下的第四合成图像;获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应;根据每个第四合成图像与其对应的第二输入图像的第二差异信息,构建第二损失函数;第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异;根据第二损失函数更新第二网络,得到第一网络;第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异,小于第二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。通过训练对象位姿检测网络,提高了位姿识别网络识别的准确度,降低位姿识别网络的输出与图像中对象的真实位姿之间的差异。In another possible implementation manner, the object registration method provided by the present application may further include a method for training an object pose detection network, which may specifically include: acquiring multiple second input images including the first object, the multiple first The second input image includes a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; The multiple third poses are different; the multiple second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network It is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second image of the fourth pose The input image corresponds to a fourth composite image rendered by a fourth pose; a second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used for Indicate the difference between the fourth composite image and its corresponding second input image; update the second network according to the second loss function to obtain the first network; the pose of the first object in the image recognized by the first network is different from the first object in the image. The difference of the real pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image. By training the object pose detection network, the recognition accuracy of the pose recognition network is improved, and the difference between the output of the pose recognition network and the real pose of the object in the image is reduced.
在另一种可能的实现方式中,第二损失函数Loss 2可以满足如下表达式:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X大于或等于1,λ i为权值,L i用于表示第四合成图像与其对应的第二输入图像的第二差异信息的计算值。该实现方式提供一种具体的第二损失函数的表达,使得对象位姿检测网络进行训练,提高了位姿检测网络识别的准确度。
In another possible implementation, the second loss function Loss_2 can satisfy the following expression:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the i-th item of second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, so that the object pose detection network is trained and the recognition accuracy of the pose detection network is improved.
在另一种可能的实现方式中,第二差异信息可以包括下述内容中一项或多项:第四合成图像中第一对象的黑白图像与其对应的第二输入图像的中第一对象的黑白图像的交并比(intersection of union,IOU)之差、获取第四合成图像的第四位姿与第四合成图像经过第一网络得到的位姿之差、第四合成图像与其对应的第二输入图像中与第四合成图像中的第一对象相同位置的区域图像的相似度。该实现方式提供了第二差异信息的可能实现,丰富了第二差异信息的内容。In another possible implementation manner, the second difference information may include one or more of the following: the difference, in terms of intersection over union (IOU), between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in its corresponding second input image; the difference between the fourth pose used to obtain the fourth composite image and the pose obtained by passing the fourth composite image through the first network; and the similarity between the fourth composite image and the region of its corresponding second input image at the same position as the first object in the fourth composite image. This implementation provides a possible realization of the second difference information and enriches its content.
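As a concrete, assumption-laden illustration of the weighted-sum form of the second loss function, the sketch below combines X = 3 example difference terms (a mask-IOU term, a pose-consistency term, and a region-similarity term) with weights λ_i. The function names and the specific choice of terms are hypothetical and only mirror the items listed above.

```python
import torch

def mask_iou(mask_a, mask_b, eps=1e-6):
    """Intersection over union of two binary masks (the black-and-white images)."""
    inter = (mask_a * mask_b).sum()
    union = mask_a.sum() + mask_b.sum() - inter
    return inter / (union + eps)

def second_loss(render_mask, input_mask, render_pose, reestimated_pose,
                render_crop, input_crop, weights=(1.0, 1.0, 1.0)):
    """Loss_2 = sum_i lambda_i * L_i, with three example difference terms (X = 3)."""
    l_iou = 1.0 - mask_iou(render_mask, input_mask)                  # IOU-based mask difference
    l_pose = torch.norm(render_pose - reestimated_pose)              # pose consistency term
    l_sim = torch.nn.functional.mse_loss(render_crop, input_crop)    # region similarity term
    terms = (l_iou, l_pose, l_sim)
    return sum(w * t for w, t in zip(weights, terms))
```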
第二方面,提供一种显示方法,该方法可以包括:获取第一图像;若第一图像中包括一个或多个可识别对象,输出第一信息,该第一信息用于提示检测到第一图像中包括可识别对象;通过第一图像包括的每个可识别对象对应的位姿检测网络,获取第一图像中每个可识别对象的位姿;按照每个可识别对象的位姿,显示每个可识别对象对应的虚拟内容;若第一图像中未包括任一可识别对象,输出第二信息,第二信息用于提示未检测到可识别对象,调整视角以获取第二图像,第二图像与所述第一图像不同。In a second aspect, a display method is provided, the method may include: acquiring a first image; if the first image includes one or more identifiable objects, outputting first information, where the first information is used to prompt detection of the first image The image includes recognizable objects; the pose detection network corresponding to each recognizable object included in the first image is used to obtain the pose of each recognizable object in the first image; according to the pose of each recognizable object, display Virtual content corresponding to each recognizable object; if the first image does not include any recognizable object, output the second information, the second information is used to prompt that no recognizable object is detected, adjust the viewing angle to obtain the second image, and the first The second image is different from the first image.
通过本申请提供的显示方法,将图像中是否包括可识别对象向用户输出提示,使得用户可以直观获取图像中是否包括可识别对象,提高了用户体验。With the display method provided in the present application, a prompt is output to the user whether the image includes a identifiable object, so that the user can intuitively obtain whether the image includes a identifiable object, which improves user experience.
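The control flow of the display method can be pictured with the following purely illustrative Python sketch; `detect_recognizable_objects`, `pose_networks`, `render_virtual_content`, and `notify_user` are hypothetical callables standing in for the corresponding modules, not an actual API of the embodiments.

```python
def display_frame(image, detect_recognizable_objects, pose_networks,
                  render_virtual_content, notify_user):
    """One pass of the display method: prompt the user and overlay virtual content.

    `detect_recognizable_objects` returns a list of object identifiers found in `image`;
    `pose_networks[obj_id]` is the pose detection network registered for that object.
    """
    objects = detect_recognizable_objects(image)
    if not objects:
        # Second information: no recognizable object, ask the user to adjust the viewing angle.
        notify_user("No recognizable object detected; please adjust the viewing angle.")
        return
    # First information: recognizable objects were detected in the image.
    notify_user("Recognizable object(s) detected: %s" % ", ".join(map(str, objects)))
    for obj_id in objects:
        pose = pose_networks[obj_id](image)        # per-object pose detection network
        render_virtual_content(obj_id, pose)       # display virtual content at that pose
```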
一种可能的实现方式中,本申请提供的显示方法还可以包括:提取第一图像中的特征信息,该特征信息用于指示所述第一图像中可识别的特征;判断特征库中是否存在与提取的特征信息的匹配距离满足预设条件的特征信息;其中,特征库中存储了不同对象的一个或多个特征信息;若特征库中存在与特征信息的匹配距离满足预设条件的特征信息,确定第一图像中包括一个或多个可识别对象;若特征库中不存在与特征信息的匹配距离满足预设条件的特征信息,确定第一图像未包括任一可识别对象。通过图像的特征信息与特征库对比,可以简单快速的判断图像中是否包括可识别对象。In a possible implementation manner, the display method provided by the present application may further include: extracting feature information in the first image, where the feature information is used to indicate recognizable features in the first image; judging whether the feature library exists Feature information whose matching distance with the extracted feature information satisfies a preset condition; wherein, one or more feature information of different objects is stored in the feature library; if there is a feature whose matching distance with the feature information satisfies the preset condition in the feature library information, it is determined that the first image includes one or more identifiable objects; if there is no feature information whose matching distance with the feature information satisfies the preset condition in the feature library, it is determined that the first image does not include any identifiable objects. By comparing the feature information of the image with the feature library, it can be easily and quickly judged whether the image contains identifiable objects.
一种可能的实现方式中,预设条件可以包括小于或等于预设阈值。In a possible implementation manner, the preset condition may include less than or equal to a preset threshold.
另一种可能的实现方式中,本申请提供的显示方法还可以包括:获取第一图像中的一个或多个第一局部特征点,第一局部特征点的描述子与特征库中的局部特征点的描述子的匹配距离小于或等于第一阈值;特征库中存储了不同对象的局部特征点的描述子;根据第一局部特征点,确定第一图像中一个或多个感兴趣区域(region of interest,ROI);一个ROI中包括一个对象;提取每个ROI中的全局特征;若每个ROI中的全局特征中存在一个或多个第一全局特征,确定第一图像中包括第一全局特征对应的可识别对象;其中,第一全局特征的描述子与特征库中的全局特征的描述子的匹配距离小于或等于第二阈值;特征库中还存储了不同对象的全局特征的描述子;若每个ROI中的全局特征中不存在第一全局特征,确定第一图像未包括任一可识别对象。先通过图像的局部特征点与特征库对比,确定ROI区域,再在ROI区域中提取全局特征与特征库对比,可以提高判断图像中是否包括可识别对象的效率,也提高了判断图像中是否包括可识别对象准确度。In another possible implementation manner, the display method provided by the present application may further include: acquiring one or more first local feature points in the first image, where the matching distance between the descriptors of the first local feature points and the descriptors of local feature points in the feature library is less than or equal to a first threshold, and the feature library stores descriptors of local feature points of different objects; determining one or more regions of interest (ROI) in the first image according to the first local feature points, where one ROI includes one object; extracting a global feature in each ROI; if one or more first global features exist among the global features of the ROIs, determining that the first image includes the recognizable objects corresponding to the first global features, where the matching distance between the descriptor of a first global feature and a descriptor of a global feature in the feature library is less than or equal to a second threshold, and the feature library also stores descriptors of global features of different objects; and if no first global feature exists among the global features of the ROIs, determining that the first image does not include any recognizable object. By first comparing the local feature points of the image with the feature library to determine the ROIs, and then extracting global features in the ROIs and comparing them with the feature library, both the efficiency and the accuracy of judging whether the image includes a recognizable object are improved.
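A minimal sketch of this coarse-to-fine matching is given below, assuming the image is a NumPy array, keypoints are (x, y) coordinates, and the feature library is held in two plain dictionaries; the helper names, the L2 matching distance, and the fixed square ROI are simplifying assumptions, not the claimed implementation.

```python
import numpy as np

def find_recognizable_objects(image, keypoints, local_descs, global_extractor,
                              library_local, library_global, t1, t2, roi_size=128):
    """Coarse-to-fine check of whether the image contains a registered object.

    `library_local` / `library_global` map object identifiers to stored descriptors;
    thresholds `t1` and `t2` play the role of the first and second thresholds.
    """
    rois = []
    for kp, desc in zip(keypoints, local_descs):
        # First local feature points: matching distance to a stored local descriptor <= t1.
        dists = [np.linalg.norm(desc - d) for descs in library_local.values() for d in descs]
        if dists and min(dists) <= t1:
            x, y = int(kp[0]), int(kp[1])
            half = roi_size // 2
            rois.append(image[max(0, y - half): y + half, max(0, x - half): x + half])

    recognized = []
    for roi in rois:
        g = global_extractor(roi)                      # global feature of the ROI
        for obj_id, g_lib in library_global.items():
            if np.linalg.norm(g - g_lib) <= t2:        # first global feature condition
                recognized.append(obj_id)
    return sorted(set(recognized))                     # empty list -> no recognizable object
```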
另一种可能的实现方式中,本申请提供的显示方法还可以包括与前述第一方面提供的对象注册方法结合使用,本申请提供的显示方法还可以包括:获取包括第一对象的多个第一输入图像,多个第一输入图像包括第一对象的真实图像和/或第一对象的多个第一合成图像;多个第一合成图像是由第一对象的三维模型在多个第一位姿下可微渲染得到的;多个第一位姿不同;分别提取多个第一输入图像的特征信息,特征信息用于指示第一对象在其所处的第一输入图像中的特征;将每个第一输入图像中提取的特征信息,与第一对象的标识相对应存储于所述特征库,进行第一对象的注册。通过提取图像中对象的特征信息进行对象注册,注册时间短,且不影响其他已注册对象的识别性能,即使新增很多对象注册,也不会过大的增加注册时间,也保证了已注册对象的检测性能。In another possible implementation manner, the display method provided by the present application may further include using in combination with the object registration method provided in the foregoing first aspect, and the display method provided by the present application may further include: acquiring a plurality of first objects including the first object. an input image, the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are composed of a three-dimensional model of the first object in a plurality of first Differentiable rendering is obtained under the pose; multiple first poses are different; feature information of multiple first input images is extracted respectively, and the feature information is used to indicate the feature of the first object in the first input image where it is located; The feature information extracted from each first input image is stored in the feature library corresponding to the identifier of the first object, and the first object is registered. Object registration is performed by extracting the feature information of objects in the image. The registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time will not be increased too much, and the registered objects are also guaranteed. detection performance.
另一种可能的实现方式中,上述特征信息可以包括局部特征点的描述子,以及全局特征的描述子。In another possible implementation manner, the above feature information may include descriptors of local feature points and descriptors of global features.
需要说明的是,对象注册方法的具体实现可以参照前述第一方面,此处不再赘述。It should be noted that, for the specific implementation of the object registration method, reference may be made to the foregoing first aspect, which will not be repeated here.
另一种可能的实现方式中,本申请提供的显示方法还可以包括优化可微渲染器的过程,该过程可以包括:将第一对象的N个真实图像分别输入第一网络,得到第一网络输出的每个真实图像中第一对象的第二位姿;N大于或等于1;第一网络用于识别图像中第一对象的位姿;根据第一对象的三维模型,采用可微渲染器在每个第二位姿下,可微渲染得到N个第二合成图像;获取一个第二位姿的真实图像与该一个第二位姿渲染得到的第二合成图像相对应;分别截取每个真实图像中,与真实对象对应的第二合成图像中第一对象相同位置的区域,作为每个真实图像的前景图像;根据N个真实图像的前景图像与其对应的第二合成图像的第一差异信息,构建第一损失函数;其中,第一差异信息用于指示前景图像与其对应的第二合成图像的差异;根据第一损失函数更新可微渲染器,使得可微渲染器的输出的合成图像逼近对象的真实图像。通过对可微渲染器进行优化,提高了可微渲染器的渲染真实性,降低可微渲染得到的合成图像与真实图像之间的差异。In another possible implementation manner, the display method provided by the present application may further include a process of optimizing the differentiable renderer, and the process may include: inputting N real images of the first object into the first network respectively, to obtain the first network The second pose of the first object in each output real image; N is greater than or equal to 1; the first network is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, a differentiable renderer is used Under each second pose, N second composite images are obtained by differentiable rendering; a real image of a second pose is obtained corresponding to the second composite image rendered by the one second pose; each image is intercepted separately In the real image, the area at the same position of the first object in the second synthetic image corresponding to the real object is used as the foreground image of each real image; according to the first difference between the foreground images of the N real images and their corresponding second synthetic images information to construct a first loss function; wherein, the first difference information is used to indicate the difference between the foreground image and its corresponding second composite image; the differentiable renderer is updated according to the first loss function, so that the output composite image of the differentiable renderer is Approach the real image of the object. By optimizing the differentiable renderer, the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.
在另一种可能的实现方式中,第一差异信息可以包括下述信息中一项或多项:特 征图差、像素颜色差、提取的特征描述子之差。In another possible implementation manner, the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
在另一种可能的实现方式中,第一损失函数可以为N个真实图像的前景图像与其对应的第二合成图像的多个第一差异信息的计算值之和。In another possible implementation manner, the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
在另一种可能的实现方式中,本申请提供的对象注册方法还可以包括训练对象位姿检测网络的方法,具体可以包括:获取包括第一对象的多个第二输入图像,该多个第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像;多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的;多个第三位姿不同;将多个第二输入图像分别输入第二网络进行位姿识别,得到第二网络输出的每个第二输入图像中第一对象的第四位姿;第二网络用于识别图像中第一对象的位姿;根据第一对象的三维模型,可微渲染获取第一对象在每个第四位姿下的第四合成图像;获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应;根据每个第四合成图像与其对应的第二输入图像的第二差异信息,构建第二损失函数;第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异;根据第二损失函数更新第二网络,得到第一网络;第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异,小于第二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。通过对位姿检测网络进行训练,提高了位姿检测网络识别的准确度,降低位姿检测网络的输出与图像中对象的真实位姿之间的差异。In another possible implementation manner, the object registration method provided by the present application may further include a method for training an object pose detection network, which may specifically include: acquiring multiple second input images including the first object, the multiple first The second input image includes a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; The multiple third poses are different; the multiple second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network It is used to identify the pose of the first object in the image; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second image of the fourth pose The input image corresponds to a fourth composite image rendered by a fourth pose; a second loss function is constructed according to the second difference information of each fourth composite image and its corresponding second input image; the second difference information is used for Indicate the difference between the fourth composite image and its corresponding second input image; update the second network according to the second loss function to obtain the first network; the pose of the first object in the image recognized by the first network is different from the first object in the image. The difference of the real pose is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image. By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
在另一种可能的实现方式中,第二损失函数Loss 2可以满足如下表达式:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X大于或等于1,λ i为权值,L i用于表示第四合成图像与其对应的第二输入图像的第二差异信息的计算值。该实现方式提供一种具体的第二损失函数的表达,对位姿检测网络进行优化,提高了位姿检测网络识别的准确度。
In another possible implementation, the second loss function Loss_2 can satisfy the following expression:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the i-th item of second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, optimizes the pose detection network, and improves the recognition accuracy of the pose detection network.
在另一种可能的实现方式中,第二差异信息可以包括下述内容中一项或多项:第四合成图像中第一对象的黑白图像与其对应的第二输入图像的中第一对象的黑白图像的IOU之差、获取第四合成图像的第四位姿与第四合成图像经过第一网络得到的位姿之差、第四合成图像与其对应的第二输入图像中与第四合成图像中的第一对象相同位置的区域图像的相似度。该实现方式提供了第二差异信息的可能实现,丰富了第二差异信息的内容。In another possible implementation manner, the second difference information may include one or more of the following contents: the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image The difference between the IOUs of the black and white images, the difference between the fourth pose obtained from the fourth composite image and the pose obtained by the fourth composite image through the first network, the fourth composite image and the corresponding second input image and the fourth composite image The similarity of the region images in the same position of the first object. This implementation provides a possible realization of the second difference information and enriches the content of the second difference information.
第三方面,本申请提供一种训练对象位姿检测网络的方法,该方法具体可以包括:获取包括第一对象的多个第二输入图像,该多个第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像;多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的;多个第三位姿不同;将多个第二输入图像分别输入第二网络进行位姿识别,得到第二网络输出的每个第二输入图像中第一对象的第四位姿;第二网络用于识别图像中第一对象的位姿;根据第一对象的三维模型,可微渲染获取第一对象在每个第四位姿下的第四合成图像;获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应;根据每个第四合成图像与其对应的第二输入图像的第二差异信息,构建第二损失函数;第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异;根据第二损失函数更新第二网络,得到第一网络;第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异,小于第 二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。In a third aspect, the present application provides a method for training an object pose detection network, the method may specifically include: acquiring a plurality of second input images including a first object, the plurality of second input images including a real image of the first object image and/or a plurality of third composite images of the first object; the plurality of third composite images are rendered by the three-dimensional model of the first object in a plurality of third poses; the plurality of third poses are different; the A plurality of second input images are respectively input to the second network for pose recognition, and the fourth pose of the first object in each second input image output by the second network is obtained; the second network is used to identify the pose of the first object in the image. pose; according to the three-dimensional model of the first object, microrendering obtains the fourth composite image of the first object under each fourth pose; obtains a second input image of the fourth pose and a fourth pose rendering The obtained fourth composite image corresponds to; according to the second difference information of each fourth composite image and its corresponding second input image, a second loss function is constructed; the second difference information is used to indicate the fourth composite image and its corresponding No. The difference between the two input images; the second network is updated according to the second loss function to obtain the first network; the difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the second The pose of the first object in the image recognized by the network is different from the real pose of the first object in the image.
通过对位姿检测网络进行训练,提高了位姿检测网络识别的准确度,降低位姿检测网络的输出与图像中对象的真实位姿之间的差异。By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
需要说明的是,第三方面的具体实现,可以参照前述第一方面中描述的训练对象位姿检测网络的具体实现,可以达到相同的有益效果,此处不再赘述。It should be noted that, for the specific implementation of the third aspect, reference may be made to the specific implementation of the training object pose detection network described in the foregoing first aspect, which can achieve the same beneficial effects, which will not be repeated here.
第四方面,提供一种对象注册装置,该装置包括:第一获取单元、提取单元以及注册单元。其中:In a fourth aspect, an object registration device is provided, the device includes: a first acquisition unit, an extraction unit, and a registration unit. in:
第一获取单元,用于获取包括第一对象的多个第一输入图像,该多个第一输入图像包括第一对象的真实图像和/或第一对象的多个第一合成图像;多个第一合成图像是由第一对象的三维模型在多个第一位姿下可微渲染得到的;多个第一位姿不同。a first acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images including a real image of the first object and/or a plurality of first composite images of the first object; a plurality of The first composite image is obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses; the multiple first poses are different.
提取单元,用于分别提取多个第一输入图像的特征信息,特征信息用于指示第一对象在其所处的第一输入图像中的特征。The extraction unit is configured to extract feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.
注册单元,用于将提取单元在每个第一输入图像中提取的特征信息,与第一对象的标识相对应,进行第一对象的注册。The registration unit is used for registering the first object by corresponding the feature information extracted by the extraction unit in each first input image to the identifier of the first object.
通过本申请提供的对象注册装置,提取图像中对象的特征信息进行对象注册,注册时间短,且不影响其他已注册对象的识别性能,即使新增很多对象注册,也不会过大的增加注册时间,也保证了已注册对象的检测性能。With the object registration device provided in this application, the feature information of the object in the image is extracted to register the object, and the registration time is short, and the recognition performance of other registered objects is not affected. time, also guarantees the detection performance of registered objects.
在一种可能的实现方式中,特征信息包括局部特征点的描述子,以及全局特征的描述子。In a possible implementation manner, the feature information includes descriptors of local feature points and descriptors of global features.
在另一种可能的实现方式中,提取单元具体可以用于:将多个第一输入图像分别输入第一网络进行位姿识别,得到每个第一输入图像中第一对象的位姿;按照获取的第一对象的位姿,将第一对象的三维模型分别投影至每个第一输入图像,得到每个第一输入图像中的投影区域;分别在每个第一输入图像中的投影区域,提取特征信息。其中,第一网络用于识别图像中第一对象的位姿。通过图像中第一对象的位姿,确定图像中第一对象的区域,在该区域中提取特征信息,提高了特征信息提取的效率以及准确度。In another possible implementation manner, the extraction unit may be specifically configured to: input a plurality of first input images into the first network respectively for pose recognition, and obtain the pose of the first object in each first input image; Obtaining the pose of the first object, project the three-dimensional model of the first object to each first input image, respectively, to obtain the projection area in each first input image; respectively in the projection area in each first input image , extract feature information. Wherein, the first network is used to identify the pose of the first object in the image. Through the pose of the first object in the image, the region of the first object in the image is determined, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
在另一种可能的实现方式中,提取单元具体可以用于:将多个第一输入图像分别输入第一网络进行黑白图像提取,获取每个第一输入图像中第一对象的黑白图像;分别在每个第一输入图像中第一对象的黑白图像内,提取特征信息。其中,第一网络用于提取图像中第一对象的黑白图像。将图像中第一对象的黑白图像,确定为图像中第一对象的区域,在该区域中提取特征信息,提高了特征信息提取的效率以及准确度。In another possible implementation manner, the extraction unit may be specifically configured to: input a plurality of first input images into a first network respectively to extract black and white images, and obtain a black and white image of the first object in each of the first input images; Within the black and white image of the first object in each of the first input images, feature information is extracted. Wherein, the first network is used to extract the black and white image of the first object in the image. The black and white image of the first object in the image is determined as the region of the first object in the image, and feature information is extracted in the region, which improves the efficiency and accuracy of feature information extraction.
在另一种可能的实现方式中,该装置还可以包括:处理单元、可微渲染器、截图单元、构建单元以及更新单元。其中:In another possible implementation manner, the apparatus may further include: a processing unit, a differentiable renderer, a screenshot unit, a construction unit, and an update unit. in:
处理单元,用于将第一对象的N个真实图像分别输入第一网络,得到第一网络输出的每个真实图像中所述第一对象的第二位姿。N大于或等于1。第一网络用于识别图像中第一对象的位姿。The processing unit is configured to input the N real images of the first object into the first network respectively, and obtain the second pose of the first object in each real image output by the first network. N is greater than or equal to 1. The first network is used to identify the pose of the first object in the image.
可微渲染器,用于根据第一对象的三维模型,在每个第二位姿下,可微渲染得到N个第二合成图像;获取一个第二位姿的真实图像与一个第二位姿渲染得到的第二合成图像相对应。The differentiable renderer is used to obtain N second composite images by differentiable rendering under each second pose according to the three-dimensional model of the first object; obtain a real image of the second pose and a second pose The rendered second composite image corresponds to.
截取单元,用于分别截取每个真实图像中,与真实对象对应的第二合成图像中第一对象相同位置的区域,作为每个真实图像的前景图像。The intercepting unit is used for intercepting, in each real image, an area at the same position of the first object in the second composite image corresponding to the real object, as the foreground image of each real image.
构建单元,用于根据N个真实图像的前景图像与其对应的第二合成图像的第一差异信息,构建第一损失函数。其中,第一差异信息用于指示前景图像与其对应的第二合成图像的差异。The construction unit is configured to construct a first loss function according to the first difference information of the foreground images of the N real images and their corresponding second synthetic images. The first difference information is used to indicate the difference between the foreground image and its corresponding second composite image.
更新单元,用于根据第一损失函数更新可微渲染器,使得可微渲染器的输出的合成图像逼近对象的真实图像。The updating unit is configured to update the differentiable renderer according to the first loss function, so that the synthetic image output by the differentiable renderer approximates the real image of the object.
通过对可微渲染器进行优化,提高了可微渲染器的渲染真实性,降低可微渲染得到的合成图像与真实图像之间的差异。By optimizing the differentiable renderer, the rendering authenticity of the differentiable renderer is improved, and the difference between the synthetic image obtained by differentiable rendering and the real image is reduced.
在另一种可能的实现方式中,第一差异信息可以包括下述信息中一项或多项:特征图差、像素颜色差、提取的特征描述子之差。In another possible implementation manner, the first difference information may include one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
在另一种可能的实现方式中,第一损失函数可以为N个真实图像的前景图像与其对应的第二合成图像的多个第一差异信息的计算值之和。In another possible implementation manner, the first loss function may be the sum of calculated values of multiple first difference information of the foreground images of the N real images and their corresponding second synthetic images.
在另一种可能的实现方式中,该装置还可以包括:第二获取单元、处理单元、可微渲染器、构建单元以及更新单元。其中:In another possible implementation manner, the apparatus may further include: a second acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit. in:
第二获取单元,用于获取包括第一对象的多个第二输入图像,多个第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像;多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的。多个第三位姿不同。a second acquiring unit, configured to acquire a plurality of second input images including the first object, the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; The three composite images are rendered by the three-dimensional model of the first object in multiple third poses. The multiple third poses are different.
处理单元,用于将多个第二输入图像分别输入第二网络进行位姿识别,得到第二网络输出的每个第二输入图像中第一对象的第四位姿;第二网络用于识别图像中第一对象的位姿。The processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.
可微渲染器,用于根据第一对象的三维模型,可微渲染获取第一对象在每个第四位姿下的第四合成图像。获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应。A differentiable renderer, configured to obtain a fourth composite image of the first object in each fourth pose by differentiable rendering according to the three-dimensional model of the first object. Obtaining a second input image in a fourth pose corresponds to a fourth composite image rendered in a fourth pose.
构建单元,用于根据每个第四合成图像与其对应的第二输入图像的第二差异信息,构建第二损失函数。第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异。The construction unit is configured to construct a second loss function according to the second difference information of each fourth composite image and its corresponding second input image. The second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
更新单元,用于根据第二损失函数更新第二网络,得到第一网络。第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异,小于第二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。The updating unit is configured to update the second network according to the second loss function to obtain the first network. The difference between the pose of the first object in the image recognized by the first network and the real pose of the first object in the image is smaller than the pose of the first object in the image recognized by the second network and the real pose of the first object in the image difference.
通过对位姿检测网络进行训练,提高了位姿检测网络识别的准确度,降低位姿检测网络的输出与图像中对象的真实位姿之间的差异。By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
在另一种可能的实现方式中,第二损失函数Loss 2可以满足如下表达式:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X大于或等于1,λ i为权值,L i用于表示第四合成图像与其对应的第二输入图像的第二差异信息的计算值。该实现方式提供一种具体的第二损失函数的表达,对位姿检测网络进行优化,提高了位姿检测网络识别的准确度。
In another possible implementation, the second loss function Loss_2 can satisfy the following expression:
Loss_2 = Σ_{i=1}^{X} λ_i · L_i
X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the i-th item of second difference information between the fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, optimizes the pose detection network, and improves the recognition accuracy of the pose detection network.
在另一种可能的实现方式中,第二差异信息可以包括下述内容中一项或多项:第四合成图像中第一对象的黑白图像与其对应的第二输入图像的中第一对象的黑白图像 的IOU之差、获取第四合成图像的第四位姿与第四合成图像经过第一网络得到的位姿之差、第四合成图像与其对应的第二输入图像中与第四合成图像中的第一对象相同位置的区域图像的相似度。该实现方式提供了第二差异信息的可能实现,丰富了第二差异信息的内容。In another possible implementation manner, the second difference information may include one or more of the following contents: the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in the corresponding second input image The difference between the IOUs of the black and white images, the difference between the fourth pose obtained from the fourth composite image and the pose obtained by the fourth composite image through the first network, the fourth composite image and the corresponding second input image and the fourth composite image The similarity of the region images in the same position of the first object. This implementation provides a possible realization of the second difference information and enriches the content of the second difference information.
需要说明的是,第四方面提供的对象注册装置,用于实现上述第一方面提供的对象注册方法,其具体实现可以参照前述第一方面的具体实现,此处不再赘述。It should be noted that the object registration apparatus provided in the fourth aspect is used to implement the object registration method provided in the first aspect, and the specific implementation can refer to the specific implementation of the foregoing first aspect, which will not be repeated here.
第五方面,提供一种显示装置,该装置包括第一获取单元、输出单元以及处理单元;其中:In a fifth aspect, a display device is provided, the device includes a first acquisition unit, an output unit and a processing unit; wherein:
第一获取单元,用于获取第一图像。The first acquisition unit is used to acquire the first image.
输出单元,用于若第一图像中包括一个或多个可识别对象,输出第一信息,第一信息用于提示检测到第一图像中包括可识别对象。若第一图像中未包括任一可识别对象,输出第二信息,第二信息用于提示未检测到可识别对象,调整视角以使得第一获取单元获取第二图像,第二图像与所述第一图像不同。The output unit is configured to output first information if the first image includes one or more identifiable objects, where the first information is used for prompting detection that the first image includes identifiable objects. If the first image does not include any recognizable object, output second information, the second information is used to prompt that no recognizable object is detected, adjust the viewing angle so that the first acquisition unit acquires the second image, the second image and the The first image is different.
处理单元,用于若第一图像中包括一个或多个可识别对象,通过第一图像包括的每个可识别对象对应的位姿检测网络,获取第一图像中每个可识别对象的位姿;按照每个可识别对象的位姿,显示每个可识别对象对应的虚拟内容。The processing unit is configured to obtain the pose of each recognizable object in the first image through the pose detection network corresponding to each recognizable object included in the first image if the first image includes one or more recognizable objects ; According to the pose of each recognizable object, display the virtual content corresponding to each recognizable object.
通过本申请提供的显示装置,将图像中是否包括可识别对象向用户输出提示,使得用户可以直观获取图像中是否包括可识别对象,提高了用户体验。With the display device provided in the present application, a prompt is output to the user whether the image includes a identifiable object, so that the user can intuitively obtain whether the image includes an identifiable object, which improves user experience.
一种可能的实现方式中,该装置还可以包括:提取单元、判断单元以及第一确定单元。其中:In a possible implementation manner, the apparatus may further include: an extraction unit, a judgment unit, and a first determination unit. in:
提取单元,用于提取第一图像中的特征信息,特征信息用于指示第一图像中可识别的特征。An extraction unit, configured to extract feature information in the first image, where the feature information is used to indicate recognizable features in the first image.
判断单元,用于判断特征库中是否存在与特征信息的匹配距离满足预设条件的特征信息。其中,特征库中存储了不同对象的一个或多个特征信息。The judgment unit is used for judging whether there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library. Among them, the feature library stores one or more feature information of different objects.
第一确定单元,用于若特征库中存在与特征信息的匹配距离满足预设条件的特征信息,确定第一图像中包括一个或多个可识别对象;若特征库中不存在与特征信息的匹配距离满足预设条件的特征信息,确定第一图像未包括任一可识别对象。The first determination unit is configured to determine that one or more identifiable objects are included in the first image if there is feature information whose matching distance with the feature information meets a preset condition in the feature library; The feature information whose distance satisfies the preset condition is matched, and it is determined that the first image does not include any identifiable object.
通过图像的特征信息与特征库对比,可以简单快速的判断图像中是否包括可识别对象。By comparing the feature information of the image with the feature library, it can be easily and quickly judged whether the image contains identifiable objects.
另一种可能的实现方式中,预设条件可以包括小于或等于预设阈值。In another possible implementation manner, the preset condition may include less than or equal to a preset threshold.
另一种可能的实现方式中,该装置还可以包括:第二获取单元、第二确定单元以及第一确定单元。其中:In another possible implementation manner, the apparatus may further include: a second acquiring unit, a second determining unit, and a first determining unit. in:
第二获取单元,用于获取第一图像中的一个或多个第一局部特征点,第一局部特征点的描述子与特征库中的局部特征点的描述子的匹配距离小于或等于第一阈值;特征库中存储了不同对象的局部特征点的描述子。The second acquiring unit is configured to acquire one or more first local feature points in the first image, and the matching distance between the descriptors of the first local feature points and the descriptors of the local feature points in the feature library is less than or equal to the first Threshold; descriptors of local feature points of different objects are stored in the feature library.
第二确定单元,用于根据第一局部特征点,确定第一图像中一个或多个ROI;一个ROI中包括一个对象。The second determining unit is configured to determine one or more ROIs in the first image according to the first local feature points; one ROI includes one object.
相应的,提取单元还可以用于,提取每个ROI中的全局特征。Correspondingly, the extraction unit can also be used to extract global features in each ROI.
第一确定单元,用于若每个ROI中的全局特征中存在一个或多个第一全局特征, 确定第一图像中包括第一全局特征对应的可识别对象;若每个ROI中的全局特征中不存在第一全局特征,确定第一图像未包括任一可识别对象。其中,第一全局特征的描述子与特征库中的全局特征的描述子的匹配距离小于或等于第二阈值。特征库中还存储了不同对象的全局特征的描述子。a first determining unit, configured to determine that the first image includes an identifiable object corresponding to the first global feature if there are one or more first global features in the global features in each ROI; if the global features in each ROI If the first global feature does not exist in the first image, it is determined that the first image does not include any identifiable object. The matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold. Descriptors of global features of different objects are also stored in the feature library.
先通过图像的局部特征点与特征库对比,确定ROI区域,再在ROI区域中提取全局特征与特征库对比,可以提高判断图像中是否包括可识别对象的效率,也提高了判断图像中是否包括可识别对象准确度。First, the ROI area is determined by comparing the local feature points of the image with the feature library, and then the global features are extracted in the ROI area and compared with the feature library, which can improve the efficiency of judging whether the image contains identifiable objects, and also improves the judgment of whether the image contains identifiable objects. Identifiable object accuracy.
另一种可能的实现方式中,该装置还可以包括:第三获取单元、提取单元以及注册单元。其中:In another possible implementation manner, the apparatus may further include: a third acquisition unit, an extraction unit, and a registration unit. in:
第三获取单元,用于获取包括第一对象的多个第一输入图像,多个第一输入图像包括第一对象的真实图像和/或第一对象的多个第一合成图像;多个第一合成图像是由第一对象的三维模型在多个第一位姿下可微渲染得到的;多个第一位姿不同。a third acquiring unit, configured to acquire a plurality of first input images including a first object, the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; a plurality of first input images A composite image is obtained by differentiable rendering of the three-dimensional model of the first object in multiple first poses; the multiple first poses are different.
提取单元,分别提取多个第一输入图像的特征信息,特征信息用于指示第一对象在其所处的第一输入图像中的特征。The extraction unit extracts feature information of a plurality of first input images respectively, where the feature information is used to indicate features of the first object in the first input image where it is located.
注册单元,用于将每个第一输入图像中提取的特征信息,与第一对象的标识相对应存储于所述特征库,进行第一对象的注册。The registration unit is configured to store the feature information extracted from each first input image corresponding to the identifier of the first object in the feature database, and register the first object.
通过提取图像中对象的特征信息进行对象注册,注册时间短,且不影响其他已注册对象的识别性能,即使新增很多对象注册,也不会过大的增加注册时间,也保证了已注册对象的检测性能。Object registration is performed by extracting the feature information of objects in the image. The registration time is short, and the recognition performance of other registered objects is not affected. Even if many new objects are registered, the registration time will not be increased too much, and the registered objects are also guaranteed. detection performance.
另一种可能的实现方式中,上述特征信息可以包括局部特征点的描述子,以及全局特征的描述子。In another possible implementation manner, the above feature information may include descriptors of local feature points and descriptors of global features.
另一种可能的实现方式中,该装置还可以包括:第四获取单元、处理单元、可微渲染器、构建单元以及更新单元。其中:In another possible implementation manner, the apparatus may further include: a fourth acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an update unit. in:
第四获取单元,用于获取包括第一对象的多个第二输入图像,该多个第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像;多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的;多个第三位姿不同。a fourth acquisition unit, configured to acquire a plurality of second input images including the first object, the plurality of second input images including a real image of the first object and/or a plurality of third composite images of the first object; a plurality of The third composite image is obtained by rendering the three-dimensional model of the first object in multiple third poses; the multiple third poses are different.
处理单元,用于将多个第二输入图像分别输入第二网络进行位姿识别,得到第二网络输出的每个第二输入图像中第一对象的第四位姿;第二网络用于识别图像中第一对象的位姿。The processing unit is used for inputting a plurality of second input images into the second network respectively for pose recognition, and obtaining the fourth pose of the first object in each second input image output by the second network; the second network is used for identifying The pose of the first object in the image.
可微渲染器，用于根据第一对象的三维模型，可微渲染获取第一对象在每个第四位姿下的第四合成图像；获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应。a differentiable renderer, configured to obtain, through differentiable rendering based on the three-dimensional model of the first object, a fourth composite image of the first object in each fourth pose, where the second input image from which a fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose.
构建单元，用于根据每个第四合成图像与其对应的第二输入图像的第二差异信息，构建第二损失函数；第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异。a construction unit, configured to construct a second loss function based on second difference information between each fourth composite image and its corresponding second input image, where the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
更新单元，用于根据第二损失函数更新第二网络，得到第一网络；第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异，小于第二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。an updating unit, configured to update the second network according to the second loss function to obtain a first network, where the difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in the image is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
在另一种可能的实现方式中，第二损失函数Loss_2可以满足如下表达式：
$\mathrm{Loss}_2 = \sum_{i=1}^{X} \lambda_i L_i$
X大于或等于1，λ_i为权值，L_i用于表示第四合成图像与其对应的第二输入图像的第二差异信息的计算值。该实现方式提供一种具体的第二损失函数的表达，对位姿检测网络进行训练，提高了位姿检测网络识别的准确度。
In another possible implementation, the second loss function Loss_2 may satisfy the following expression:
$\mathrm{Loss}_2 = \sum_{i=1}^{X} \lambda_i L_i$
X is greater than or equal to 1, λ_i is a weight, and L_i represents a calculated value of the second difference information between a fourth composite image and its corresponding second input image. This implementation provides a specific expression of the second loss function, with which the pose detection network is trained, improving the recognition accuracy of the pose detection network.
在另一种可能的实现方式中，第二差异信息可以包括下述内容中一项或多项：第四合成图像中第一对象的黑白图像与其对应的第二输入图像中第一对象的黑白图像的IOU之差、获取第四合成图像的第四位姿与第四合成图像经过第一网络得到的位姿之差、第四合成图像与其对应的第二输入图像中与第四合成图像中的第一对象相同位置的区域图像的相似度。该实现方式提供了第二差异信息的可能实现，丰富了第二差异信息的内容。In another possible implementation manner, the second difference information may include one or more of the following: an IoU-based difference between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in its corresponding second input image; a difference between the fourth pose used to obtain the fourth composite image and the pose obtained by passing the fourth composite image through the first network; and a similarity between the fourth composite image and the region of its corresponding second input image located at the same position as the first object in the fourth composite image. This implementation provides possible realizations of the second difference information and enriches its content.
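The sketch below illustrates, under assumed data formats (binary masks as boolean arrays, poses as 6-element vectors, image regions as arrays), how the three kinds of second difference information named above could be measured and combined into a weighted loss; none of these helper names come from the application:

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """IoU of two binary (black-and-white) object masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / union if union > 0 else 0.0

def pose_difference(pose_a, pose_b):
    """L2 distance between two 6-DoF pose vectors (rotation + translation), as a simple stand-in."""
    return float(np.linalg.norm(np.asarray(pose_a, dtype=float) - np.asarray(pose_b, dtype=float)))

def region_similarity(rendered_region, input_region):
    """Normalized cross-correlation of the image regions at the object's position."""
    a = rendered_region.astype(float).ravel() - rendered_region.mean()
    b = input_region.astype(float).ravel() - input_region.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def loss2(terms, weights):
    """Weighted combination of difference terms, in the spirit of Loss_2 = sum_i lambda_i * L_i."""
    return float(sum(w * t for w, t in zip(weights, terms)))

# Example usage: IoU and similarity are turned into "differences" (1 - value) before weighting.
# terms = [1.0 - mask_iou(m_render, m_input), pose_difference(p4, p_net), 1.0 - region_similarity(r, q)]
# Loss_2 = loss2(terms, weights=[1.0, 0.5, 1.0])
```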
需要说明的是,第五方面提供的显示装置,用于实现上述第二方面提供的显示方法,其具体实现可以参照前述第二方面的具体实现,此处不再赘述。It should be noted that the display device provided in the fifth aspect is used to implement the display method provided in the above-mentioned second aspect, and the specific implementation thereof can refer to the specific implementation of the foregoing second aspect, which will not be repeated here.
第六方面，提供一种训练对象位姿检测网络的装置，该装置可以包括：获取单元、处理单元、可微渲染器、构建单元以及更新单元。其中：In a sixth aspect, an apparatus for training an object pose detection network is provided. The apparatus may include: an acquisition unit, a processing unit, a differentiable renderer, a construction unit, and an updating unit, where:
获取单元，用于获取包括第一对象的多个第二输入图像，该多个第二输入图像包括第一对象的真实图像和/或第一对象的多个第三合成图像；多个第三合成图像是由第一对象的三维模型在多个第三位姿下渲染得到的；多个第三位姿不同。an acquisition unit, configured to acquire a plurality of second input images including the first object, where the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; and the plurality of third poses are different.
处理单元，用于将多个第二输入图像分别输入第二网络进行位姿识别，得到第二网络输出的每个第二输入图像中第一对象的第四位姿；第二网络用于识别图像中第一对象的位姿。a processing unit, configured to separately input the plurality of second input images into a second network for pose recognition, to obtain a fourth pose of the first object in each second input image output by the second network, where the second network is used to recognize the pose of the first object in an image.
可微渲染器，用于根据第一对象的三维模型，可微渲染获取第一对象在每个第四位姿下的第四合成图像；获取一个第四位姿的第二输入图像与一个第四位姿渲染得到的第四合成图像相对应。a differentiable renderer, configured to obtain, through differentiable rendering based on the three-dimensional model of the first object, a fourth composite image of the first object in each fourth pose, where the second input image from which a fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose.
构建单元，用于根据每个第四合成图像与其对应的第二输入图像的第二差异信息，构建第二损失函数；第二差异信息用于指示第四合成图像与其对应的第二输入图像的差异。a construction unit, configured to construct a second loss function based on second difference information between each fourth composite image and its corresponding second input image, where the second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
更新单元，用于根据第二损失函数更新第二网络，得到第一网络；第一网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿的差异，小于第二网络识别的图像中第一对象的位姿与图像中第一对象的真实位姿差异。an updating unit, configured to update the second network according to the second loss function to obtain a first network, where the difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in the image is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
通过对位姿检测网络进行训练,提高了位姿检测网络识别的准确度,降低位姿检测网络的输出与图像中对象的真实位姿之间的差异。By training the pose detection network, the recognition accuracy of the pose detection network is improved, and the difference between the output of the pose detection network and the real pose of the object in the image is reduced.
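To make this training flow concrete, here is a small, self-contained sketch (not the application's implementation): a tiny pose network predicts a pose for each second input image, a toy differentiable "renderer" turns that pose back into an image, and the image-space difference drives the update. PoseNet, toy_render, and all sizes are invented stand-ins; a real system would use the object's 3D model and a proper differentiable rasterizer, and a richer second loss function.

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Tiny stand-in for the second network: image -> 6-DoF pose (3 rotation + 3 translation values)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32, 64),
            nn.ReLU(),
            nn.Linear(64, 6),
        )

    def forward(self, img):
        return self.backbone(img)

def toy_render(pose):
    """Toy differentiable 'renderer': maps a pose vector to a 32x32 image through fixed weights.
    A real system would differentiably rasterize the object's 3D model in this pose."""
    weights = torch.linspace(-1.0, 1.0, 6 * 32 * 32).reshape(6, 32 * 32)  # fixed, not learned
    return torch.sigmoid(pose @ weights).reshape(-1, 1, 32, 32)

net = PoseNet()                                  # plays the role of the second network
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
inputs = torch.rand(8, 1, 32, 32)                # stand-in for the second input images

for step in range(100):
    poses = net(inputs)                          # fourth pose for each second input image
    rendered = toy_render(poses)                 # fourth composite image for each fourth pose
    loss = torch.mean((rendered - inputs) ** 2)  # one simple image-difference term
    optimizer.zero_grad()
    loss.backward()                              # gradients flow back through the renderer
    optimizer.step()
# after training, `net` plays the role of the first network
```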
需要说明的是，第六方面提供的优化位姿识别网络的装置，用于实现上述第三方面提供的优化位姿识别网络的方法，其具体实现可以参照前述第三方面的具体实现，此处不再赘述。It should be noted that the apparatus for optimizing the pose recognition network provided in the sixth aspect is used to implement the method for optimizing the pose recognition network provided in the third aspect; for its specific implementation, reference may be made to the specific implementation of the foregoing third aspect, and details are not repeated here.
第七方面，本申请提供了一种电子设备，该电子设备可以实现上述第一方面或第二方面或第三方面描述的方法示例中的功能，所述功能可以通过硬件实现，也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个上述功能相应的模块。该电子设备可以以芯片的产品形态存在。In a seventh aspect, this application provides an electronic device. The electronic device can implement the functions in the method examples described in the first aspect, the second aspect, or the third aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions. The electronic device may exist in the product form of a chip.
在一种可能的实现方式中,该电子设备可以包括处理器和传输接口。其中,传输接口用于接收和发送数据。处理器被配置为调用存储在存储器中的程序指令,以使得 该电子设备执行上述第一方面或第二方面或第三方面描述的方法示例中的功能。In one possible implementation, the electronic device may include a processor and a transmission interface. Among them, the transmission interface is used to receive and send data. The processor is configured to invoke program instructions stored in the memory to cause the electronic device to perform the functions in the method examples described in the first aspect or the second aspect or the third aspect above.
第八方面，提供一种计算机可读存储介质，包括指令，当其在计算机上运行时，使得计算机执行上述任一方面或任意一种可能的实现方式所述的对象注册方法，或显示方法，或训练对象位姿检测网络的方法。In an eighth aspect, a computer-readable storage medium is provided, including instructions that, when run on a computer, cause the computer to perform the object registration method, the display method, or the method for training an object pose detection network described in any one of the above aspects or any possible implementation manner.
第九方面，提供一种计算机程序产品，当其在计算机上运行时，使得计算机执行上述任一方面或任意一种可能的实现方式所述的对象注册方法，或显示方法，或训练对象位姿检测网络的方法。In a ninth aspect, a computer program product is provided that, when run on a computer, causes the computer to perform the object registration method, the display method, or the method for training an object pose detection network described in any one of the above aspects or any possible implementation manner.
第十方面,提供一种芯片系统,该芯片系统包括处理器,还可以包括存储器,用于实现上述方法中的功能。该芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。In a tenth aspect, a chip system is provided, the chip system includes a processor, and may also include a memory, for implementing the functions in the above method. The chip system can be composed of chips, and can also include chips and other discrete devices.
上述第四方面至第十方面提供的方案，用于实现上述第一方面或第二方面或第三方面提供的方法，因此可以与第一方面或第二方面或第三方面达到相同的有益效果，此处不再进行赘述。The solutions provided in the fourth aspect to the tenth aspect are used to implement the methods provided in the first aspect, the second aspect, or the third aspect, and therefore can achieve the same beneficial effects as the first aspect, the second aspect, or the third aspect; details are not repeated here.
需要说明的是,上述各个方面中的任意一个方面的各种可能的实现方式,在方案不矛盾的前提下,均可以进行组合。It should be noted that, various possible implementation manners of any one of the above aspects can be combined on the premise that the solutions are not contradictory.
附图说明Description of drawings
图1为本申请实施例提供的一种终端设备的结构示意图;FIG. 1 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
图2为本申请实施例提供的一种终端设备的软件结构示意图;FIG. 2 is a schematic diagram of a software structure of a terminal device according to an embodiment of the present application;
图3为本申请实施例提供的一种讲解信息与真实物体的虚实融合效果示意图;3 is a schematic diagram of a virtual-real fusion effect of explanation information and a real object provided by an embodiment of the present application;
图4为本申请实施例提供的一种位姿检测流程示意图;FIG. 4 is a schematic flowchart of a pose detection process provided by an embodiment of the present application;
图5a为本申请实施例提供的一种增量式学习的方法流程示意图;5a is a schematic flowchart of a method for incremental learning provided by an embodiment of the present application;
图5b为本申请实施例提供的一种系统架构示意图;FIG. 5b is a schematic diagram of a system architecture provided by an embodiment of the present application;
图6为本申请实施例提供的一种卷积神经网络的结构示意图;6 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;
图7a为本申请实施例提供的一种芯片硬件结构的示意图;7a is a schematic diagram of a chip hardware structure provided by an embodiment of the application;
图7b为本申请实施例提供的方案的系统总体框架示意图;FIG. 7b is a schematic diagram of the overall system framework of the solution provided by the embodiment of the present application;
图8为本申请实施例一提供的一种对象注册方法的流程示意图;8 is a schematic flowchart of an object registration method provided in Embodiment 1 of the present application;
图9为本申请实施例提供的球面示意图;9 is a schematic diagram of a spherical surface provided by an embodiment of the present application;
图10为本申请实施例一提供的另一种对象注册方法的流程示意图;10 is a schematic flowchart of another object registration method provided in Embodiment 1 of the present application;
图11为本申请实施例提供的一种图像区域示意图;11 is a schematic diagram of an image area provided by an embodiment of the present application;
图12为本申请实施例二提供的一种优化方法的流程示意图;12 is a schematic flowchart of an optimization method provided in Embodiment 2 of the present application;
图13为本申请实施例二提供的另一种优化方法的流程示意图;13 is a schematic flowchart of another optimization method provided in Embodiment 2 of the present application;
图14为本申请实施例提供的再一种对象注册方法的流程示意图;14 is a schematic flowchart of still another object registration method provided by an embodiment of the present application;
图15为本申请实施例三提供的一种训练对象位姿检测网络的方法的流程示意图;15 is a schematic flowchart of a method for training an object pose detection network according to Embodiment 3 of the present application;
图16为本申请实施例三提供的另一种训练对象位姿检测网络的方法的流程示意图;16 is a schematic flowchart of another method for training an object pose detection network provided in Embodiment 3 of the present application;
图17为本申请实施例四提供的一种显示方法的流程示意图;17 is a schematic flowchart of a display method provided in Embodiment 4 of the present application;
图18为本申请实施例四提供的一种判断第一图像中是否包含可识别对象的流程示意图;18 is a schematic flowchart of determining whether a first image contains an identifiable object according to Embodiment 4 of the present application;
图19为本申请实施例提供的一种手机界面示意图;19 is a schematic diagram of a mobile phone interface provided by an embodiment of the application;
图20a为本申请实施例提供的另一种手机界面示意图;FIG. 20a is a schematic diagram of another mobile phone interface provided by an embodiment of the present application;
图20b为本申请实施例提供的再一种手机界面示意图;20b is a schematic diagram of still another mobile phone interface provided by the embodiment of the application;
图21为本申请实施例提供的一种对象注册装置的结构示意图;FIG. 21 is a schematic structural diagram of an object registration apparatus provided by an embodiment of the present application;
图22为本申请实施例提供的一种优化装置的结构示意图;22 is a schematic structural diagram of an optimization device provided by an embodiment of the application;
图23为本申请实施例提供的一种训练对象位姿检测网络的装置的结构示意图;23 is a schematic structural diagram of an apparatus for training an object pose detection network provided by an embodiment of the application;
图24为本申请实施例提供的一种显示装置的结构示意图;FIG. 24 is a schematic structural diagram of a display device according to an embodiment of the present application;
图25为本申请实施例提供的装置的结构示意图。FIG. 25 is a schematic structural diagram of an apparatus provided by an embodiment of the present application.
具体实施方式Description of embodiments
在本申请实施例中，为了便于清楚描述本申请实施例的技术方案，采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定，并且“第一”、“第二”等字样也并不限定一定不同。该“第一”、“第二”描述的技术特征间无先后顺序或者大小顺序。In the embodiments of this application, to clearly describe the technical solutions of the embodiments of this application, words such as "first" and "second" are used to distinguish between identical or similar items whose functions and effects are basically the same. A person skilled in the art can understand that the words "first" and "second" do not limit the quantity or the execution order, and "first" and "second" are not necessarily different. There is no chronological order or order of magnitude between the technical features described by "first" and "second".
在本申请实施例中，“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言，使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念，便于理解。In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a specific manner to facilitate understanding.
在本申请实施例中,至少一个还可以描述为一个或多个,多个可以是两个、三个、四个或者更多个,本申请不做限制。In the embodiments of the present application, at least one may also be described as one or more, and the multiple may be two, three, four or more, which is not limited in this application.
此外，本申请实施例描述的网络架构以及场景是为了更加清楚的说明本申请实施例的技术方案，并不构成对于本申请实施例提供的技术方案的限定，本领域普通技术人员可知，随着网络架构的演变和新业务场景的出现，本申请实施例提供的技术方案对于类似的技术问题，同样适用。In addition, the network architectures and scenarios described in the embodiments of this application are intended to describe the technical solutions of the embodiments of this application more clearly, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. A person of ordinary skill in the art may know that, with the evolution of network architectures and the emergence of new service scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
在描述本申请的实施例之前，此处先对本申请涉及的名词统一进行解释说明，后续不再一一进行说明。Before the embodiments of this application are described, the terms involved in this application are explained here in a unified manner, and are not explained again one by one later.
图像,也可称为图片,是指具有视觉效果的画面。本申请所称图像,可以为静态图像,也可以为视频流中的视频帧,或者其他,不予限定。An image, also known as a picture, is a picture with visual effects. The image referred to in this application may be a still image, a video frame in a video stream, or other, which is not limited.
对象，可以存在的人或者事物。例如，事物可以为建筑物、商品、植物、动物等等，此处不再一一列举。An object is a person or thing that can exist. For example, a thing may be a building, a commodity, a plant, an animal, and so on, which are not listed here one by one.
位姿,是指相机坐标系下物体的姿态。位姿可以包括6DoF位姿,即物体相对于相机的平移姿态和旋转姿态。The pose refers to the pose of the object in the camera coordinate system. The poses can include 6DoF poses, that is, the translation pose and rotation pose of the object relative to the camera.
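As a small worked example (all values arbitrary, for illustration only), a 6DoF pose can be represented by a rotation matrix R and a translation vector t, or packed into a single 4x4 homogeneous transform that maps object coordinates into camera coordinates:

```python
import numpy as np

theta = np.deg2rad(30.0)                              # rotate 30 degrees about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.05, 0.6])                       # translation in front of the camera

T = np.eye(4)
T[:3, :3] = R                                         # rotation part
T[:3, 3] = t                                          # translation part

p_object = np.array([0.0, 0.0, 0.0, 1.0])             # a point in object coordinates (homogeneous)
p_camera = T @ p_object                               # the same point in camera coordinates
```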
位姿检测,是指检测识别图像中物体的位姿。Pose detection refers to detecting and recognizing the pose of an object in an image.
对象的真实图像，是指包含该对象的静态的具有视觉效果的画面以及背景区域画图的图像。对象的真实图像可以为红绿蓝(red、green、blue,RGB)格式或者RGBD(red、green、blue,depth)格式。The real image of an object is a static image with visual content that contains both the object and the background area. The real image of the object may be in red-green-blue (RGB) format or in RGBD (red, green, blue, depth) format.
渲染,是通过渲染器,将对象的3D模型,转变为2D图像的过程。通常,场景和实体用三维形式表示,可以更接近于现实世界,便于操纵和变换,而图形的显示设备大多是二维的光栅化显示器和点阵化打印机。光栅显示器可以看作是一个像素矩阵,在光栅显示器上显示的任何一个图形,实际上都是一些具有一种或多种颜色和灰度像 素的集合。将三维实体场景通过光栅和点阵化的表示就是图像渲染--即光栅化。Rendering is the process of converting a 3D model of an object into a 2D image through a renderer. Usually, scenes and entities are represented in three-dimensional form, which can be closer to the real world and facilitate manipulation and transformation, while graphics display devices are mostly two-dimensional rasterized displays and dot-matrix printers. A raster display can be regarded as a pixel matrix, and any graphic displayed on a raster display is actually a collection of pixels with one or more colors and grayscales. The representation of a three-dimensional solid scene through rasterization and lattice is image rendering -- that is, rasterization.
常规渲染,是指光栅化不可微分的一种渲染方式。Conventional rendering refers to a rendering method in which rasterization is not differentiable.
可微渲染,是指光栅化可微分的一种渲染方式。由于渲染过程可微分,则可以根据渲染得到的图像与真实图像的差异构建损失函数,更新可微渲染的参数,提高可微渲染结果的真实性。Differentiable rendering refers to a rendering method in which rasterization is differentiable. Since the rendering process is differentiable, a loss function can be constructed according to the difference between the rendered image and the real image, the parameters of the differentiable rendering can be updated, and the authenticity of the differentiable rendering results can be improved.
对象的合成图像,是指在某一期望位姿下对该对象的三维模型渲染得到的仅包含该对象的图像。对象在某一位姿下的合成图像,等效于该对象在该位姿下拍照得到的图像。The composite image of an object refers to an image containing only the object obtained by rendering the 3D model of the object in a desired pose. The composite image of an object in a certain pose is equivalent to the image obtained by taking pictures of the object in this pose.
合成图像对应的图像，是指渲染该合成图像的位姿，是由该对应的图像输入神经网络获取的。应理解，合成图像与渲染该合成图像的位姿的源图像相互对应，后文不再一一赘述。其中，合成图像对应的图像可以为对象的真实图像，也可以为对象的其他合成图像。The image corresponding to a composite image is the image from which the pose used to render that composite image is obtained, by inputting that image into the neural network. It should be understood that a composite image and the source image that provides the pose for rendering it correspond to each other; this is not repeated one by one below. The image corresponding to a composite image may be a real image of the object, or may be another composite image of the object.
对象的黑白图像,是指仅包含该对象不包含背景的黑白像素图像。具体的,对象的黑白图像可以通过二值化图表示,包含对象的区域像素值为1,其他区域像素值为0。A black-and-white image of an object is a black-and-white pixel image that contains only the object and no background. Specifically, the black-and-white image of the object can be represented by a binarized image, the pixel value of the region including the object is 1, and the pixel value of other regions is 0.
局部特征点,是图像特征的局部表达,它反映了图像上具有的局部特性。局部特征点即图像上与其他像素具有明显区分度的点,包括但不限于角点、关键点等。在图像处理中,局部特征点主要指尺度不变性的点或块。尺度不变性,是同一个物体或场景,从不同的角度采集多幅图片,相同的地方能够被识别出来是相同的。局部特征点可以包括SIFT特征点、SURF特征点、DAISY特征点等。通常,可以采用FAST、DOG等方法提取图像的局部特征点。局部特征点的描述子是表征该特征点局部图像信息的高维向量。Local feature points are local expressions of image features, which reflect the local characteristics of the image. Local feature points are points on the image that are clearly distinguishable from other pixels, including but not limited to corner points, key points, and the like. In image processing, local feature points mainly refer to scale-invariant points or blocks. Scale invariance means that the same object or scene is collected from different angles, and the same place can be identified as the same. The local feature points may include SIFT feature points, SURF feature points, DAISY feature points, and the like. Usually, the local feature points of the image can be extracted by methods such as FAST and DOG. The descriptor of a local feature point is a high-dimensional vector representing the local image information of the feature point.
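A hedged sketch of extracting local feature points and their descriptors, using OpenCV's SIFT as one concrete choice (the application only lists SIFT/SURF/DAISY and FAST/DOG as possibilities; the image path below is a placeholder, and an OpenCV build that ships SIFT, e.g. opencv-python >= 4.4, is assumed):

```python
import cv2

img = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# keypoints: scale-invariant image locations; descriptors: one 128-dimensional vector per keypoint
print(len(keypoints), None if descriptors is None else descriptors.shape)
```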
全局特征是指能表示整幅图像上的特征，全局特征是相对于图像局部特征而言的，用于描述图像或目标的颜色和形状等整体特征。例如，全局特征可以包括颜色特征、纹理特征、形状特征等。通常，可以采用词袋树的方法提取图像的全局特征。全局特征的描述子是表征整副图像或者较大区域的图像信息的高维向量。A global feature is a feature that can represent the entire image. Global features are defined relative to local image features and are used to describe overall characteristics of an image or target, such as color and shape. For example, global features may include color features, texture features, shape features, and the like. Usually, a bag-of-words tree method can be used to extract the global features of an image. The descriptor of a global feature is a high-dimensional vector that represents the image information of the entire image or of a larger region.
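A minimal bag-of-visual-words sketch for computing a global descriptor: local descriptors are clustered into a visual vocabulary, and an image (or ROI) is then described by the normalized histogram of the visual words its descriptors fall into. Using scikit-learn's KMeans here is an illustrative choice, not something specified by the application:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_local_descriptors, num_words=64):
    """all_local_descriptors: (N, D) local descriptors pooled over many training images."""
    return KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(all_local_descriptors)

def global_descriptor(local_descriptors, vocabulary):
    """Histogram of visual words, L1-normalized, used as the image's global descriptor."""
    words = vocabulary.predict(local_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```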
计算值,可以指多个数据的数学计算值,该数学计算可以为取平均、或者取最大、或者取最小,或者其他。The calculated value may refer to a mathematical calculation value of multiple data, and the mathematical calculation may be an average, a maximum, or a minimum, or others.
为了下述各实施例的描述清楚简洁,首先给出相关技术的简要介绍:In order to describe the following embodiments clearly and concisely, a brief introduction of the related technology is given first:
近年来,终端设备的功能越来越丰富,给用户带来了更好的使用体验。例如,终端设备可以实现虚拟现实(virtual reality,VR)功能,使用户处于虚拟世界中,体验虚拟世界。又如,终端设备可以实现增强现实(augmented reality,AR)功能,将虚拟物体与现实场景结合,并实现用户与虚拟物体互动。In recent years, the functions of terminal devices have become more and more abundant, which brings better user experience to users. For example, a terminal device can implement a virtual reality (virtual reality, VR) function, so that the user is in the virtual world and experiences the virtual world. For another example, a terminal device can implement an augmented reality (AR) function, combine virtual objects with real scenes, and enable users to interact with virtual objects.
其中，终端设备可以是智能手机、平板电脑、可穿戴设备、AR/VR设备等。本申请对终端设备的具体形态不予限定。可穿戴设备也可以称为穿戴式智能设备，是应用穿戴式技术对日常穿戴进行智能化设计、开发出可以穿戴的设备的总称，如眼镜、手套、手表、服饰及鞋等。可穿戴设备即直接穿在身上，或是整合到用户的衣服或配件的一种便携式设备。可穿戴设备不仅仅是一种硬件设备，更是通过软件支持以及数据交互、云端交互来实现强大的功能。广义穿戴式智能设备包括功能全、尺寸大、可不依赖智能手机实现完整或者部分的功能，例如：智能手表或智能眼镜等，以及只专注于某一类应用功能，需要和其它设备如智能手机配合使用，如各类进行体征监测的智能手环、智能首饰等。The terminal device may be a smartphone, a tablet computer, a wearable device, an AR/VR device, or the like. This application does not limit the specific form of the terminal device. A wearable device, also called a wearable smart device, is a general term for wearable devices that are intelligently designed and developed for daily wear by using wearable technologies, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothes or accessories. A wearable device is not only a hardware device, but also implements powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include devices that are full-featured and large-sized and can implement complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used together with other devices such as smartphones, for example, various smart bracelets and smart jewelry for monitoring physical signs.
在本申请中,终端设备的结构可以如图1所示。如图1所示,终端设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。In this application, the structure of the terminal device may be as shown in FIG. 1 . As shown in FIG. 1 , the terminal device 100 may include a processor 110 , an external memory interface 120 , an internal memory 121 , a universal serial bus (USB) interface 130 , a charging management module 140 , a power management module 141 , and a battery 142 , Antenna 1, Antenna 2, Mobile Communication Module 150, Wireless Communication Module 160, Audio Module 170, Speaker 170A, Receiver 170B, Microphone 170C, Headphone Interface 170D, Sensor Module 180, Key 190, Motor 191, Indicator 192, Camera 193 , a display screen 194, and a subscriber identification module (subscriber identification module, SIM) card interface 195 and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.
可以理解的是,本实施例示意的结构并不构成对终端设备100的具体限定。在另一些实施例中,终端设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the terminal device 100 . In other embodiments, the terminal device 100 may include more or less components than shown, or combine some components, or separate some components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。例如,在本申请中,处理器110可以确定第一图像满足异常条件的情况下,控制开启其他摄像头。The processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors. For example, in the present application, the processor 110 may control to turn on other cameras when it is determined that the first image satisfies the abnormal condition.
其中,控制器可以是终端设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。The controller may be the nerve center and command center of the terminal device 100 . The controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。In some embodiments, the processor 110 may include one or more interfaces. The interface may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous transceiver (universal asynchronous transmitter) receiver/transmitter, UART) interface, mobile industry processor interface (MIPI), general-purpose input/output (GPIO) interface, subscriber identity module (SIM) interface, and / or universal serial bus (universal serial bus, USB) interface, etc.
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实 现终端设备100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现终端设备100的显示功能。The MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 . MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc. In some embodiments, the processor 110 communicates with the camera 193 through a CSI interface to implement the shooting function of the terminal device 100. The processor 110 communicates with the display screen 194 through the DSI interface to implement the display function of the terminal device 100 .
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头193,显示屏194,无线通信模块160,音频模块170,传感器模块180等。GPIO接口还可以被配置为I2C接口,I2S接口,UART接口,MIPI接口等。The GPIO interface can be configured by software. The GPIO interface can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface can also be configured as I2C interface, I2S interface, UART interface, MIPI interface, etc.
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为终端设备100充电,也可以用于终端设备100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他终端设备,例如AR设备等。The USB interface 130 is an interface that conforms to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, and the like. The USB interface 130 can be used to connect a charger to charge the terminal device 100, and can also be used to transmit data between the terminal device 100 and peripheral devices. It can also be used to connect headphones to play audio through the headphones. This interface can also be used to connect other terminal devices, such as AR devices.
可以理解的是,本实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端设备100的结构限定。在本申请另一些实施例中,终端设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。It can be understood that the interface connection relationship between the modules illustrated in this embodiment is only a schematic illustration, and does not constitute a structural limitation of the terminal device 100 . In other embodiments of the present application, the terminal device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。The power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 . The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, and the wireless communication module 160. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110 . In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
终端设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。The wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
终端设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The terminal device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oled,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端设备100可以包括1个或N个显示屏194,N为大于1的正整数。Display screen 194 is used to display images, videos, and the like. Display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode or an active-matrix organic light-emitting diode (active-matrix organic light). emitting diode, AMOLED), flexible light-emitting diode (flex light-emitting diode, FLED), Miniled, MicroLed, Micro-oled, quantum dot light-emitting diode (quantum dot light emitting diodes, QLED) and so on. In some embodiments, the terminal device 100 may include one or N display screens 194 , where N is a positive integer greater than one.
终端设备100的显示屏194上可以显示一系列图形用户界面(graphical user interface,GUI),这些GUI都是该终端设备100的主屏幕。一般来说,终端设备100的显示屏194的尺寸是固定的,只能在该终端设备100的显示屏194中显示有限的控件。控件是一种GUI元素,它是一种软件组件,包含在应用程序中,控制着该应用程序处理的所有数据以及关于这些数据的交互操作,用户可以通过直接操作(direct manipulation)来与控件交互,从而对应用程序的有关信息进行读取或者编辑。一般而言,控件可以包括图标、按钮、菜单、选项卡、文本框、对话框、状态栏、导航栏、 Widget等可视的界面元素。A series of graphical user interfaces (graphical user interfaces, GUIs) may be displayed on the display screen 194 of the terminal device 100 , and these GUIs are the main screens of the terminal device 100 . Generally speaking, the size of the display screen 194 of the terminal device 100 is fixed, and only limited controls can be displayed in the display screen 194 of the terminal device 100 . A control is a GUI element, which is a software component that is included in an application and controls all the data processed by the application and the interaction with this data. The user can interact with the control through direct manipulation (direct manipulation). , so as to read or edit the relevant information of the application. In general, controls may include icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, and other visual interface elements.
终端设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。The terminal device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。The ISP is used to process the data fed back by the camera 193 . For example, when taking a photo, the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin tone. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193 .
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,终端设备100可以包括1个或N个摄像头193,N为大于1的正整数。Camera 193 is used to capture still images or video. The object is projected through the lens to generate an optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. DSP converts digital image signals into standard RGB, YUV and other formats of image signals. In some embodiments, the terminal device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当终端设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。A digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy, and the like.
视频编解码器用于对数字视频压缩或解压缩。终端设备100可以支持一种或多种视频编解码器。这样,终端设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in various encoding formats, for example, moving picture experts group (moving picture experts group, MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, such as the transfer mode between neurons in the human brain, it can quickly process the input information, and can continuously learn by itself. Applications such as intelligent cognition of the terminal device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展终端设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal device 100 . The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example to save files like music, video etc in external memory card.
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行终端设备100的各种功能应用以及数据处理。例如,在本实施例中,处理器110可以通过执行存储在内部存储器121中的指令,获取终端设备100的位姿。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储终端设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行终端设备100的各种功能应用以及数据处理。Internal memory 121 may be used to store computer executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the terminal device 100 by executing the instructions stored in the internal memory 121 . For example, in this embodiment, the processor 110 may acquire the pose of the terminal device 100 by executing the instructions stored in the internal memory 121 . The internal memory 121 may include a storage program area and a storage data area. The storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like. The storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal device 100 and the like. In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the terminal device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
终端设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。The terminal device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playback, recording, etc.
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。The audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。终端设备100可以通过扬声器170A收听音乐,或收听免提通话。 Speaker 170A, also referred to as a "speaker", is used to convert audio electrical signals into sound signals. The terminal device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当终端设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。The receiver 170B, also referred to as "earpiece", is used to convert audio electrical signals into sound signals. When the terminal device 100 answers a call or a voice message, the voice can be answered by placing the receiver 170B close to the human ear.
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。终端设备100可以设置至少一个麦克风170C。在另一些实施例中,终端设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,终端设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。The microphone 170C, also called "microphone" or "microphone", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can make a sound by approaching the microphone 170C through a human mouth, and input the sound signal into the microphone 170C. The terminal device 100 may be provided with at least one microphone 170C. In other embodiments, the terminal device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the terminal device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。The earphone jack 170D is used to connect wired earphones. The earphone interface 170D can be the USB interface 130, or can be a 3.5mm open mobile terminal platform (OMTP) standard interface, a cellular telecommunications industry association of the USA (CTIA) standard interface.
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。终端设备100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏194,终端设备100根据压力传感器180A检测所述触摸操作强度。终端设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。The pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194 . There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, and the like. The capacitive pressure sensor may be comprised of at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes. The terminal device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the terminal device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The terminal device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
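As a toy illustration of the threshold behaviour described above (the threshold value, pressure scale, and return strings are made up; the real mapping is defined by the operating system):

```python
FIRST_PRESSURE_THRESHOLD = 0.5   # assumed normalized pressure threshold

def on_sms_icon_touch(pressure):
    """Dispatch a touch on the SMS application icon by touch-operation intensity."""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "view_sms"        # lighter press: execute the instruction to view messages
    return "new_sms"             # firmer press: execute the instruction to create a new message
```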
陀螺仪传感器180B可以用于确定终端设备100的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定终端设备100围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器180B检测终端设备100抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消终端设备100的抖动,实现防抖。陀螺仪传感器180B还可以用于导航,体感游戏场景。The gyro sensor 180B may be used to determine the motion attitude of the terminal device 100 . In some embodiments, the angular velocity of the end device 100 about three axes (ie, the x, y and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shaking angle of the terminal device 100, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to offset the shaking of the terminal device 100 through reverse motion to achieve anti-shake. The gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
气压传感器180C用于测量气压。在一些实施例中,终端设备100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。The air pressure sensor 180C is used to measure air pressure. In some embodiments, the terminal device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
磁传感器180D包括霍尔传感器。终端设备100可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当终端设备100是翻盖机时,终端设备100可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。The magnetic sensor 180D includes a Hall sensor. The terminal device 100 can detect the opening and closing of the flip holster using the magnetic sensor 180D. In some embodiments, when the terminal device 100 is a flip machine, the terminal device 100 can detect the opening and closing of the flip according to the magnetic sensor 180D. Further, according to the detected opening and closing state of the leather case or the opening and closing state of the flip cover, characteristics such as automatic unlocking of the flip cover are set.
加速度传感器180E可检测终端设备100在各个方向上(一般为三轴)加速度的大小。当终端设备100静止时可检测出重力的大小及方向。还可以用于识别终端设备姿态,应用于横竖屏切换,计步器等应用。The acceleration sensor 180E can detect the magnitude of the acceleration of the terminal device 100 in various directions (generally three axes). The magnitude and direction of gravity can be detected when the terminal device 100 is stationary. It can also be used to identify the posture of terminal devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
距离传感器180F,用于测量距离。终端设备100可以通过红外或激光测量距离。在一些实施例中,拍摄场景,终端设备100可以利用距离传感器180F测距以实现快速对焦。Distance sensor 180F for measuring distance. The terminal device 100 can measure the distance through infrared or laser. In some embodiments, when shooting a scene, the terminal device 100 can use the distance sensor 180F to measure the distance to achieve fast focusing.
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。终端设备100通过发光二极管向外发射红外光。终端设备100使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定终端设备100附近有物体。当检测到不充分的反射光时,终端设备100可以确定终端设备100附近没有物体。终端设备100可以利用接近光传感器180G检测用户手持终端设备100贴近耳朵通话,以便自动熄灭屏幕达到省电的目的。接近光传感器180G也可用于皮套模式,口袋模式自动解锁与锁屏。Proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The light emitting diodes may be infrared light emitting diodes. The terminal device 100 emits infrared light to the outside through the light emitting diode. The terminal device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it can be determined that there is an object near the terminal device 100 . When insufficient reflected light is detected, the terminal device 100 may determine that there is no object near the terminal device 100 . The terminal device 100 can use the proximity light sensor 180G to detect that the user holds the terminal device 100 close to the ear to talk, so as to automatically turn off the screen to save power. Proximity light sensor 180G can also be used in holster mode, pocket mode automatically unlocks and locks the screen.
环境光传感器180L用于感知环境光亮度。终端设备100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测终端设备100是否在口袋里,以防误触。The ambient light sensor 180L is used to sense ambient light brightness. The terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the terminal device 100 is in a pocket, so as to prevent accidental touch.
指纹传感器180H用于采集指纹。终端设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。The fingerprint sensor 180H is used to collect fingerprints. The terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, accessing application locks, taking photos with fingerprints, answering incoming calls with fingerprints, and the like.
温度传感器180J用于检测温度。在一些实施例中,终端设备100利用温度传感器180J检测的温度,执行温度处理策略。例如,当温度传感器180J上报的温度超过阈值,终端设备100执行降低位于温度传感器180J附近的处理器的性能,以便降低功耗实施热保护。在另一些实施例中,当温度低于另一阈值时,终端设备100对电池142加热,以避免低温导致终端设备100异常关机。在其他一些实施例中,当温度低于又一阈值时,终端设备100对电池142的输出电压执行升压,以避免低温导致的异常关机。The temperature sensor 180J is used to detect the temperature. In some embodiments, the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute the temperature processing strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the terminal device 100 reduces the performance of the processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is lower than another threshold, the terminal device 100 heats the battery 142 to avoid abnormal shutdown of the terminal device 100 caused by the low temperature. In some other embodiments, when the temperature is lower than another threshold, the terminal device 100 boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
触摸传感器180K,也称“触控器件”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于终端设备100的表面,与显示屏194所处的位置不同。Touch sensor 180K, also called "touch device". The touch sensor 180K may be disposed on the display screen 194 , and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”. The touch sensor 180K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to touch operations may be provided through display screen 194 . In other embodiments, the touch sensor 180K may also be disposed on the surface of the terminal device 100 , which is different from the position where the display screen 194 is located.
骨传导传感器180M可以获取振动信号。在一些实施例中,骨传导传感器180M可以获取人体声部振动骨块的振动信号。骨传导传感器180M也可以接触人体脉搏,接收血压跳动信号。在一些实施例中,骨传导传感器180M也可以设置于耳机中,结合成骨传导耳机。音频模块170可以基于所述骨传导传感器180M获取的声部振动骨 块的振动信号,解析出语音信号,实现语音功能。应用处理器可以基于所述骨传导传感器180M获取的血压跳动信号解析心率信息,实现心率检测功能。The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice. The bone conduction sensor 180M can also contact the pulse of the human body and receive the blood pressure beating signal. In some embodiments, the bone conduction sensor 180M can also be disposed in the earphone, combined with the bone conduction earphone. The audio module 170 can analyze the voice signal based on the vibration signal of the voice vibration bone block obtained by the bone conduction sensor 180M, and realize the voice function. The application processor can analyze the heart rate information based on the blood pressure beat signal obtained by the bone conduction sensor 180M, and realize the function of heart rate detection.
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。终端设备100可以接收按键输入,产生与终端设备100的用户设置以及功能控制有关的键信号输入。The keys 190 include a power-on key, a volume key, and the like. Keys 190 may be mechanical keys. It can also be a touch key. The terminal device 100 may receive key input and generate key signal input related to user settings and function control of the terminal device 100 .
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。Motor 191 can generate vibrating cues. The motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, playing audio, etc.) can correspond to different vibration feedback effects. The motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 . Different application scenarios (for example: time reminder, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。The indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
另外,在上述部件之上,运行有操作系统。例如苹果公司所开发的iOS操作系统,谷歌公司所开发的Android开源操作系统,微软公司所开发的Windows操作系统等。在该操作系统上可以安装运行应用程序。In addition, an operating system runs on the above-mentioned components. For example, the iOS operating system developed by Apple, the Android open source operating system developed by Google, and the Windows operating system developed by Microsoft. Applications can be installed and run on this operating system.
终端设备100的操作系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构的Android系统为例,示例性说明终端设备100的软件结构。The operating system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take an Android system with a layered architecture as an example to exemplarily describe the software structure of the terminal device 100 .
图2是本申请实施例的终端设备100的软件结构框图。FIG. 2 is a block diagram of a software structure of a terminal device 100 according to an embodiment of the present application.
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, an Android runtime (Android runtime) and a system library, and a kernel layer.
应用程序层可以包括一系列应用程序包。如图2所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。例如,在拍照时,相机应用可以访问应用程序框架层提供的相机接口管理服务。The application layer can include a series of application packages. As shown in Figure 2, the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on. For example, when taking pictures, the camera application can access the camera interface management service provided by the application framework layer.
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。如图2所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。例如,在本申请实施例中,在拍照时,应用程序框架层可以为应用程序层提供拍照功能相关的API,并为应用程序层提供相机接口管理服务,以实现拍照功能。The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions. As shown in Figure 2, the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like. For example, in this embodiment of the present application, when taking pictures, the application framework layer may provide the application layer with APIs related to the photographing function, and provide the application layer with a camera interface management service to realize the photographing function.
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。Content providers are used to store and retrieve data and make these data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications. A display interface can consist of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.
电话管理器用于提供终端设备100的通信功能。例如通话状态的管理(包括接通, 挂断等)。The telephony manager is used to provide the communication function of the terminal device 100 . For example, the management of call status (including connecting, hanging up, etc.).
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。The resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,终端设备振动,指示灯闪烁等。The notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc. The notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the terminal device vibrates, and the indicator light flashes.
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。The core library consists of two parts: one part is the functions that the java language needs to call, and the other part is the core library of Android.
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。The application layer and the application framework layer run in virtual machines. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, safety and exception management, and garbage collection.
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。A system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。The Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:动态图像专家组(moving picture experts group,MPEG)4,H.264,MP3,高级音频编码(advanced audio coding,AAC),自适应多速率(adaptive multi rate,AMR),联合摄影制图专家组(joint photo graphic experts group,JPEG),可移植网络图形格式(portable network graphic format,PNG)等。The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: moving picture experts group (MPEG) 4, H.264, MP3, advanced audio coding (AAC), adaptive multi-rate (adaptive) multi rate, AMR), joint photographic experts group (joint photo graphic experts group, JPEG), portable network graphic format (portable network graphic format, PNG) and so on.
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
二维(2 dimensions,2D)图形引擎是2D绘图的绘图引擎。Two-dimensional (2 dimensions, 2D) graphics engine is a drawing engine for 2D drawing.
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。The kernel layer is the layer between hardware and software. The kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.
需要说明的是,本申请实施例虽然以Android系统为例进行说明,但是其基本原理同样适用于基于其他操作系统(例如iOS、Windows等)的终端设备。It should be noted that although the embodiments of the present application are described by taking the Android system as an example, the basic principles are also applicable to terminal devices based on other operating systems (for example, iOS, Windows, etc.).
下面将结合附图,对本申请中的技术方案进行描述。The technical solutions in the present application will be described below with reference to the accompanying drawings.
本申请实施例提供的三维对象注册方法能够应用在AR、VR以及需要显示对象的虚拟内容的场景。具体而言,本申请实施例的三维对象注册方法能够应用在AR场景中,下面结合图1和AR场景,示例性说明终端设备100软件以及硬件的工作流程。The three-dimensional object registration method provided by the embodiments of the present application can be applied to AR, VR, and scenarios where virtual content of an object needs to be displayed. Specifically, the three-dimensional object registration method in this embodiment of the present application can be applied in an AR scene. The following describes the workflow of the software and hardware of the terminal device 100 with reference to FIG. 1 and the AR scene.
触摸传感器180K接收到触摸操作,上报给处理器110,使得处理器响应于上述触摸操作,启动该AR应用,并在显示屏194上显示该AR应用的用户界面。例如,触摸传感器180K当接收到对AR图标的触摸操作后,向处理器110上报对AR图标的触摸操作,使得处理器110响应于上述触摸操作,启动AR图标对应的AR应用,并在显示屏194上显示AR的用户界面。此外,本申请实施例中还可以通过其它方式使得 终端启动AR,并在显示屏194上显示AR的用户界面。例如,终端当黑屏、显示锁屏界面或者解锁后显示某一用户界面时,可以响应于用户的语音指令或者快捷操作等,启动AR,并在显示屏194上显示AR的用户界面。The touch sensor 180K receives the touch operation and reports it to the processor 110 , so that the processor starts the AR application in response to the above-mentioned touch operation, and displays the user interface of the AR application on the display screen 194 . For example, after receiving the touch operation on the AR icon, the touch sensor 180K reports the touch operation on the AR icon to the processor 110, so that the processor 110 responds to the above touch operation, starts the AR application corresponding to the AR icon, and displays it on the display screen. 194 shows the user interface for AR. In addition, in this embodiment of the present application, the terminal may also enable AR in other manners, and display the AR user interface on the display screen 194. For example, when the terminal is on a black screen, displays a lock screen interface, or displays a certain user interface after being unlocked, the terminal may start AR in response to a user's voice command or a shortcut operation, and display the AR user interface on the display screen 194 .
终端设备100中用于对对象的位姿检测和跟踪的APP中,配置了用于检测对象的位姿的网络。当用户在终端设备100中启动用于对对象的位姿检测和跟踪的APP时,终端设备100通过摄像头193捕获视野中的图像,由检测对象的位姿的网络识别该图像中包括的可识别对象,并获取可识别对象的位姿,再根据获取的位姿,通过显示屏194在捕获图像上叠加显示可识别对象对应的虚拟内容。In the APP used for pose detection and tracking of objects in the terminal device 100, a network for detecting the pose of an object is configured. When the user starts the APP for pose detection and tracking of objects in the terminal device 100, the terminal device 100 captures an image in the field of view through the camera 193, the network for detecting the pose of the object recognizes the identifiable objects included in the image and acquires their poses, and then, according to the acquired poses, the virtual content corresponding to the identifiable objects is superimposed and displayed on the captured image through the display screen 194.
示例性的,在商场购物、博物馆、展览馆参观等过程中,终端设备100在捕获场景图像后,识别图像中可识别对象,按照可识别对象的位姿,将对应的讲解内容在三维空间中精确地叠加在三维物体的不同位置呈现给用户。如图3示意了讲解信息与真实物体的虚实融合效果,可以直观生动形象地将图像中真实物体的信息呈现给用户。Exemplarily, in the process of shopping in a mall or visiting a museum or an exhibition hall, after capturing a scene image, the terminal device 100 recognizes the identifiable objects in the image and, according to the pose of each identifiable object, accurately superimposes the corresponding explanation content at different positions of the three-dimensional object in three-dimensional space to present it to the user. Figure 3 shows the virtual-real fusion effect of the explanation information and the real object, which can intuitively and vividly present the information of the real object in the image to the user.
在当前的显示对象的虚拟内容的场景中,通常是在线下会根据对象的三维模型,对终端设备100中配置的用于检测对象的位姿的网络进行训练,使得该网络支持多个对象的位姿识别,即完成了将待识别的对象注册于该网络中。3D对象位姿检测在实际应用过程中,需要高效快速的对可识别的对象进行添加和删除。当前,通常基于机器学习的方法对可识别的对象进行添加,对于每一个新增的对象,都需要对所有可识别的物体重新进行训练,会导致训练时间的线性增长以及影响已经训练好的物体识别效果。In the current scenarios where the virtual content of objects is displayed, the network configured in the terminal device 100 for detecting the pose of objects is usually trained offline according to the three-dimensional models of the objects, so that the network supports pose recognition of multiple objects, that is, the objects to be recognized are registered in the network. In the actual application of 3D object pose detection, identifiable objects need to be added and deleted efficiently and quickly. At present, identifiable objects are usually added based on machine learning methods. For each newly added object, all identifiable objects need to be retrained, which leads to a linear increase in training time and degrades the recognition performance of the objects that have already been trained.
图4示意了一种现有的位姿检测流程,如图4所示,包含对象的图片输入多对象位姿估计网络,多对象位姿估计网络输出该图片中包含的可识别对象的位姿及类别,并对输出的位姿进行位姿优化。该流程中用到的多对象位姿估计网络为离线训练所生成,当用户需要添加新的可识别对象时,需要和原有网络所支持的所有可识别对象一起重新进行训练,得到一个新的多对象位姿估计网络,从而实现对新增可识别对象的位姿检测的支持。这样一来,新增对象重新进行训练会导致训练时间急剧增加,新增对象会影响已经训练好的物体位姿检测效果,导致检测准确率、成功率下降。Figure 4 illustrates an existing pose detection process. As shown in Figure 4, a picture containing an object is input into a multi-object pose estimation network, the multi-object pose estimation network outputs the poses and categories of the identifiable objects contained in the picture, and pose optimization is performed on the output poses. The multi-object pose estimation network used in this process is generated by offline training. When the user needs to add a new identifiable object, it needs to be retrained together with all the identifiable objects supported by the original network to obtain a new multi-object pose estimation network, so as to support pose detection of the newly added identifiable object. As a result, retraining for newly added objects leads to a sharp increase in training time, and the newly added objects affect the pose detection performance of the objects that have already been trained, resulting in a decrease in detection accuracy and success rate.
为了解决新增对象重新训练所有可识别物体带来的训练时间增加的问题,业界提出了增量式学习的方法,该方法的过程可以如图5a所示,用户提交期望识别的对象的三维模型,根据提交的三维模型训练一个多对象位姿估计网络M0;当新增可识别的对象时,用户新提交期望识别的对象的三维模型,基于训练好的网络M0,采用已经训练好的对象的少量数据进行增量训练,得到新的网络M1;当再新增可识别的对象时,用户继续提交期望识别的对象的三维模型,基于训练好的网络M1,采用已经训练好的对象的少量数据进行增量训练,得到新的网络M2,以此类推。该方案当有新的待识别的对象增加到训练集时,仅采用已经训练好的物体的少量数据,在现有模型的基础上进行增量式学习,可以极大的减少重新训练的时间。但由于增量式学习面临灾难性遗忘问题,即新模型的训练仅参考已识别对象的少量数据,随着新增对象数量的增加,已经训练好的对象性能就会急剧下降。In order to solve the problem of increased training time caused by retraining all identifiable objects for new objects, the industry has proposed an incremental learning method. The process of this method can be shown in Figure 5a. The user submits a 3D model of the object to be recognized. , train a multi-object pose estimation network M0 according to the submitted 3D model; when adding a new identifiable object, the user submits a new 3D model of the object to be recognized, based on the trained network M0, using the trained object's 3D model A small amount of data is incrementally trained to obtain a new network M1; when recognizable objects are added, the user continues to submit the 3D model of the object expected to be recognized, and based on the trained network M1, a small amount of data of the trained object is used Perform incremental training to get a new network M2, and so on. When a new object to be recognized is added to the training set, this scheme only uses a small amount of data of the trained object, and performs incremental learning on the basis of the existing model, which can greatly reduce the retraining time. However, since incremental learning faces the problem of catastrophic forgetting, that is, the training of new models only refers to a small amount of data of recognized objects, and as the number of new objects increases, the performance of already trained objects will drop sharply.
基于此,本申请提供了一种三维对象注册方法,具体包括:配置单对象位姿检测网络,采用三维对象的真实图像与多个位姿下可微渲染得到的合成图像的差异构建损失函数,来训练单对象位姿检测网络,以得到用于提取该三维对象的位姿检测网络,再利用训练后的位姿检测网络提取该三维对象的真实图像以及多个位姿下可微渲染合成图像中该三维对象的特征,记录特征与该三维对象的标识,完成该三维对象的注册。由于本申请提供的三维对象注册方法,采用单对象位姿检测网络,即使新增可识别对象,训练时间短,也不会影响其他可识别对象的识别效果;另外,采用三维对象的真实图像与多个位姿下可微渲染合成图像的差异构建损失函数,提高了单对象位姿检测网络的精度。Based on this, the present application provides a three-dimensional object registration method, which specifically includes: configuring a single-object pose detection network, and constructing a loss function using the difference between real images of the three-dimensional object and composite images obtained by differentiable rendering under multiple poses to train the single-object pose detection network, so as to obtain a pose detection network for the three-dimensional object; and then using the trained pose detection network to extract features of the three-dimensional object from its real images and from the composite images rendered differentiably under multiple poses, and recording the features together with the identifier of the three-dimensional object, thereby completing the registration of the three-dimensional object. Because the three-dimensional object registration method provided in this application adopts a single-object pose detection network, even if a new identifiable object is added, the training time is short and the recognition performance of other identifiable objects is not affected; in addition, constructing the loss function from the difference between the real images of the three-dimensional object and the composite images rendered differentiably under multiple poses improves the accuracy of the single-object pose detection network.
下面从模型训练侧和模型应用侧对本申请提供的方法进行描述:The method provided by this application is described below from the model training side and the model application side:
本申请实施例提供的获取对象位姿的方法,涉及计算机视觉的处理,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据(如本申请中的对象的图像)进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的单对象位姿检测网络;并且,本申请实施例提供的对象注册方法可以运用上述训练好的单对象位姿检测网络,将输入数据(如本申请中的包括待识别对象的图像)输入到所述训练好的该待识别对象对应的单对象位姿检测网络中,得到输出数据(如本申请中的图像中可识别对象的位姿)。需要说明的是,本申请实施例提供的单对象位姿检测网络的训练方法和对象注册方法是基于同一个构思产生的发明,也可以理解为一个系统中的两个部分,或一个整体流程的两个阶段:如模型训练阶段和模型应用阶段。The method for obtaining the pose of an object provided by the embodiments of the present application involves computer vision processing, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning, in which symbolized and formalized intelligent information modeling, extraction, preprocessing and training are performed on training data (such as the images of objects in this application) to finally obtain a trained single-object pose detection network. In addition, the object registration method provided by the embodiments of the present application can use the above trained single-object pose detection network: input data (such as an image including the object to be recognized in this application) is input into the trained single-object pose detection network corresponding to the object to be recognized, and output data (such as the pose of the identifiable object in the image in this application) is obtained. It should be noted that the training method of the single-object pose detection network and the object registration method provided by the embodiments of the present application are inventions based on the same concept, and can also be understood as two parts of a system, or two stages of an overall process, such as a model training stage and a model application stage.
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。Since the embodiments of the present application involve a large number of neural network applications, for ease of understanding, related terms and neural networks and other related concepts involved in the embodiments of the present application are first introduced below.
(1)神经网络(neural network,NN)(1) Neural Network (NN)
神经网络是机器学习模型,是一种模拟人脑的神经网络以能够实现类人工智能的机器学习技术。可以根据实际需求配置神经网络的输入及输出,并通过样本数据对神经网络训练,以使得其输出与样本数据对应的真实输出的误差最小。神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为: Neural network is a machine learning model, which is a kind of machine learning technology that simulates the neural network of the human brain to realize artificial intelligence. The input and output of the neural network can be configured according to actual needs, and the neural network can be trained through sample data, so that the error between its output and the real output corresponding to the sample data is minimized. A neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x s and an intercept 1 as input, and the output of the operation unit can be:
$h_{W,b}(x)=f\left(W^{T}x\right)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。 Among them, s=1, 2, ... n, n is a natural number greater than 1, W s is the weight of x s , and b is the bias of the neural unit. f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
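For illustration, the output of such a neural unit can be computed with a few lines of NumPy; the sigmoid activation and the example numbers below are arbitrary choices used only to make the formula concrete:

```python
import numpy as np

def sigmoid(z):
    # sigmoid activation function f mentioned above
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # x: inputs x_1..x_n, W: weights W_1..W_n, b: bias of the unit
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])   # example inputs
W = np.array([0.8, 0.1, -0.4])   # example weights
b = 0.2                          # example bias
print(neural_unit(x, W, b))
```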
(2)深度神经网络(2) Deep neural network
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的"很多"并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:$\vec{y}=\alpha(W\vec{x}+\vec{b})$,其中,$\vec{x}$是输入向量,$\vec{y}$是输出向量,$\vec{b}$是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多,则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,"容量"也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. According to the position of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is actually not complicated; in short, it is the following linear relationship expression: $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are many coefficient matrices W and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$, where the superscript 3 represents the number of the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W_{jk}^{L}$. It should be noted that the input layer has no W parameters. In a deep neural network, more hidden layers allow the network to better capture the complexities of the real world. In theory, a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
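As a concrete illustration of the layer-wise relation $\vec{y}=\alpha(W\vec{x}+\vec{b})$, the following NumPy sketch performs a forward pass through a small fully connected network; the layer sizes, the tanh activation, and the random weights are arbitrary illustrative choices:

```python
import numpy as np

def forward(x, layers):
    # layers: list of (W, b) pairs; each layer computes y = alpha(W @ x + b)
    for W, b in layers:
        x = np.tanh(W @ x + b)   # tanh as an example activation alpha()
    return x

rng = np.random.default_rng(0)
# a small DNN: 4 inputs -> 5 hidden units -> 2 outputs
layers = [(rng.standard_normal((5, 4)), rng.standard_normal(5)),
          (rng.standard_normal((2, 5)), rng.standard_normal(2))]
# W[j, k] is the coefficient from neuron k of the previous layer
# to neuron j of the current layer, matching the W^L_jk notation above.
print(forward(rng.standard_normal(4), layers))
```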
(3)卷积神经网络(3) Convolutional Neural Network
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。Convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter. The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. In a convolutional layer of a convolutional neural network, a neuron can only be connected to some of its neighbors. A convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, the same learned image information can be used. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network. In addition, the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
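The sliding of a shared weight matrix (convolution kernel) over the input, with a configurable stride and several kernels stacked along the depth dimension, can be illustrated with the following NumPy sketch; the kernels shown are arbitrary examples (an averaging kernel and a vertical-edge kernel):

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    # image: (H, W) single-channel input; kernels: (K, kh, kw), K shared weight matrices
    K, kh, kw = kernels.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((K, out_h, out_w))
    for k in range(K):                      # each kernel extracts one kind of feature
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[k, i, j] = np.sum(patch * kernels[k])   # same weights at every position
    return out                               # feature maps stacked along the depth dimension

img = np.arange(36, dtype=float).reshape(6, 6)
kernels = np.stack([np.ones((3, 3)) / 9.0,                                             # averaging kernel
                    np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)])      # edge kernel
print(conv2d(img, kernels, stride=1).shape)   # (2, 4, 4)
```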
(4)循环神经网络(recurrent neural networks,RNN)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含 层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法Back propagation Through Time(BPTT)。(4) Recurrent neural networks (RNN) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, and each node in each layer is unconnected. Although this ordinary neural network solves many problems, it is still powerless for many problems. For example, if you want to predict the next word of a sentence, you generally need to use the previous words, because the front and rear words in a sentence are not independent. The reason why RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous output. The specific manifestation is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the nodes between the hidden layer and this layer are no longer unconnected but connected, and the input of the hidden layer not only includes The output of the input layer also includes the output of the hidden layer at the previous moment. In theory, RNNs can process sequence data of any length. The training of RNN is the same as the training of traditional CNN or DNN. The error back-propagation algorithm is also used, but there is a difference: that is, if the RNN is expanded, the parameters, such as W, are shared; while the traditional neural network mentioned above is not the case. And in the gradient descent algorithm, the output of each step depends not only on the network of the current step, but also on the state of the network in the previous steps. This learning algorithm is called the Back propagation Through Time (BPTT) algorithm based on time.
既然已经有了卷积神经网络,为什么还要循环神经网络?原因很简单,在卷积神经网络中,有一个前提假设是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,再比如一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去。这里填空,人类应该都知道是填“云南”。因为人类会根据上下文的内容进行推断,但如何让机器做到这一步?RNN就应运而生了。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。Why use a recurrent neural network when you already have a convolutional neural network? The reason is very simple. In the convolutional neural network, there is a premise that the elements are independent of each other, and the input and output are also independent, such as cats and dogs. But in the real world, many elements are interconnected, such as the change of stocks over time, and another example of a person who said: I like to travel, and my favorite place is Yunnan. I must go there in the future. Fill in the blanks here. Humans should all know that it is "Yunnan". Because humans make inferences based on the content of the context, but how do you get machines to do this? RNN came into being. RNNs are designed to give machines the ability to memorize like humans do. Therefore, the output of RNN needs to rely on current input information and historical memory information.
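The way an RNN combines the current input with memorized information can be illustrated with the following minimal NumPy sketch of a recurrent cell; the dimensions, the tanh activation, and the random weights are arbitrary illustrative choices, and the same weight matrices are reused at every time step:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, b_y):
    # inputs: sequence of input vectors; the same weights are shared at every time step
    h = np.zeros(W_hh.shape[0])                  # initial hidden state (the "memory")
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # new memory depends on input and old memory
        outputs.append(W_hy @ h + b_y)           # output depends on the current memory
    return outputs

rng = np.random.default_rng(1)
seq = [rng.standard_normal(3) for _ in range(4)]     # a toy sequence of length 4
W_xh, W_hh = rng.standard_normal((5, 3)), rng.standard_normal((5, 5))
W_hy, b_h, b_y = rng.standard_normal((2, 5)), np.zeros(5), np.zeros(2)
print(len(rnn_forward(seq, W_xh, W_hh, W_hy, b_h, b_y)))   # 4 outputs, one per time step
```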
(6)损失函数(6) Loss function
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。In the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then based on the difference between the two to update the weight vector of each layer of neural network (of course, there is usually an initialization process before the first update, that is, to pre-configure parameters for each layer in the deep neural network), for example, if the predicted value of the network If it is high, adjust the weight vector to make its prediction lower, and keep adjusting until the deep neural network can predict the real desired target value or a value that is very close to the real desired target value. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function (loss function) or objective function (objective function), which are used to measure the difference between the predicted value and the target value. important equation. Among them, taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference, then the training of the deep neural network becomes the process of reducing the loss as much as possible.
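As a concrete illustration of training as loss minimization, the following NumPy sketch uses a mean-squared-error loss for a simple linear predictor and repeatedly adjusts the weight vector along the negative gradient; the data, learning rate, and number of steps are arbitrary illustrative choices:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # measures the difference between the predicted values and the target values
    return np.mean((y_pred - y_true) ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))        # 8 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # target values from a "true" weight vector
w = np.zeros(3)                        # initial weights
lr = 0.1                               # learning rate

for _ in range(100):
    y_pred = X @ w
    grad = 2 * X.T @ (y_pred - y) / len(y)   # gradient of the MSE loss w.r.t. w
    w -= lr * grad                            # adjust weights to reduce the loss
print(mse_loss(X @ w, y))                     # loss shrinks toward 0 as training proceeds
```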
(7)像素值(7) Pixel value
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76*Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。The pixel value of an image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color. For example, the pixel value is 256*Red+100*Green+76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness. For grayscale images, the pixel value can be a grayscale value.
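The following small sketch makes the example above concrete; the packing coefficients (256, 100, 76) are taken verbatim from the illustrative example in the text rather than from any particular colour standard, and the grayscale computation shown is just a simple average:

```python
def packed_pixel_value(red, green, blue):
    # uses the illustrative coefficients from the example above (256, 100, 76)
    return 256 * red + 100 * green + 76 * blue

def grayscale_value(red, green, blue):
    # for a grayscale image the pixel value is a single intensity; simple average here
    return (red + green + blue) // 3

print(packed_pixel_value(10, 20, 30))   # 6840
print(grayscale_value(10, 20, 30))      # 20
```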
下面介绍本申请实施例提供的系统架构。The following describes the system architecture provided by the embodiments of the present application.
参见附图5b,本发明实施例提供了一种系统架构500。如所述系统架构500所示,数据采集设备560用于采集训练数据,本申请实施例中训练数据包括:待识别对象的真实图像和/或合成图像;并将训练数据存入数据库530,训练设备520基于数据库530中维护的训练数据训练得到目标模型/规则501。下面将以实施例二和实施例三更详细地描述训练设备520如何基于训练数据得到目标模型/规则501,该目标模型/规则501可以为本申请实施例中描述的单对象位姿检测网络(提取图像中对象位姿的第一网络),即将图像输入该目标模型/规则501,即可得到图像中包括的可识别对象的位姿;或者,该目标模型/规则501可以为本申请实施例中描述的可微渲染器,即将对象的三维模型及预设位姿输入该目标模型/规则501,即可得到对象在预设位姿下的合成图像。本申 请实施例中的目标模型/规则501具体可以为单对象位姿检测网络或可微渲染器,在本申请提供的实施例中,该单对象位姿检测网络是通过训练单对象基础位姿检测网络得到的。需要说明的是,在实际的应用中,所述数据库530中维护的训练数据不一定都来自于数据采集设备560的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备520也不一定完全基于数据库530维护的训练数据进行目标模型/规则501的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。Referring to FIG. 5b, an embodiment of the present invention provides a system architecture 500. As shown in the system architecture 500, the data collection device 560 is used to collect training data. In this embodiment of the present application, the training data includes: real images and/or synthetic images of objects to be recognized; The device 520 obtains the target model/rule 501 by training based on the training data maintained in the database 530 . The following will describe in more detail how the training device 520 obtains the target model/rule 501 based on the training data with the second embodiment and the third embodiment. The target model/rule 501 may be the single-object pose detection network ( The first network for extracting the pose of the object in the image), that is, inputting the image into the target model/rule 501 to obtain the pose of the identifiable object included in the image; or, the target model/rule 501 may be an embodiment of the application The differentiable renderer described in , inputting the 3D model and the preset pose of the object into the target model/rule 501 can obtain the composite image of the object in the preset pose. The target model/rule 501 in the embodiment of the present application may specifically be a single-object pose detection network or a differentiable renderer. In the embodiment provided by the present application, the single-object pose detection network is performed by training a single-object basic pose Detected by the network. It should be noted that, in practical applications, the training data maintained in the database 530 may not necessarily all come from the collection of the data collection device 560, and may also be received from other devices. In addition, it should be noted that the training device 520 may not necessarily train the target model/rule 501 entirely based on the training data maintained by the database 530, and may also obtain training data from the cloud or other places for model training. The above description should not be used as a reference to this application Limitations of Examples.
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中,如应用于图5b所示的执行设备510,所述执行设备510可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图5b中,执行设备510配置有I/O接口512,用于与外部设备进行数据交互,用户可以通过客户设备540向I/O接口512输入数据,所述输入数据在本申请实施例中可以包括:待识别对象的三维模型、待识别对象的真实图像、待识别对象的三维模型在不同位姿下渲染的合成图像。The target model/rule 501 obtained by training with the training device 520 can be applied to different systems or devices, for example, to the execution device 510 shown in FIG. 5b. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR device, or a vehicle-mounted terminal, and may also be a server or a cloud. In FIG. 5b, the execution device 510 is configured with an I/O interface 512 for data interaction with external devices. The user can input data to the I/O interface 512 through the client device 540. In this embodiment of the present application, the input data may include: a three-dimensional model of the object to be recognized, real images of the object to be recognized, and composite images rendered from the three-dimensional model of the object to be recognized in different poses.
在执行设备510的计算模块511执行计算等相关的处理过程中,执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统550中。When the computing module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 may call the data, code, etc. in the data storage system 550 for corresponding processing, and may also store the data, instructions, etc. obtained by the corresponding processing into the data storage system 550.
最后,I/O接口512将处理结果,如得到的图像中的可识别对象的虚拟内容以及位姿返回给客户设备540,从而提供给用户按照该位姿显示的虚拟内容,实现虚实结合的体验。Finally, the I/O interface 512 returns the processing result, such as the virtual content and pose of the identifiable object in the obtained image, to the client device 540, so as to provide the user with the virtual content displayed according to the pose and realize an experience combining the virtual and the real.
值得说明的是,训练设备520可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则501,该相应的目标模型/规则501即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。It is worth noting that the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or tasks, and the corresponding target models/rules 501 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
在图5b中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口512提供的界面进行操作。另一种情况下,客户设备540可以自动地向I/O接口512发送输入数据,如果要求客户设备540自动发送输入数据需要获得用户的授权,则用户可以在客户设备540中设置相应权限。用户可以在客户设备540查看执行设备510输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备540也可以作为数据采集端,采集如图5b所示输入I/O接口512的输入数据及输出I/O接口512的输出结果作为新的样本数据,并存入数据库530。当然,也可以不经过客户设备540进行采集,而是由I/O接口512直接将如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果,作为新的样本数据存入数据库530。In the case shown in FIG. 5b, the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 512. In another case, the client device 540 can automatically send the input data to the I/O interface 512; if requiring the client device 540 to automatically send the input data needs the user's authorization, the user can set the corresponding permission in the client device 540. The user can view the result output by the execution device 510 on the client device 540, and the specific presentation form can be display, sound, action, or another specific manner. The client device 540 can also be used as a data collection terminal to collect the input data of the I/O interface 512 and the output result of the I/O interface 512 as shown in FIG. 5b as new sample data and store them in the database 530. Of course, it is also possible not to collect through the client device 540; instead, the I/O interface 512 directly stores the input data of the I/O interface 512 and the output result of the I/O interface 512 as shown in the figure into the database 530 as new sample data.
值得注意的是,图5b仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图5b中,数据存储系统550相对执行设备510是外部存储器,在其它情况下,也可以将数据存储系统550置于执行设备510中。It is worth noting that FIG. 5b is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 5b, the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 may also be placed in the execution device 510.
本申请实施例提供的方法和装置还可以用于扩充训练数据库,如图5b所示执行设备510的I/O接口512可以将经执行设备处理过的图像(如待识别物体在不同位姿下渲染的合成图像)和用户输入的待识别物体的真实图像,一起作为训练数据对发送给 数据库530,以使得数据库530维护的训练数据更加丰富,从而为训练设备520的训练工作提供更丰富的训练数据。The methods and apparatuses provided in the embodiments of the present application can also be used to expand the training database. As shown in FIG. 5b, the I/O interface 512 of the execution device 510 can convert the images processed by the execution device (such as the objects to be recognized in different poses) The rendered synthetic image) and the real image of the object to be recognized input by the user are sent to the database 530 as a pair of training data, so that the training data maintained by the database 530 is more abundant, thereby providing more abundant training for the training work of the training device 520 data.
如图5b所示,根据训练设备520训练得到目标模型/规则501,该目标模型/规则501在本申请实施例中可以是单对象位姿识别网络(本申请实施例中描述的第一网络、第二网络)。在本申请实施例提供的单对象位姿识别网络都可以是卷积神经网络、循环神经网络,或者其他。As shown in FIG. 5b, a target model/rule 501 is obtained by training according to the training device 520, and the target model/rule 501 may be a single-object pose recognition network (the first network, second network). The single-object pose recognition networks provided in the embodiments of the present application may all be convolutional neural networks, recurrent neural networks, or others.
如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。As mentioned in the introduction to the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture. learning at multiple levels of abstraction. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
如图6所示,卷积神经网络(CNN)600可以包括输入层610,卷积层/池化层620(其中池化层为可选的),以及神经网络层630。As shown in FIG. 6 , a convolutional neural network (CNN) 600 may include an input layer 610 , a convolutional/pooling layer 620 (where the pooling layer is optional), and a neural network layer 630 .
卷积层/池化层620:Convolutional layer/pooling layer 620:
卷积层:Convolutional layer:
如图6所示卷积层/池化层620可以包括如示例621-626层,举例来说:在一种实现中,621层为卷积层,622层为池化层,623层为卷积层,624层为池化层,625为卷积层,626为池化层;在另一种实现方式中,621、622为卷积层,623为池化层,624、625为卷积层,626为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。As shown in Figure 6, the convolutional/pooling layer 620 may include layers 621-626 as examples, for example: in one implementation, layer 621 is a convolutional layer, layer 622 is a pooling layer, and layer 623 is a convolutional layer Layer 624 is a pooling layer, 625 is a convolutional layer, and 626 is a pooling layer; in another implementation, 621 and 622 are convolutional layers, 623 are pooling layers, and 624 and 625 are convolutional layers. layer, 626 is the pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.
下面将以卷积层621为例,介绍一层卷积层的内部工作原理。The following will take the convolutional layer 621 as an example to introduce the inner working principle of a convolutional layer.
卷积层621可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。The convolution layer 621 may include many convolution operators. The convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator is essentially Can be a weight matrix, which is usually pre-defined, usually one pixel by one pixel (or two pixels by two pixels) along the horizontal direction on the input image during the convolution operation on the image. ...It depends on the value of the stride step) to process, so as to complete the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix will be extended to Enter the entire depth of the image. Therefore, convolution with a single weight matrix will result in a single depth dimension of the convolutional output, but in most cases a single weight matrix is not used, but multiple weight matrices of the same size (row × column) are applied, That is, multiple isotype matrices. The output of each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image. Blur, etc. The multiple weight matrices have the same size (row×column), and the size of the feature maps extracted from the multiple weight matrices with the same size is also the same, and then the multiple extracted feature maps with the same size are combined to form a convolution operation. output.
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网 络600进行正确的预测。The weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions .
当卷积神经网络600有多个卷积层的时候,初始的卷积层(例如621)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络600深度的加深,越往后的卷积层(例如626)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。When the convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (eg 621 ) often extracts more general features, which can also be called low-level features; with the convolutional neural network As the depth of the network 600 deepens, the features extracted by the later convolutional layers (eg 626 ) become more and more complex, such as features such as high-level semantics. Features with higher semantics are more suitable for the problem to be solved.
池化层:Pooling layer:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图6中620所示例的621-626各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolutional layer. In the layers 621-626 exemplified by 620 in Figure 6, it can be a convolutional layer followed by a layer. The pooling layer can also be a multi-layer convolutional layer followed by one or more pooling layers. During image processing, the only purpose of pooling layers is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling. The max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
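Average and max pooling over non-overlapping windows can be illustrated with the following NumPy sketch; the 2×2 window size and the example feature map are arbitrary illustrative choices:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # non-overlapping pooling windows of shape (size, size)
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fm, size=2, mode="max"))    # 4x4 map reduced to 2x2 (maximum of each region)
print(pool2d(fm, size=2, mode="avg"))    # 2x2 map of per-region averages
```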
神经网络层630:Neural network layer 630:
在经过卷积层/池化层620的处理后,卷积神经网络600还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层620只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络600需要利用神经网络层630来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层630中可以包括多层隐含层(如图6所示的631、632至63n)以及输出层640,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……After being processed by the convolutional layer/pooling layer 620, the convolutional neural network 600 is not sufficient to output the required output information. Because as mentioned before, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network 600 needs to utilize the neural network layer 630 to generate one or a set of outputs of the desired number of classes. Therefore, the neural network layer 630 may include multiple hidden layers (631, 632 to 63n as shown in FIG. 6) and the output layer 640, and the parameters contained in the multiple hidden layers may be based on specific task types The relevant training data is pre-trained, for example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc...
在神经网络层630中的多层隐含层之后,也就是整个卷积神经网络600的最后层为输出层640,该输出层640具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络600的前向传播(如图6由610至640方向的传播为前向传播)完成,反向传播(如图6由640至610方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络600的损失,及卷积神经网络600通过输出层输出的结果和理想结果之间的误差。After the multi-layer hidden layers in the neural network layer 630, that is, the last layer of the entire convolutional neural network 600 is the output layer 640, the output layer 640 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error, Once the forward propagation of the entire convolutional neural network 600 (as shown in Fig. 6, the propagation from the direction 610 to 640 is the forward propagation) is completed, the back propagation (as shown in Fig. 6, the propagation from the 640 to 610 direction is the back propagation) will Begin to update the weight values and biases of the aforementioned layers to reduce the loss of the convolutional neural network 600 and the error between the result output by the convolutional neural network 600 through the output layer and the ideal result.
需要说明的是,如图6所示的卷积神经网络600仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。It should be noted that the convolutional neural network 600 shown in FIG. 6 is only used as an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
下面介绍本申请实施例提供的一种芯片硬件结构。The following describes a chip hardware structure provided by an embodiment of the present application.
图7a为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器(NPU)70。该芯片可以被设置在如图5b所示的执行设备510中,用以完成计算模块511的计算工作。该芯片也可以被设置在如图5b所示的训练设备520中,用以完成训练设备520的训练工作并输出目标模型/规则501。如图6所示的卷积神经网络中各层的算法均可在如图7a所示的芯片中得以实现。FIG. 7a is a hardware structure of a chip according to an embodiment of the present invention, where the chip includes a neural network processor (NPU) 70 . The chip can be set in the execution device 510 as shown in FIG. 5 b to complete the calculation work of the calculation module 511 . The chip can also be set in the training device 520 as shown in FIG. 5 b to complete the training work of the training device 520 and output the target model/rule 501 . The algorithms of each layer in the convolutional neural network shown in Figure 6 can be implemented in the chip shown in Figure 7a.
如图7a所示,NPU 70作为协处理器挂载到主中央处理器(central processing unit,CPU)(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路703,控制器704控制运算电路703提取存储器(权重存储器或输入存储器)中的数据并进行运算。As shown in Figure 7a, the NPU 70 is mounted on the main central processing unit (CPU) (Host CPU) as a co-processor, and tasks are allocated by the Host CPU. The core part of the NPU is the operation circuit 703, and the controller 704 controls the operation circuit 703 to fetch the data in the memory (weight memory or input memory) and perform operations.
在一些实现中,运算电路703内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路703是二维脉动阵列。运算电路703还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路703是通用的矩阵处理器。In some implementations, the arithmetic circuit 703 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 703 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路703从权重存储器702中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路703从输入存储器701中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器708(accumulator)中。For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 703 fetches the data corresponding to the matrix B from the weight memory 702 and buffers it on each PE in the operation circuit. The operation circuit 703 takes the data of the matrix A from the input memory 701 and performs the matrix operation on the matrix B, and stores the partial result or the final result of the matrix in the accumulator 708 (accumulator).
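The multiply-accumulate behaviour described above (weights of matrix B fetched once, data of matrix A streamed in, partial results collected in the accumulator) can be mimicked conceptually with the following NumPy sketch; this is only a software analogy of the described data flow, not an implementation of the NPU hardware:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    # C = A @ B computed tile by tile along the inner dimension,
    # accumulating partial results the way the accumulator 708 is described to do
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                     # plays the role of the accumulator
    for k0 in range(0, K, tile):
        A_tile = A[:, k0:k0 + tile]          # data fetched from the input memory
        B_tile = B[k0:k0 + tile, :]          # weights fetched from the weight memory
        C += A_tile @ B_tile                 # partial result added to the accumulator
    return C

rng = np.random.default_rng(3)
A, B = rng.standard_normal((6, 8)), rng.standard_normal((8, 5))
print(np.allclose(tiled_matmul(A, B), A @ B))   # True
```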
向量计算单元707可以对运算电路703的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元707可以用于神经网络中非卷积/非全连接层(fully connected layers,FC)层的网络计算,如池化(Pooling),批归一化(Batch Normalization),局部响应归一化(Local Response Normalization)等。The vector calculation unit 707 can further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. For example, the vector computing unit 707 can be used for network computation of non-convolutional/non-fully connected layers (FC) layers in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), local response Normalization (Local Response Normalization), etc.
在一些实现种,向量计算单元能707将经处理的输出的向量存储到统一缓存器706。例如,向量计算单元707可以将非线性函数应用到运算电路703的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元707生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路703的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector computation unit 707 can store the processed output vectors to the unified buffer 706 . For example, the vector calculation unit 707 may apply a nonlinear function to the output of the arithmetic circuit 703, eg, a vector of accumulated values, to generate activation values. In some implementations, vector computation unit 707 generates normalized values, merged values, or both. In some implementations, the vector of processed outputs can be used as activation input to the arithmetic circuit 703, eg, for use in subsequent layers in a neural network.
例如,如图6所示的卷积神经网络中各层的算法均可以由703或707执行。图5b中计算模块511、训练设备520的算法均可以由703或707执行。For example, the algorithms of each layer in the convolutional neural network shown in FIG. 6 can be executed by 703 or 707 . The algorithms of the calculation module 511 and the training device 520 in FIG. 5b can all be executed by 703 or 707 .
统一存储器706用于存放输入数据以及输出数据。Unified memory 706 is used to store input data and output data.
存储单元访问控制器(direct memory access controller,DMAC)705用于将外部存储器中的输入数据搬运到输入存储器701和/或统一存储器706,将外部存储器中的权重数据存入权重存储器702,以及将统一存储器706中的数据存入外部存储器。The direct memory access controller (DMAC) 705 is used to transfer the input data in the external memory to the input memory 701 and/or the unified memory 706, to store the weight data in the external memory into the weight memory 702, and to store the data in the unified memory 706 into the external memory.
总线接口单元(bus interface unit,BIU)710,用于通过总线实现主CPU、DMAC和取指存储器709之间进行交互。A bus interface unit (BIU) 710 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 709 through the bus.
与控制器704连接的取指存储器(instruction fetch buffer)709,用于存储控制器704使用的指令。An instruction fetch buffer 709 connected to the controller 704 is used to store the instructions used by the controller 704 .
控制器704,用于调用指存储器709中缓存的指令,实现控制该运算加速器的工作过程。The controller 704 is used for invoking the instructions cached in the memory 709 to control the working process of the operation accelerator.
示例性的,这里的数据可以为是说明数据,可以是图6所示的卷积神经网络中各层的输入或输出数据,或者,可以是图5b中计算模块511、训练设备520的输入或输出数据。Exemplarily, the data here may be descriptive data, may be the input or output data of each layer in the convolutional neural network shown in FIG. Output Data.
一般地,统一存储器706,输入存储器701,权重存储器702以及取指存储器709均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可 以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。Generally, the unified memory 706 , the input memory 701 , the weight memory 702 and the instruction fetch memory 709 are all on-chip memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access. Memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (high bandwidth memory, HBM) or other readable and writable memory.
可选的,图5b和图6中的程序算法是由主CPU和NPU共同配合完成的。Optionally, the program algorithms in FIG. 5b and FIG. 6 are jointly completed by the main CPU and the NPU.
Exemplarily, FIG. 7b illustrates the overall system framework of the solution provided by this application. As shown in FIG. 7b, the framework includes two parts: offline registration and online detection.
In the offline registration part, the user inputs a three-dimensional model of an object and real pictures, and the basic single-object pose detection network is trained according to the three-dimensional model and the real pictures to obtain the single-object pose detection network of the object. The single-object pose detection network of the object is used to detect the pose of the object contained in an image, and its detection accuracy is better than that of the basic single-object pose detection network. Further, according to the three-dimensional model and the real pictures, the obtained single-object pose detection network of the object may be used to extract features of the object for incremental object registration; the features of the object are registered together with the category of the object (which may be represented by an identifier) to obtain a multi-object category classifier. The multi-object category classifier can be used to identify the categories of recognizable objects included in an image. For the specific operation of the offline registration part, reference may be made to the specific implementation of the object registration method (for example, the object registration method illustrated in FIG. 8) provided in the following embodiments of the present application, and details are not repeated here.
In the online detection part, after an input image is acquired, multi-feature fusion classification is performed using the multi-object category classifier obtained in the offline registration part to obtain the classification results (categories) of the recognizable objects included in the input image, and the single-object pose detection networks of the objects obtained in the offline registration part are used to obtain the poses of the recognizable objects included in the input image; according to the pose of a recognizable object in the input image, the virtual content corresponding to the category of the recognizable object is presented. For the specific operation of the online detection part, reference may be made to the specific implementation of the display method (for example, the display method illustrated in FIG. 17) provided in the following embodiments of the present application, and details are not repeated here.
Embodiment 1 of the present application provides an object registration method for registering a first object, where the first object is any object to be recognized. The registration process of each object is the same; Embodiment 1 of the present application is described by taking the registration of the first object as an example, and the other objects are not described one by one.
The object registration method provided in Embodiment 1 of the present application may be executed by the execution device 510 shown in FIG. 5b. The real images in the object registration method may be the input data given by the client device 540 shown in FIG. 5b, and the calculation module 511 in the execution device 510 may be used to execute S801 to S803.
Optionally, the object registration method provided in Embodiment 1 of the present application may be processed by a CPU, may be jointly processed by a CPU and a GPU, or may use another processor suitable for neural network computation instead of a GPU, which is not limited in this application.
As shown in FIG. 8, the object registration method provided in Embodiment 1 of the present application may include:
S801. Acquire a plurality of first input images including the first object.
The plurality of first input images include real images of the first object and/or a plurality of first composite images of the first object.
In a possible implementation, the first input images may include a plurality of real images of the first object.
Since the number of real images of the object input by the user is limited, extracting feature information from only the real images of the first object in S802 cannot fully characterize the visual features of the object to be recognized under different illumination, angles, and distances. Therefore, composite images may be added on the basis of the real images to further increase the effectiveness of feature extraction. In this case, the first input images may include both real images of the first object and first composite images. The first composite images are obtained by differentiable rendering, which can reduce the difference between the first composite images and the real images.
In another possible implementation, the first input images may include a plurality of first composite images of the first object. The plurality of first composite images may be obtained by differentiable rendering of the three-dimensional model of the first object in a plurality of first poses. The plurality of first poses are different.
That the plurality of first poses are different means that the plurality of first poses correspond to different shooting angles of the camera.
Exemplarily, the plurality of first poses may be taken on a sphere, for example a plurality of poses obtained by uniform sampling on the sphere shown in FIG. 9. The density of the sampled poses on the sphere is not limited in the embodiments of the present application and may be selected according to actual requirements. Of course, the first poses may also be a plurality of different poses input by the user, which is not limited. A possible sampling scheme is sketched below.
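The following is a minimal sketch of one way to sample first poses approximately uniformly on a sphere around the object; the Fibonacci-spiral sampling and the look-at construction are assumptions for illustration, not a scheme fixed by the embodiments.

```python
import numpy as np

def sample_sphere_poses(num_poses, radius=1.0):
    # camera centers on a Fibonacci spiral over the sphere, each camera looking at the object center
    poses = []
    golden = np.pi * (3.0 - np.sqrt(5.0))
    for i in range(num_poses):
        z = 1.0 - 2.0 * (i + 0.5) / num_poses
        r = np.sqrt(1.0 - z * z)
        theta = golden * i
        center = radius * np.array([r * np.cos(theta), r * np.sin(theta), z])
        forward = -center / np.linalg.norm(center)            # viewing direction toward the origin
        tmp = np.array([0.0, 0.0, 1.0]) if abs(forward[2]) < 0.99 else np.array([0.0, 1.0, 0.0])
        right = np.cross(tmp, forward); right /= np.linalg.norm(right)
        up = np.cross(forward, right)
        R = np.stack([right, up, forward])                    # world-to-camera rotation
        T = -R @ center                                       # camera translation
        poses.append((R, T))
    return poses
```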
Specifically, in a case where the first input images include the first composite images in S801, a differentiable renderer (which may also be referred to as a differentiable rendering engine or a differentiable rendering network) may be used to synthesize, according to the three-dimensional model of the first object with texture information input by the user, the 2D images of the 3D model obtained by the camera in the plurality of first poses (the first composite images).
Exemplarily, in the case where the first input images include the first composite images in S801, as in the specific process of the object registration method shown in FIG. 10, S801 may specifically include S801a and S801b.
S801a. Perform differentiable rendering on the three-dimensional model of the first object to obtain a plurality of first composite images.
S801b. Obtain the first input images according to the first composite images.
The first input images obtained in S801b may include real images of the first object and the plurality of first composite images, or the first input images obtained in S801b may include only the plurality of first composite images.
It should be noted that the process of differentiable rendering has been described in detail in the foregoing content and is not repeated here.
S802. Separately extract feature information of the plurality of first input images, where the feature information is used to indicate features of the first object in the first input image in which it is located.
The feature information may include descriptors of local feature points and a descriptor of a global feature.
A descriptor is used to describe a feature and is a data structure that characterizes the feature; a descriptor may have multiple dimensions. Descriptors may be of multiple types, for example SIFT, SURF, or MSER, and the type of descriptor is not limited in the embodiments of the present application. A descriptor may take the form of a multi-dimensional vector.
For example, the descriptor of a local feature point may be a high-dimensional vector characterizing the local image information around that feature point, and the descriptor of the global feature may be a high-dimensional vector characterizing the image information of the entire image or of a larger region.
It should be noted that the definitions and extraction methods of local feature points and global features have been described in the foregoing content and are not repeated here.
Further optionally, the feature information may also include the positions of the local feature points.
In a possible implementation, in S802, the feature information may be extracted from the entire region of each first input image.
In another possible implementation, in S802, the region of the first object may first be determined in each first input image, and the feature information is then extracted from the region of the first object.
The region of the first object may be a region of the first input image that includes only the first object, or the region of the first object may be a region of the first input image that includes the first object, the region being contained within the first input image.
Exemplarily, each first input image may be separately input into the first network, and the region of the first object in each first input image may be determined according to the output of the first network. The first network is used to recognize the pose of the first object in an image, or the first network is used to extract a black-and-white image of the first object in an image.
Optionally, determining the region of the first object in the first input image and then extracting the feature information in the region of the first object may include, but is not limited to, the following two possible implementations:
In the first implementation, the first network is used to recognize the pose of the first object in an image. In S802, the plurality of first input images may be separately input into the first network for pose recognition to obtain the pose of the first object in each first input image; according to the obtained pose of the first object, the three-dimensional model of the first object is projected onto each first input image to obtain a projection region in each first input image (as the region of the first object); and the feature information is extracted from the projection region in each first input image.
In the second implementation, the first network is used to extract the black-and-white image of the first object in an image. In S802, the plurality of first input images may be separately input into the first network to obtain the black-and-white image of the first object in each first input image (as the region of the first object), and the feature information is extracted within the black-and-white image of the first object in each first input image.
Exemplarily, within the region of the first object in each first input image obtained in S802, the pixel positions of visually salient feature points and the corresponding descriptors may be extracted as the feature information of the local feature points, where each descriptor is a multi-dimensional vector; within the region of the first object in each first input image obtained in S802, the feature information of all visual feature points may be extracted and output as one multi-dimensional vector, as the feature information of the global feature.
It should be noted that the algorithm for extracting features is not limited in the embodiments of the present application and may be selected according to actual requirements. A possible extraction scheme is sketched below.
Exemplarily, for the image region shown in FIG. 11, according to the result obtained by inputting the first input image into the first network, the region where the first object is located in the first input image can be determined to be the region shown by the dashed bounding box in FIG. 11. The feature information of visually salient feature points may be extracted in the region shown by the dashed bounding box in FIG. 11, and the feature information of all visual feature points may be extracted in the same region.
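The following is a minimal sketch of S802, assuming OpenCV's SIFT as a stand-in local descriptor (the embodiments do not fix the algorithm) and a simple pooled vector as a stand-in for the global feature descriptor; extraction is restricted to the region of the first object, such as the dashed bounding box of FIG. 11.

```python
import cv2
import numpy as np

def extract_feature_info(image, box):
    # box = (x, y, w, h): region of the first object in the first input image
    x, y, w, h = box
    region = image[y:y + h, x:x + w]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(region, None)
    if descriptors is None:                                  # no salient feature point found in the region
        return [], None
    # local feature information: pixel position (in full-image coordinates) plus descriptor
    local_info = [((kp.pt[0] + x, kp.pt[1] + y), desc)
                  for kp, desc in zip(keypoints, descriptors)]
    # stand-in global feature information: one multi-dimensional vector pooled over all descriptors
    global_info = descriptors.mean(axis=0)
    return local_info, global_info
```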
S803. Associate the feature information extracted from each first input image with the identifier of the first object to register the first object.
Specifically, in S803, the feature information of the first object is recorded in correspondence with the identifier of the first object to complete the registration of the first object, and the registration content of a plurality of objects may be referred to as a multi-object category classifier. The multi-object category classifier records the correspondence between the features and identifiers of different objects. In practical application, the feature information of an object in an image is extracted and compared with the features recorded in the multi-object category classifier to determine the identifier of the object in the image, thereby completing the recognition of the object in the image.
The identifier of the first object may be used to indicate the category of the first object. The form of the identifier is not limited in the embodiments of the present application.
Specifically, in S803, the feature information of the first object in each first input image extracted in S802 and the identifier of the first object may be stored according to a certain structure to complete the registration of the first object.
The feature information and identifiers of the plurality of objects stored in this structure are referred to as a multi-object category classifier, so as to enable efficient lookup.
With the object registration method provided by this application, the features of objects in images are extracted for object registration, the registration time is short, and the recognition performance of other registered objects is not affected; even if many new objects are registered, the registration time does not increase excessively, and the detection performance of the registered objects is also guaranteed.
It should be noted that, with the above object registration method, the first network can be obtained and the registration of objects can be completed offline. In online application, the pose of a recognizable object can be obtained through the first network, the registered identifier of the object can be determined according to the features of the recognizable object, and the virtual content corresponding to the identifier of the recognizable object is then displayed according to the pose of the recognizable object, thereby completing the presentation of the virtual-real fusion effect.
Embodiment 2 of the present application provides an optimization method. The optimization method may be used in combination with the object registration method illustrated in FIG. 8 and the method for training an object pose detection network provided in Embodiment 3 of the present application, or may be used independently; its usage scenario is not limited in the embodiments of the present application. The optimization method provided in Embodiment 2 of the present application is shown in FIG. 12 and includes:
S1201. Input N real images of the first object into the first network respectively, and obtain the second pose of the first object in each real image output by the first network.
N is greater than or equal to 1. The first network is used to recognize the pose of the first object in an image.
S1202. According to the three-dimensional model of the first object, obtain N second composite images by differentiable rendering with a differentiable renderer in each second pose.
In S1202, the differentiable renderer is used to obtain, by differentiable rendering, the N second composite images in the second poses obtained in S1201.
Specifically, the real image from which a second pose was obtained corresponds to the second composite image rendered in that second pose.
S1203. From each real image, respectively intercept the region at the same position as the first object in the corresponding second composite image, as the foreground image of that real image.
Specifically, in S1203, the foreground image of each real image input into the first network in S1201 is intercepted.
It should be understood that the region image at the same position may refer to the region image with the same coordinates, with a certain point of the first object in the second composite image as the reference. In other words, the region image at the same position may refer to the projection region image obtained by projecting the black-and-white image of the second composite image onto the real image, with a certain point of the first object in the second composite image as the reference.
In a possible implementation, the black-and-white image of the first object in the second composite image may be projected onto the real image corresponding to that second composite image, and the projection region serves as the foreground image of the real image.
In another possible implementation, the black-and-white image of the first object in the second composite image may first be obtained, where the black-and-white image is a binarized map; the black-and-white image of the first object in the second composite image is multiplied with the real image corresponding to that second composite image, and the retained region obtained after the multiplication serves as the foreground image of the real image.
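The following is a minimal NumPy sketch of the second implementation of S1203: the black-and-white image of the first object rendered in the second composite image is binarized and multiplied with the corresponding real image, and the surviving pixels form the foreground image. It is an illustrative sketch rather than the exact implementation of the embodiments.

```python
import numpy as np

def crop_foreground(real_image, rendered_mask):
    # binarized black-and-white image of the first object in the second composite image
    binary = (rendered_mask > 0).astype(real_image.dtype)
    if real_image.ndim == 3:          # broadcast the single-channel mask over color channels
        binary = binary[..., None]
    return real_image * binary        # retained region: foreground image of the real image
```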
S1204. Construct a first loss function according to the first difference information between the foreground images of the N real images and their corresponding second composite images.
The first difference information is used to indicate the difference between a foreground image and its corresponding second composite image.
Optionally, the first difference information may include one or more of: a feature map difference, a pixel color difference, and a difference between extracted feature descriptors.
The feature map difference of two images may be the perceptual loss, that is, the difference between the feature maps obtained by encoding the two images with a deep learning network. Specifically, the deep learning network may be a pre-trained visual geometry group (VGG) 16 network. For a given image, the feature map encoded by the VGG16 network is a C*H*W tensor, where C is the number of channels of the feature map, and H and W are the height and width of the feature map. The difference between feature maps is the distance between the two tensors, which may specifically be the L1 norm, the L2 norm, or another norm of the difference of the tensors.
The pixel color difference may be a numerically computed difference of pixel colors, specifically the L2 norm of the difference between the pixels of the two images.
The difference between extracted feature descriptors refers to the distance between the vectors representing the descriptors. For example, the distance may include, but is not limited to, the Euclidean distance, the Hamming distance, and the like. Specifically, a descriptor may be an N-dimensional floating-point vector, in which case the distance is the Euclidean distance between the two N-dimensional vectors, that is, the L2 norm of the difference of the two N-dimensional vectors; or a descriptor may be an M-dimensional binary vector, in which case the distance is the L1 norm of the difference of the two vectors.
The L2 norm referred to above is the square root of the sum of the squares of the elements of a vector, and the L1 norm is the sum of the absolute values of the elements of a vector.
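The following is a minimal PyTorch sketch of the three difference terms, assuming a recent torchvision with a pre-trained VGG16 used as the frozen encoder for the perceptual (feature map) difference; it illustrates the definitions above rather than the exact implementation of the embodiments.

```python
import torch
import torchvision

# frozen pre-trained VGG16 encoder for the feature map (perceptual) difference
vgg_encoder = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.DEFAULT).features.eval()
for p in vgg_encoder.parameters():
    p.requires_grad_(False)

def feature_map_diff(img_a, img_b):          # images as 1x3xHxW tensors
    fa, fb = vgg_encoder(img_a), vgg_encoder(img_b)   # C*H*W feature map tensors
    return torch.norm(fa - fb, p=2)          # L2 norm of the tensor difference

def pixel_color_diff(img_a, img_b):
    return torch.norm(img_a - img_b, p=2)    # L2 norm of the pixel difference

def descriptor_diff(desc_a, desc_b):         # N-dimensional floating-point descriptors
    return torch.norm(desc_a - desc_b, p=2)  # Euclidean distance between descriptor vectors
```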
It should be noted that, if N is greater than 1, the first difference information used when constructing the first loss function in S1204 may be a value computed over the difference information of the multiple foreground images and their corresponding second composite images.
The second composite image corresponding to a foreground image refers to the second composite image obtained by differentiable rendering in the second pose that was obtained by passing the real image to which the foreground image belongs through the first network.
Exemplarily, for the real image(s) I of the first object input by the user, the first network is used to detect the 6DOF pose of the first object in each real image. Then, according to the initial three-dimensional model input by the user and the corresponding texture and lighting information, and based on the detected 6DOF pose of the first object in the real images, the differentiable renderer obtains second composite images R by differentiable rendering (the same number as the real images I). According to each second composite image R, the mask of the first object in that R is obtained (the same number as the real images I); based on the masks, the foreground images F of the first object in the real pictures I are intercepted (one mask is used to intercept the foreground image of its corresponding real image); and the first loss function Loss1 of F and R is constructed as follows: Loss1 = Lp + Li + Lf.
Here, Lp is the computed value of the difference between the feature maps of each foreground image and its corresponding second composite image, Li is the computed value of the pixel color difference between each foreground image and its corresponding second composite image, and Lf is the computed value of the difference between the feature descriptors of each foreground image and its corresponding second composite image.
It should be noted that one or more feature descriptors may be extracted from an image; when multiple feature descriptors are extracted from an image, the difference between the feature descriptors of two images may be a value computed over the differences between the feature descriptors at the same positions.
S1205. Update the differentiable renderer according to the first loss function, so that the composite images output by the differentiable renderer approximate the real images of the object.
Exemplarily, in S1205, the texture, lighting, or other parameters in the differentiable renderer may be updated according to the first loss function constructed in S1204, so as to minimize the difference between the composite images output by the differentiable renderer and the real images.
Specifically, in S1205, the texture, lighting, or other parameters in the differentiable renderer may be adjusted and updated according to a preset rule, and the optimization process illustrated in FIG. 12 is then repeated until the difference between the composite images output by the differentiable renderer and the real images is minimal.
The preset rule may be that, when the first loss function satisfies different preconfigured conditions, different parameters and adjustment values are correspondingly adjusted; after the first loss function is determined in S1204, adjustment is performed according to the corresponding adjustment parameters and adjustment values, depending on which condition is satisfied.
Exemplarily, the optimization process illustrated in FIG. 12 may also be as shown in FIG. 13: the real image of the object is input into the first network; the second pose output by the first network is input into the differentiable renderer; the differentiable renderer, according to the three-dimensional model of the object, outputs by differentiable rendering the second composite image and the black-and-white image of the object in the second composite image; the first loss function is constructed according to the second composite image and the real image; and the differentiable renderer is updated using the first loss function. A gradient-based sketch of this update is given below.
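The following is a minimal PyTorch-style sketch of updating the renderer parameters with the first loss function. The render(texture, lighting, pose) function and the choice of gradient descent on texture and lighting are assumptions for illustration; they are not an API taken from the original text.

```python
import torch

def refine_renderer(render, texture, lighting, poses, foregrounds, loss_fn,
                    steps=100, lr=1e-2):
    # texture and lighting are the renderer parameters to be optimized
    texture = texture.clone().requires_grad_(True)
    lighting = lighting.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([texture, lighting], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Loss1 summed over all (second pose, foreground image) pairs
        loss = sum(loss_fn(render(texture, lighting, pose), fg)
                   for pose, fg in zip(poses, foregrounds))
        loss.backward()      # gradients flow through the differentiable renderer
        optimizer.step()
    return texture.detach(), lighting.detach()
```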
With the optimization method provided by this application, the differentiable renderer (which may also be referred to as a differentiable rendering engine or a differentiable rendering network) can be optimized, the rendering realism of the differentiable renderer can be improved, and the difference between the composite images obtained by differentiable rendering and the real images can be reduced.
The optimization method provided in Embodiment 2 of the present application may specifically be executed by the training device 520 shown in FIG. 5b, and the second composite images in the optimization method may be training data maintained in the database 530 shown in FIG. 5b. Optionally, some or all of S1201 to S1203 in the optimization method provided in Embodiment 2 may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520; that is, the training data received or obtained from the database 530 is first preprocessed, as in the processes described in S1201 to S1203, to obtain the foreground images and the second composite images as the input of the training device 520, and the training device 520 then executes S1204 to S1205.
Optionally, the optimization method provided in Embodiment 2 of the present application may be processed by a CPU, may be jointly processed by a CPU and a GPU, or may use another processor suitable for neural network computation instead of a GPU, which is not limited in this application.
Exemplarily, before acquiring the composite images, the object registration method provided by this application may use the optimization method illustrated in FIG. 12 to optimize the differentiable renderer and reduce the difference between the composite images obtained by differentiable rendering and the real images. FIG. 14 illustrates another object registration method, and the method may include:
S1401. Optimize the differentiable renderer.
The specific operation of S1401 may refer to the process illustrated in FIG. 12 and is not repeated here.
S1402. Input the three-dimensional model of the first object into the optimized differentiable renderer for differentiable rendering to obtain the first composite images.
S1403. Obtain the first input images according to the first composite images.
The specific implementation of S1403 may refer to S801b and is not repeated here.
S1404. Separately extract the feature information of each first input image.
The specific implementation of S1404 refers to the aforementioned S802 and is not repeated here.
S1405. Register the feature information extracted from each first input image.
The specific implementation of S1405 refers to the aforementioned S803 and is not repeated here.
Embodiment 3 of the present application provides a method for training an object pose detection network. By training, the method can obtain the aforementioned first network and improve the accuracy of the pose of the first object output by the first network. Exemplarily, the method for training an object pose detection network may be used in combination with the object registration method illustrated in FIG. 8 and the optimization method provided in Embodiment 2 of the present application, or may be used independently; its usage scenario is not limited in the embodiments of the present application.
Specifically, Embodiment 3 of the present application provides a method for training an object pose detection network that further optimizes, through real images and composite images of the first object, the ability of the object pose detection network to predict the object pose in real images (its generalization).
The method for training an object pose detection network provided in Embodiment 3 of the present application is shown in FIG. 15 and includes:
S1501. Acquire a plurality of second input images including the first object.
The plurality of second input images include real images of the first object and/or a plurality of third composite images of the first object.
In a possible implementation, the second input images may include a plurality of real images of the first object.
In another possible implementation, the second input images may include a plurality of real images of the first object and a plurality of third composite images of the first object. The plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses. The plurality of third poses are different.
Exemplarily, the plurality of third poses may be taken on a sphere, for example a plurality of poses obtained by uniform sampling on the sphere shown in FIG. 9. The density of the sampled poses on the sphere is not limited in the embodiments of the present application and may be selected according to actual requirements.
In yet another possible implementation, the second input images may include a plurality of third composite images of the first object.
Specifically, in S1501, according to the three-dimensional model of the first object with texture information input by the user, 2D images of the first object at different angles may be synthesized in the plurality of third poses through conventional rendering, differentiable rendering, or another rendering method, and are recorded as the third composite images.
S1502. Input the plurality of second input images into the second network respectively for pose recognition, and obtain the fourth pose of the first object in each second input image output by the second network.
The second network may be used to recognize the pose of the first object in an image.
In a possible implementation, the second network may be the basic single-object pose detection network. The basic single-object pose detection network is a general-purpose neural network that recognizes only a single object in an image and is the configured initial model.
Specifically, the input of the basic single-object pose detection network is an image, and the output is the pose of the single object recognizable by the network in that image. Further optionally, the output of the basic single-object pose detection network may also include a black-and-white image (mask) of the single object recognizable by the network.
The mask of an object refers to the black-and-white form of the region of the image that contains only that object.
In another possible implementation, the second network may be a neural network obtained by training the basic single-object pose detection network according to images of the first object.
Exemplarily, since the third composite images are generated by rendering in known third poses, the pose of the first object relative to the camera in each third composite image is known. Each third composite image may be input into the basic single-object pose detection network to obtain the predicted pose of the first object in each third composite image; then, according to the predicted poses and the actual poses of the first object in each third composite image, the loss in the iterative process of the basic single-object pose detection network is computed and a loss function is constructed; according to this loss function, the basic single-object pose detection network is trained until convergence and serves as the second network. It should be noted that the training process of a neural network is not described in detail in this application; a training-loop sketch is given below.
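The following is a minimal PyTorch sketch of training the basic single-object pose detection network on third composite images whose rendering poses are known. The network structure, the pose parameterization, and the MSE pose loss are assumptions for illustration rather than the exact networks and losses of the embodiments.

```python
import torch

def train_pose_network(network, synthetic_images, ground_truth_poses,
                       epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for image, gt_pose in zip(synthetic_images, ground_truth_poses):
            optimizer.zero_grad()
            predicted_pose = network(image.unsqueeze(0))   # predicted pose of the first object
            # loss between the predicted pose and the known rendering pose
            loss = torch.nn.functional.mse_loss(predicted_pose, gt_pose.unsqueeze(0))
            loss.backward()
            optimizer.step()
    return network
```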
In yet another possible implementation, the second network may be the first network currently in use.
S1503. According to the three-dimensional model of the first object, obtain by differentiable rendering a fourth composite image of the first object in each fourth pose.
The second input image from which a fourth pose was obtained corresponds to the fourth composite image rendered in that fourth pose.
S1504. Construct a second loss function according to the second difference information between each fourth composite image and its corresponding second input image.
The second difference information is used to indicate the difference between the fourth composite image and its corresponding second input image.
Exemplarily, the second difference information includes one or more of the following: the difference in intersection over union (IOU) between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in its corresponding second input image; the difference between the fourth pose from which the fourth composite image was obtained and the pose obtained by detecting the fourth composite image with the second network; and the similarity between the fourth composite image and the region image, in its corresponding second input image, at the same position as the first object in the fourth composite image.
The IOU term for two black-and-white images is the intersection over union of the two masks, that is, the ratio of the area of the intersection of the two masks to the area of their union.
The difference between two poses refers to the numerically computed difference of the mathematical expressions of the two poses. Specifically, the difference is the sum of the translation difference and the rotation difference. Given two poses (R1, T1) and (R2, T2), the translation difference is the Euclidean distance between the two vectors T1 and T2, and the rotation difference is

arccos((Tr(R1^T R2) - 1) / 2)

where Tr denotes the trace of a matrix, arccos is the inverse cosine function, and R1^T is the transpose of R1.
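The following is a minimal NumPy sketch of the pose difference defined above: the Euclidean distance of the translations plus the rotation angle arccos((Tr(R1^T R2) - 1)/2); the clipping of the cosine is an added numerical safeguard.

```python
import numpy as np

def pose_difference(R1, T1, R2, T2):
    translation_diff = np.linalg.norm(T1 - T2)                 # Euclidean distance of T1 and T2
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    rotation_diff = np.arccos(np.clip(cos_angle, -1.0, 1.0))   # clip for numerical safety
    return translation_diff + rotation_diff                    # sum of translation and rotation differences
```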
The similarity of two images may be the pixel color difference or the feature map difference of the two images.
The region image of the first object in the second input image may be obtained by cropping the second input image according to the black-and-white image of the first object in its corresponding fourth composite image. For example, the black-and-white image of the first object in the fourth composite image may be a binarized map; multiplying this binarized map with the corresponding second input image crops out the region image of the first object in the second input image.
Optionally, when the number of second input images is M (M is greater than or equal to 2), there are also M fourth composite images, and therefore M pieces of second difference information between the fourth composite images and their corresponding second input images; Li may be a value computed over the second difference information of the fourth composite images and their corresponding second input images.
Specifically, in S1504, the differences between the poses produced by the second network on different second input images are computed and constructed into the second loss function, so that the generalization of the second network on real images can be improved by optimizing these differences.
In a possible implementation, the second loss function Loss2 satisfies the following expression:

Loss2 = λ1·L1 + λ2·L2 + ... + λX·LX, that is, the sum over i = 1, ..., X of λi·Li

where X is greater than or equal to 1, λi is a weight, and Li is a computed value used to represent the second difference information between a fourth composite image and its corresponding second input image. X is less than or equal to the number of kinds of the second difference information.
In a possible implementation, the second loss function Loss2 may be Loss2 = λ1·L1 + λ2·L2 + λ3·L3.
Here, L1 is the computed value of the IOU difference between the black-and-white image of the first object in the second input image output in S1502 and the black-and-white image of the first object in the fourth composite image obtained in S1503; L2 is the computed value of the difference between the pose of the first object in the second input image output in S1502 and the pose of the first object obtained by detecting the fourth composite image with the second network; L3 is the computed value of the visual perception similarity between the fourth composite image and the part of its corresponding second input image in the same region as the first object in the fourth composite image.
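The following is a minimal sketch of assembling Loss2 from the three computed terms with preset weights, together with a simple mask IOU helper; the weight values and the exact form of each term are assumptions for illustration, and the other terms are assumed to be provided by the surrounding pipeline.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    # intersection over union of two binarized black-and-white images
    a, b = mask_a > 0, mask_b > 0
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def second_loss(iou_term, pose_diff_term, similarity_term,
                lambda1=1.0, lambda2=1.0, lambda3=1.0):
    # weighted combination of the computed values L1, L2 and L3
    return lambda1 * iou_term + lambda2 * pose_diff_term + lambda3 * similarity_term
```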
It should be understood that the preset weights λi used when constructing the second loss function may be configured according to actual requirements, which is not limited in the embodiments of the present application.
S1505. Update the second network according to the second loss function to obtain the first network.
The difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in that image is smaller than the difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in that image.
It should be noted that, for the process of updating the second network according to the second loss function in S1505 to obtain the first network, reference may be made to the training process of a neural network, and details are not repeated in the embodiments of the present application.
Exemplarily, the process of the method for training an object pose detection network illustrated in FIG. 15 may also be as shown in FIG. 16: the second input images, composed of unlabeled real images of the object and the third composite images, are input into the second network, which outputs the fourth poses; the fourth poses and the three-dimensional model of the object are input into the differentiable renderer, which outputs the fourth composite images and the corresponding masks; according to the fourth composite images and the corresponding masks, and the results of passing the second input images through the second network, the second loss function is constructed, and the second network is trained using the second loss function to obtain the first network.
With the method for training an object pose detection network provided by this application, a single-object pose detection network is used, so that even if a new recognizable object is added, the training time is short and the recognition effect of other recognizable objects is not affected. In addition, the loss function is constructed from the differences between real images of the object and composite images obtained by differentiable rendering in multiple poses, which improves the accuracy of the single-object pose detection network.
The method for training an object pose detection network provided in Embodiment 3 of the present application may specifically be executed by the training device 520 shown in FIG. 5b, and the composite images in the method may be training data maintained in the database 530 shown in FIG. 5b. Optionally, some or all of S1501 to S1503 in the method for training an object pose detection network provided in Embodiment 3 may be executed in the training device 520, or may be pre-executed by other functional modules before the training device 520; that is, the training data received or obtained from the database 530 is first preprocessed, as in the processes described in S1501 to S1503, to obtain the second input images and the fourth composite images as the input of the training device 520, and the training device 520 then executes S1504 to S1505.
Optionally, the method for training an object pose detection network provided in Embodiment 3 of the present application may be processed by a CPU, may be jointly processed by a CPU and a GPU, or may use another processor suitable for neural network computation instead of a GPU, which is not limited in this application.
Embodiment 4 of the present application further provides a display method applied to a terminal device. The display method may be used in combination with the aforementioned object registration method, optimization method, and method for training an object pose detection network, or may be used alone, which is not specifically limited in the embodiments of the present application.
As shown in FIG. 17, the display method provided by the embodiments of the present application may include:
S1701. The terminal device acquires a first image.
The first image refers to any image acquired by the terminal device.
In a possible implementation, the terminal device may capture the image in the viewfinder frame through a camera as the first image.
Exemplarily, the terminal device may be a mobile phone; the user may start an APP on the mobile phone and input an instruction to acquire an image on the APP interface, and the mobile phone starts the camera to capture the first image.
Exemplarily, the terminal device may be smart glasses; after the user wears the smart glasses, the image in the field of view is captured through the viewfinder frame as the first image.
In another possible implementation, the terminal device may load a locally stored image as the first image.
Of course, the manner in which the terminal acquires the first image in S1701 is not limited in the embodiments of the present application.
S1702. The terminal device determines whether the first image includes a recognizable object.
Specifically, the terminal determines whether the first image contains a recognizable object according to an object recognition network configured offline.
Optionally, the object recognition network may be the current pose detection network, or may be the first network or the second network described in the foregoing embodiments.
In a possible implementation, in S1702, the terminal device may determine whether the first image contains a recognizable object according to the network currently used for recognizing the pose and identifier of an object.
Exemplarily, the terminal device may input the first image into the network used for recognizing the poses and identifiers of objects; if the network outputs the pose and identifier of one or more objects, it is determined that the first image includes a recognizable object; otherwise, the first image does not include a recognizable object.
In another possible implementation, in S1702, the terminal device may determine whether the first image contains a recognizable object according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by registering objects.
Exemplarily, the terminal device may extract the features of the first image and match them against the features in the multi-object category classifier described in the foregoing embodiments of the present application; if a matching feature exists, it is determined that the first image includes a recognizable object; otherwise, the first image does not include a recognizable object.
Exemplarily, in S1702, feature information in the first image may be extracted, where the feature information is used to indicate the recognizable features in the first image; it is then determined whether the feature library contains feature information whose matching distance to the extracted feature information satisfies a preset condition; if the feature library contains feature information whose matching distance to the extracted feature information satisfies the preset condition, it is determined that the first image includes one or more recognizable objects; if not, it is determined that the first image does not include any recognizable object. The feature library stores one or more pieces of feature information of different objects.
In a possible implementation, the preset condition may include being less than or equal to a preset threshold. The value of the preset threshold may be configured according to actual requirements, which is not limited in the embodiments of the present application. A matching sketch is given below.
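The following is a minimal sketch of the check in S1702, assuming the feature library is a dictionary from object identifiers to lists of registered descriptors and that descriptors are NumPy vectors compared by Euclidean distance; it is an illustrative sketch rather than the exact classifier of the embodiments.

```python
import numpy as np

def find_recognizable_objects(query_descriptors, feature_library, threshold):
    recognized = []
    for object_id, registered_descriptors in feature_library.items():
        for query in query_descriptors:
            distances = [np.linalg.norm(query - ref) for ref in registered_descriptors]
            if min(distances) <= threshold:      # matching distance satisfies the preset condition
                recognized.append(object_id)
                break
    return recognized                            # empty list: no recognizable object in the first image
```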
可选的,若S1702中终端设备确定第一图像中包括一个或多个可识别对象,则执行S1703和S1704。若S1702中终端设备确定第一图像中未包括任一可识别对象,则执行S1705。Optionally, if the terminal device determines in S1702 that the first image includes one or more identifiable objects, S1703 and S1704 are performed. If the terminal device determines in S1702 that no identifiable object is included in the first image, S1705 is executed.
S1703、输出第一信息,第一信息用于提示检测到第一图像中包括可识别对象。S1703. Output first information, where the first information is used to prompt detection that the first image includes an identifiable object.
可选的,第一信息可以为文字信息或者语音信息,或者其他形式,本申请实施例对此不予限定。Optionally, the first information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.
示例性的，第一信息的内容可以为"已检测到可识别对象，请保持当前角度"，该第一信息的内容可以通过文字形式叠加显示于第一图像上，或者，该第一信息的内容可以通过终端设备的扬声器播放。Exemplarily, the content of the first information may be "A recognizable object has been detected, please keep the current angle"; the content of the first information may be superimposed and displayed on the first image in text form, or may be played through the speaker of the terminal device.
S1704、通过第一图像包括的每个可识别对象对应的第一网络，获取第一图像中每个可识别对象的位姿；按照每个可识别对象的位姿，显示每个可识别对象对应的虚拟内容。S1704. Obtain the pose of each recognizable object in the first image through the first network corresponding to each recognizable object included in the first image; and display, according to the pose of each recognizable object, the virtual content corresponding to that recognizable object.
其中，对象对应的虚拟内容可以根据实际需求配置，本申请实施例不予限定。例如可以为展览物品的介绍信息，或者，也可以为产品的属性信息，或者其他。The virtual content corresponding to an object may be configured according to actual requirements, which is not limited in the embodiments of the present application. For example, it may be introduction information of an exhibit, attribute information of a product, or other content.
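One way to place the virtual content at the detected pose, sketched below under the assumption of a pinhole camera model, is to transform an anchor point defined in the object coordinate system into the camera frame with the 6DOF pose output by the network and project it with the camera intrinsics; the 4x4 pose format and the intrinsic values are illustrative assumptions rather than details from the embodiments.

```python
import numpy as np

def project_virtual_anchor(pose_obj_to_cam, K, anchor_obj=(0.0, 0.0, 0.0)):
    """pose_obj_to_cam: assumed 4x4 object-to-camera transform from the pose network.
    K: 3x3 camera intrinsic matrix. Returns the pixel at which to draw the overlay."""
    p_cam = pose_obj_to_cam @ np.append(np.asarray(anchor_obj, dtype=float), 1.0)
    u, v, w = K @ p_cam[:3]          # pinhole projection
    return np.array([u / w, v / w])  # pixel position for the virtual content

# Hypothetical intrinsics and an identity pose, just to show the call.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
pixel = project_virtual_anchor(np.eye(4), K, anchor_obj=(0.0, 0.0, 0.5))
```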
S1705、输出第二信息,第二信息用于提示未检测到可识别对象,调整视角以获取第二图像。S1705. Output second information, where the second information is used to prompt that no identifiable object is detected, and adjust the viewing angle to acquire the second image.
其中,第二图像与第一图像不同。Wherein, the second image is different from the first image.
可选的,第二信息可以为文字信息或者语音信息,或者其他形式,本申请实施例对此不予限定。Optionally, the second information may be text information, voice information, or other forms, which are not limited in this embodiment of the present application.
示例性的，第二信息的内容可以为"未检测到可识别对象，请调整设备角度"，该第二信息的内容可以通过文字形式叠加显示于第一图像上，或者，该第二信息的内容可以通过终端设备的扬声器播放。Exemplarily, the content of the second information may be "No recognizable object is detected, please adjust the angle of the device"; the content of the second information may be superimposed and displayed on the first image in text form, or may be played through the speaker of the terminal device.
通过本申请提供的显示方法，将图像中是否包括可识别对象向用户输出提示，使得用户可以直观获取图像中是否包括可识别对象，提高了用户体验。With the display method provided in the present application, a prompt indicating whether the image includes a recognizable object is output to the user, so that the user can intuitively learn whether the image includes a recognizable object, which improves user experience.
示例性的，S1702中终端设备根据本申请前述实施例中描述的第一网络，以及注册对象得到的多对象类别分类器，判断第一图像中是否包含可识别对象的过程，可以如图18所示，该过程可以包括：Exemplarily, the process in which the terminal device in S1702 determines, according to the first network described in the foregoing embodiments of the present application and the multi-object category classifier obtained by object registration, whether the first image contains a recognizable object may be as shown in FIG. 18, and the process may include:
S1801、终端设备提取第一图像中的局部特征点。S1801. The terminal device extracts local feature points in the first image.
需要说明的是,本申请实施例对于S1801中提取局部特征点的具体方案不予限定,但是S1801中终端设备提取局部特征点的方法,需与S803中保持一致,以保证方案准确性。It should be noted that the embodiments of the present application do not limit the specific scheme for extracting local feature points in S1801, but the method for extracting local feature points by the terminal device in S1801 needs to be consistent with that in S803 to ensure the accuracy of the scheme.
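For illustration only, the sketch below uses ORB from OpenCV as one possible local feature extractor; the embodiments do not limit the concrete scheme, and whichever extractor is chosen here must be the same one used during registration in S803.

```python
import cv2

def extract_local_feature_points(image_bgr, max_features=1000):
    """Possible realization of S1801 with ORB (an assumption, not a requirement)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(nfeatures=max_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return keypoints, descriptors  # binary descriptors, typically matched by Hamming distance
```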
S1802、终端设备获取第一图像中的一个或多个第一局部特征点。S1802. The terminal device acquires one or more first local feature points in the first image.
其中,第一局部特征点的描述子与特征库中的局部特征点的描述子的匹配距离小于或等于第一阈值。第一阈值的取值可以根据实际需求配置,本申请实施例对此不予限定。The matching distance between the descriptor of the first local feature point and the descriptor of the local feature point in the feature library is less than or equal to the first threshold. The value of the first threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.
其中，描述子之间的匹配距离，是指表示描述子的向量之间的距离。例如，欧式距离、汉明距离等。The matching distance between descriptors refers to the distance between the vectors representing the descriptors, for example, the Euclidean distance or the Hamming distance.
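For example, the two distances mentioned above can be computed as follows; treating real-valued descriptors with the Euclidean distance and bit-packed binary descriptors with the Hamming distance is a conventional choice assumed only for this sketch.

```python
import numpy as np

def euclidean_distance(d1, d2):
    """Matching distance for real-valued descriptor vectors."""
    return float(np.linalg.norm(np.asarray(d1, dtype=float) - np.asarray(d2, dtype=float)))

def hamming_distance(d1, d2):
    """Matching distance for binary descriptors packed into uint8 arrays."""
    xor = np.bitwise_xor(np.asarray(d1, dtype=np.uint8), np.asarray(d2, dtype=np.uint8))
    return int(np.unpackbits(xor).sum())

# e.g. hamming_distance([0b1010], [0b0011]) == 2
```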
具体的,特征库中存储了不同对象的局部特征点的描述子。例如,该特征库可以为前述实施例中描述的多对象类别分类器。Specifically, descriptors of local feature points of different objects are stored in the feature library. For example, the feature library may be the multi-object category classifier described in the foregoing embodiments.
S1803、终端设备根据第一局部特征点,确定第一图像中一个或多个ROI。S1803. The terminal device determines one or more ROIs in the first image according to the first local feature point.
其中,一个ROI中包括一个对象。Among them, one ROI includes one object.
具体的，终端设备可以结合图像的颜色信息、深度信息，将第一局部特征点分类为不同的对象，将集中分布的同一个对象的第一局部特征点的区域，确定为一个ROI，得到一个或多个ROI。Specifically, the terminal device may combine the color information and depth information of the image to classify the first local feature points into different objects, and determine the region in which the first local feature points of a same object are concentrated as one ROI, thereby obtaining one or more ROIs.
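A minimal sketch of S1803 follows, assuming that spatial clustering of the matched keypoint coordinates (DBSCAN here) is enough to separate the objects; the embodiments additionally allow color and depth information to be combined, and the eps/min_samples values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def rois_from_first_local_feature_points(points_xy, eps=40.0, min_samples=5):
    """points_xy: (N, 2) pixel coordinates of the first local feature points.
    Returns one bounding box (x0, y0, x1, y1) per concentrated cluster, i.e. one ROI per object."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_xy)
    rois = []
    for label in set(labels) - {-1}:           # label -1 marks sparse/noise points
        cluster = points_xy[labels == label]
        x0, y0 = cluster.min(axis=0)
        x1, y1 = cluster.max(axis=0)
        rois.append((int(x0), int(y0), int(x1), int(y1)))
    return rois
```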
S1804、终端设备提取每个ROI中的全局特征。S1804, the terminal device extracts global features in each ROI.
需要说明的是,本申请实施例对于S1804中提取全局特征的具体方案不予限定,但是S1804中终端设备提取全局特征的方法,需与S803中保持一致,以保证方案准确性。It should be noted that the embodiment of the present application does not limit the specific scheme for extracting global features in S1804, but the method for extracting global features by the terminal device in S1804 needs to be consistent with that in S803 to ensure the accuracy of the scheme.
S1805、终端设备根据每个ROI中的全局特征,确定第一图像中是否包含可识别对象。S1805. The terminal device determines whether the first image contains an identifiable object according to the global feature in each ROI.
示例性的,若每个ROI中的全局特征中存在一个或多个第一全局特征,确定第一图像中包括第一全局特征对应的可识别对象。其中,第一全局特征的描述子与特征库中的全局特征的描述子的匹配距离小于或等于第二阈值。特征库中还存储了不同对象的全局特征的描述子。若每个ROI中的全局特征中不存在第一全局特征,确定第一图像未包括任一可识别对象。Exemplarily, if there are one or more first global features in the global features in each ROI, it is determined that the first image includes an identifiable object corresponding to the first global features. The matching distance between the descriptor of the first global feature and the descriptor of the global feature in the feature library is less than or equal to the second threshold. Descriptors of global features of different objects are also stored in the feature library. If the first global feature does not exist in the global features in each ROI, it is determined that the first image does not include any identifiable object.
其中,第二阈值的取值可以根据实际需求配置,本申请实施例对此不予限定。The value of the second threshold may be configured according to actual requirements, which is not limited in this embodiment of the present application.
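The S1805 decision can be sketched as below, assuming that each ROI has already been reduced to a single global descriptor and that the registered global descriptors are held in a dictionary; the metric and the second-threshold value are assumptions made for the example.

```python
import numpy as np

def recognizable_objects_from_rois(roi_global_descriptors, global_library, second_threshold=0.5):
    """roi_global_descriptors: one global descriptor per ROI.
    global_library: dict mapping an object identifier to its registered global descriptor.
    Returns the identifiers of recognized objects; an empty list means no recognizable object."""
    recognized = []
    for g in roi_global_descriptors:
        for object_id, registered in global_library.items():
            if np.linalg.norm(np.asarray(g) - np.asarray(registered)) <= second_threshold:
                recognized.append(object_id)  # this ROI contains a first global feature
                break
    return recognized
```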
下面通过具体的示例,对本申请提供的显示方法进行说明。The display method provided by the present application will be described below through specific examples.
假设某一商家提供了一款APP，用于用户购物时，通过手机取景捕获图像，识别图像中包含的商品，并显示商品对应的产品介绍。在线下对各个商品的三维对象按照本申请实施例提供的方案进行了注册。当某一用户在该商家购物时，使用手机中的该APP捕获到的场景图1如图19示意的手机界面所示，该APP通过图18示意的方法，提取场景图1中的特征，识别出场景图1中包含可识别对象(智能音箱)，并输出"已检测到可识别对象，请保持当前角度"如图20a所示的手机界面。接下来，该APP调用智能音箱对应的第二网络获取该场景图1中该智能音箱的6DOF位姿，然后按照该6DOF位姿显示智能音箱对应的虚拟内容"大家好，我是小艺，能听歌，讲故事，讲笑话，还能懂得无数百科知识的华为智能音箱"，该虚实结合显示内容如图20b所示的手机界面。Suppose a merchant provides an APP that, when a user is shopping, captures an image through the viewfinder of the mobile phone, recognizes the commodities contained in the image, and displays the product introduction corresponding to each commodity. The three-dimensional objects of the commodities are registered offline according to the solutions provided in the embodiments of the present application. When a user shops at the merchant, scene image 1 captured by the APP on the mobile phone is shown in the mobile phone interface illustrated in FIG. 19. The APP extracts the features of scene image 1 by the method illustrated in FIG. 18, recognizes that scene image 1 contains a recognizable object (a smart speaker), and outputs "A recognizable object has been detected, please keep the current angle", as shown in the mobile phone interface in FIG. 20a. Next, the APP invokes the second network corresponding to the smart speaker to obtain the 6DOF pose of the smart speaker in scene image 1, and then displays, according to the 6DOF pose, the virtual content corresponding to the smart speaker: "Hello everyone, I am Xiaoyi, a Huawei smart speaker that can play songs, tell stories, tell jokes, and knows countless encyclopedic facts". The combined virtual-and-real display is shown in the mobile phone interface in FIG. 20b.
上述主要从设备的工作原理的角度对本发明实施例提供的方案进行了介绍。可以理解的是,电子设备等为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本发明能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。The foregoing mainly introduces the solutions provided by the embodiments of the present invention from the perspective of the working principle of the device. It can be understood that, in order to realize the above-mentioned functions, the electronic device or the like includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should easily realize that the present invention can be implemented in hardware or a combination of hardware and computer software in conjunction with the units and algorithm steps of each example described in the embodiments disclosed herein. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.
本发明实施例可以根据上述方法示例对执行本申请方法的装置等进行功能模块的划分，例如，可以对应各个功能划分各个功能模块，也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。需要说明的是，本发明实施例中对模块的划分是示意性的，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式。In the embodiments of the present invention, the apparatus executing the method of the present application may be divided into functional modules according to the foregoing method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware, or may be implemented in the form of a software functional module. It should be noted that the division of modules in the embodiments of the present invention is schematic and is merely a division of logical functions; there may be other division manners in actual implementation.
在采用对应各个功能划分各个功能模块的情况下,图21示意了本申请实施例提供的一种对象注册装置210,用于实现上述实施例一中的功能。如图21所示,对象注册装置210可以包括:第一获取单元2101、提取单元2102以及注册单元2103。第一获取单元2101用于执行图8中的过程S801,或者图10中的S801a和S801b,或者图14中的S1402、S1403;提取单元2102用于执行图8或图10中的过程S802,或者图14中的S1404;注册单元2103用于执行图8或图10中的过程S803,或者图14中的S1405。其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。In the case where each functional module is divided according to each function, FIG. 21 illustrates an object registration apparatus 210 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned first embodiment. As shown in FIG. 21 , the object registration apparatus 210 may include: a first acquisition unit 2101 , an extraction unit 2102 and a registration unit 2103 . The first acquiring unit 2101 is configured to execute the process S801 in FIG. 8 , or S801a and S801b in FIG. 10 , or S1402 and S1403 in FIG. 14 ; the extracting unit 2102 is configured to execute the process S802 in FIG. 8 or FIG. 10 , or S1404 in FIG. 14 ; the registration unit 2103 is configured to execute the process S803 in FIG. 8 or FIG. 10 , or S1405 in FIG. 14 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.
在采用对应各个功能划分各个功能模块的情况下，图22示意了本申请实施例提供的一种优化装置，用于实现上述实施例二中的功能。如图22所示，优化装置220可以包括：处理单元2201、可微渲染器2202、截图单元2203、构建单元2204以及更新单元2205。处理单元2201用于执行图12中的过程S1201；可微渲染器2202用于执行图12中的过程S1202；截图单元2203用于执行图12中的过程S1203；构建单元2204用于执行图12中的过程S1204；更新单元2205用于执行图12中的过程S1205。其中，上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述，在此不再赘述。In the case where each functional module is divided according to each function, FIG. 22 illustrates an optimization apparatus provided by an embodiment of the present application, which is used to implement the functions in the foregoing Embodiment 2. As shown in FIG. 22, the optimization apparatus 220 may include: a processing unit 2201, a differentiable renderer 2202, a screenshot unit 2203, a construction unit 2204, and an update unit 2205. The processing unit 2201 is configured to execute process S1201 in FIG. 12; the differentiable renderer 2202 is configured to execute process S1202 in FIG. 12; the screenshot unit 2203 is configured to execute process S1203 in FIG. 12; the construction unit 2204 is configured to execute process S1204 in FIG. 12; and the update unit 2205 is configured to execute process S1205 in FIG. 12. All related content of the steps involved in the foregoing method embodiments may be incorporated into the functional descriptions of the corresponding functional modules, and details are not repeated here.
在采用对应各个功能划分各个功能模块的情况下，图23示意了本申请实施例提供的一种训练对象位姿检测网络的装置230，用于实现上述实施例三中的功能。如图23所示，训练对象位姿检测网络的装置230可以包括：第二获取单元2301、处理单元2302、可微渲染器2303、构建单元2304以及更新单元2305。第二获取单元2301用于执行图15中的过程S1501；处理单元2302用于执行图15中的过程S1502；可微渲染器2303用于执行图15中的过程S1503；构建单元2304用于执行图15中的过程S1504；更新单元2305用于执行图15中的过程S1505。其中，上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述，在此不再赘述。In the case where each functional module is divided according to each function, FIG. 23 illustrates an apparatus 230 for training an object pose detection network provided by an embodiment of the present application, which is used to implement the functions in the foregoing Embodiment 3. As shown in FIG. 23, the apparatus 230 for training an object pose detection network may include: a second acquisition unit 2301, a processing unit 2302, a differentiable renderer 2303, a construction unit 2304, and an update unit 2305. The second acquisition unit 2301 is configured to execute process S1501 in FIG. 15; the processing unit 2302 is configured to execute process S1502 in FIG. 15; the differentiable renderer 2303 is configured to execute process S1503 in FIG. 15; the construction unit 2304 is configured to execute process S1504 in FIG. 15; and the update unit 2305 is configured to execute process S1505 in FIG. 15. All related content of the steps involved in the foregoing method embodiments may be incorporated into the functional descriptions of the corresponding functional modules, and details are not repeated here.
在采用对应各个功能划分各个功能模块的情况下,图24示意了本申请实施例提供的一种显示装置240,用于实现上述实施例四中的功能。如图24所示,显示装置240可以包括:第一获取单元2401、输出单元1402以及处理单元1403。第一获取单元2401用于执行图17中的过程S1701;输出单元1402用于执行图17中的过程S1703或S1705;处理单元1403用于执行图17中的过程S1704。其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。In the case where each functional module is divided corresponding to each function, FIG. 24 illustrates a display device 240 provided by an embodiment of the present application, which is used to implement the functions in the above-mentioned fourth embodiment. As shown in FIG. 24 , the display device 240 may include: a first acquisition unit 2401 , an output unit 1402 and a processing unit 1403 . The first acquisition unit 2401 is used for executing the process S1701 in FIG. 17 ; the output unit 1402 is used for executing the process S1703 or S1705 in FIG. 17 ; the processing unit 1403 is used for executing the process S1704 in FIG. 17 . Wherein, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.
图25提供了一种装置250的硬件结构示意图。该装置250可以是本申请实施例提供的对象注册装置,用于执行本申请实施例一提供的对象注册方法。或者,该装置250可以是本申请实施例提供的优化装置,用于执行本申请实施例二提供的优化方法。或者,该装置250可以是本申请实施例提供的训练对象位姿检测网络的装置,用于执行本申请实施例三提供的训练对象位姿检测网络的方法。或者,该装置250可以是本申请实施例提供的显示装置,用于执行本申请实施例四提供的显示方法。FIG. 25 provides a schematic diagram of the hardware structure of an apparatus 250 . The apparatus 250 may be an object registration apparatus provided in this embodiment of the present application, and is configured to execute the object registration method provided in the first embodiment of this application. Alternatively, the apparatus 250 may be the optimization apparatus provided in this embodiment of the present application, and is configured to execute the optimization method provided in the second embodiment of the present application. Alternatively, the apparatus 250 may be an apparatus for training an object pose detection network provided in this embodiment of the present application, and configured to execute the method for training an object pose detection network provided in the third embodiment of the present application. Alternatively, the device 250 may be the display device provided in this embodiment of the present application, and configured to execute the display method provided by the fourth embodiment of the present application.
如图25所示,装置250(该装置250具体可以是一种计算机设备)可以包括存储器2501、处理器2502、通信接口2503以及总线2504。其中,存储器2501、处理器2502、通信接口2503通过总线2504实现彼此之间的通信连接。As shown in FIG. 25 , the apparatus 250 (the apparatus 250 may specifically be a computer device) may include a memory 2501 , a processor 2502 , a communication interface 2503 and a bus 2504 . The memory 2501 , the processor 2502 , and the communication interface 2503 are connected to each other through the bus 2504 for communication.
存储器2501可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器2501可以存储程序,当存储器2501中存储的程序被处理器2502执行时,处理器2502和通信接口2503用于执行本申请实施例一至实施例四提供的方法的各个步骤。The memory 2501 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 2501 can store programs, and when the programs stored in the memory 2501 are executed by the processor 2502, the processor 2502 and the communication interface 2503 are used to execute various steps of the methods provided in Embodiments 1 to 4 of this application.
处理器2502可以采用通用的中央处理器(central processing unit,CPU)，微处理器，应用专用集成电路(application specific integrated circuit,ASIC)，图形处理器(graphics processing unit,GPU)或者一个或多个集成电路，用于执行相关程序，以实现本申请实施例的对象注册装置、优化装置、训练对象位姿检测网络的装置、显示装置中的单元所需执行的功能，或者执行本申请方法实施例一至实施例四中任一实施例提供的方法。The processor 2502 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute relevant programs to implement the functions required to be performed by the units in the object registration apparatus, the optimization apparatus, the apparatus for training an object pose detection network, and the display apparatus of the embodiments of the present application, or to execute the method provided in any one of method Embodiments 1 to 4 of the present application.
处理器2502还可以是一种集成电路芯片，具有信号的处理能力。在实现过程中，本申请的训练对象位姿检测网络的方法的各个步骤可以通过处理器2502中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器2502还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器2501，处理器2502读取存储器2501中的信息，结合其硬件完成本申请实施例的对象注册装置、优化装置、训练对象位姿检测网络的装置、显示装置中包括的单元所需执行的功能，或者执行本申请方法实施例一至实施例四中任一实施例提供的方法。The processor 2502 may also be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the method for training an object pose detection network of the present application may be completed by an integrated logic circuit of hardware in the processor 2502 or by instructions in the form of software. The processor 2502 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed with reference to the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2501; the processor 2502 reads the information in the memory 2501 and, in combination with its hardware, completes the functions required to be performed by the units included in the object registration apparatus, the optimization apparatus, the apparatus for training an object pose detection network, and the display apparatus of the embodiments of the present application, or executes the method provided in any one of method Embodiments 1 to 4 of the present application.
通信接口2503使用例如但不限于收发器一类的收发装置,来实现装置250与其他设备或通信网络之间的通信。例如,可以通过通信接口2503获取训练数据(如本申请方法实施例所述的真实图像和合成图像)。The communication interface 2503 uses a transceiver device such as, but not limited to, a transceiver to implement communication between the device 250 and other devices or a communication network. For example, training data (such as real images and synthetic images described in the method embodiments of the present application) can be acquired through the communication interface 2503 .
总线2504可包括在装置250各个部件(例如,存储器2501、处理器2502、通信接口2503)之间传送信息的通路。Bus 2504 may include a pathway for communicating information between various components of device 250 (eg, memory 2501, processor 2502, communication interface 2503).
应理解,对象注册装置210中的第一获取单元2101、提取单元2102以及注册单元2103相当于装置250中的处理器2502。优化装置220中的处理单元2201、可微渲染器2202、截图单元2203、构建单元2204以及更新单元2205相当于装置250中的处理器2502。训练对象位姿检测网络的装置230第二获取单元2301、处理单元2302、可微渲染器2303、构建单元2304以及更新单元2305相当于装置250中的处理器2502。显示装置240中的第一获取单元2401、处理单元1403相当于装置250中的处理器2502,输出单元1402相当于装置250中的通信接口2503。It should be understood that the first acquisition unit 2101 , the extraction unit 2102 and the registration unit 2103 in the object registration apparatus 210 are equivalent to the processor 2502 in the apparatus 250 . The processing unit 2201 , the differentiable renderer 2202 , the screenshot unit 2203 , the constructing unit 2204 and the updating unit 2205 in the optimization device 220 are equivalent to the processor 2502 in the device 250 . The second acquiring unit 2301 , the processing unit 2302 , the differentiable renderer 2303 , the constructing unit 2304 and the updating unit 2305 in the apparatus 230 for training the object pose detection network are equivalent to the processor 2502 in the apparatus 250 . The first acquiring unit 2401 and the processing unit 1403 in the display device 240 are equivalent to the processor 2502 in the device 250 , and the output unit 1402 is equivalent to the communication interface 2503 in the device 250 .
应注意,尽管图25所示的装置250仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置250还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置250还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置250也可仅仅包括实现本申请实施例所必须的器件,而不必包括图25中所示的全部器件。It should be noted that although the apparatus 250 shown in FIG. 25 only shows a memory, a processor, and a communication interface, in the specific implementation process, those skilled in the art should understand that the apparatus 250 also includes other devices necessary for normal operation . Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 250 may further include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 250 may only include the necessary devices for implementing the embodiments of the present application, and does not necessarily include all the devices shown in FIG. 25 .
可以理解,所述装置250相当于图5b中的所述训练设备520或执行设备510。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。It can be understood that the apparatus 250 is equivalent to the training device 520 or the execution device 510 in FIG. 5b. Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序或指令。在计算机上加载和执行所述计算机程序或指令时,全部或部分地执行本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、用户设备或者其它可编程装置。所述计算机程序或指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机程序或指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是集成一个或多个可用介质的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,例如,软盘、硬盘、磁带;也可以是光介质,例如,数字视频光盘(digital video disc,DVD);还可以是半导体介质,例如,固态硬盘(solid state drive,SSD)。In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are executed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, network equipment, user equipment, or other programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer program or instructions may be downloaded from a website site, computer, A server or data center transmits by wire or wireless to another website site, computer, server or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server, data center, or the like that integrates one or more available media. The usable medium can be a magnetic medium, such as a floppy disk, a hard disk, and a magnetic tape; it can also be an optical medium, such as a digital video disc (DVD); it can also be a semiconductor medium, such as a solid state drive (solid state drive). , SSD).
在本申请的各个实施例中,如果没有特殊说明以及逻辑冲突,不同的实施例之间 的术语和/或描述具有一致性、且可以相互引用,不同的实施例中的技术特征根据其内在的逻辑关系可以组合形成新的实施例。In the various embodiments of the present application, if there is no special description or logical conflict, the terms and/or descriptions between different embodiments are consistent and can be referred to each other, and the technical features in different embodiments are based on their inherent Logical relationships can be combined to form new embodiments.
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。在本申请的文字描述中,字符“/”,一般表示前后关联对象是一种“或”的关系;在本申请的公式中,字符“/”,表示前后关联对象是一种“相除”的关系。In this application, "at least one" means one or more, and "plurality" means two or more. "And/or", which describes the relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, it can indicate that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural. In the text description of this application, the character "/" generally indicates that the related objects are a kind of "or" relationship; in the formula of this application, the character "/" indicates that the related objects are a kind of "division" Relationship.
可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定。It can be understood that, the various numbers and numbers involved in the embodiments of the present application are only for the convenience of description, and are not used to limit the scope of the embodiments of the present application. The size of the sequence numbers of the above processes does not imply the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic.

Claims (37)

  1. 一种对象注册方法,其特征在于,所述方法包括:An object registration method, characterized in that the method comprises:
    获取包括第一对象的多个第一输入图像，所述多个第一输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第一合成图像；所述多个第一合成图像是由所述第一对象的三维模型在多个第一位姿下可微渲染得到的；所述多个第一位姿不同；acquiring a plurality of first input images including a first object, where the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are obtained by differentiable rendering of a three-dimensional model of the first object in a plurality of first poses; and the plurality of first poses are different;
    分别提取多个所述第一输入图像的特征信息,所述特征信息用于指示所述第一对象在其所处的第一输入图像中的特征;Extracting feature information of a plurality of the first input images respectively, the feature information is used to indicate the features of the first object in the first input image where it is located;
    将每个所述第一输入图像中提取的所述特征信息,与所述第一对象的标识相对应,进行所述第一对象的注册。The feature information extracted from each of the first input images is corresponding to the identifier of the first object, and the registration of the first object is performed.
  2. 根据权利要求1所述的方法,其特征在于,所述特征信息包括局部特征点的描述子,以及全局特征的描述子。The method according to claim 1, wherein the feature information includes descriptors of local feature points and descriptors of global features.
  3. 根据权利要求1或2所述的方法,其特征在于,分别提取多个所述第一输入图像的特征信息,包括:The method according to claim 1 or 2, wherein extracting feature information of a plurality of the first input images respectively comprises:
    将多个所述第一输入图像分别输入第一网络进行位姿识别，得到每个所述第一输入图像中所述第一对象的位姿；其中，所述第一网络用于识别图像中所述第一对象的位姿；inputting the plurality of first input images into a first network respectively for pose recognition, to obtain the pose of the first object in each of the first input images; wherein the first network is used to recognize the pose of the first object in an image;
    按照获取的所述第一对象的位姿,将所述第一对象的三维模型分别投影至每个所述第一输入图像,得到每个所述第一输入图像中的投影区域;According to the acquired pose of the first object, project the three-dimensional model of the first object to each of the first input images, respectively, to obtain a projection area in each of the first input images;
    分别在每个所述第一输入图像中的投影区域,提取所述特征信息。The feature information is extracted in each of the projection regions in the first input image, respectively.
  4. 根据权利要求1或2所述的方法,其特征在于,分别提取多个所述第一输入图像的特征信息,包括:The method according to claim 1 or 2, wherein extracting feature information of a plurality of the first input images respectively comprises:
    将多个所述第一输入图像分别输入第一网络进行黑白图像提取，获取每个所述第一输入图像中所述第一对象的黑白图像；其中，所述第一网络用于提取图像中所述第一对象的黑白图像；inputting the plurality of first input images into a first network respectively for black and white image extraction, to obtain a black and white image of the first object in each of the first input images; wherein the first network is used to extract a black and white image of the first object in an image;
    分别在每个所述第一输入图像中所述第一对象的黑白图像内,提取所述特征信息。The feature information is extracted from the black and white images of the first object in each of the first input images, respectively.
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-4, wherein the method further comprises:
    将所述第一对象的N个真实图像分别输入第一网络,得到所述第一网络输出的每个真实图像中所述第一对象的第二位姿;所述N大于或等于1;所述第一网络用于识别图像中所述第一对象的位姿;The N real images of the first object are respectively input into the first network to obtain the second pose of the first object in each real image output by the first network; the N is greater than or equal to 1; The first network is used to identify the pose of the first object in the image;
    根据所述第一对象的三维模型，采用可微渲染器在每个所述第二位姿下，可微渲染得到N个第二合成图像；获取一个第二位姿的真实图像与所述一个第二位姿渲染得到的第二合成图像相对应；performing, according to the three-dimensional model of the first object, differentiable rendering with a differentiable renderer in each of the second poses to obtain N second composite images; wherein the real image from which one second pose is obtained corresponds to the second composite image rendered in that second pose;
    分别截取每个所述真实图像中，与真实对象对应的第二合成图像中所述第一对象相同位置的区域，作为每个所述真实图像的前景图像；respectively cropping, from each of the real images, a region at the same position as the first object in the second composite image corresponding to the real object, as a foreground image of each of the real images;
    根据N个所述真实图像的前景图像与其对应的所述第二合成图像的第一差异信息，构建第一损失函数；其中，所述第一差异信息用于指示前景图像与其对应的第二合成图像的差异；constructing a first loss function according to first difference information between the foreground images of the N real images and their corresponding second composite images; wherein the first difference information is used to indicate a difference between a foreground image and its corresponding second composite image;
    根据所述第一损失函数更新所述可微渲染器,使得所述可微渲染器的输出的合成图像逼近对象的真实图像。The differentiable renderer is updated according to the first loss function, so that the synthesized image of the output of the differentiable renderer approximates the real image of the object.
  6. 根据权利要求5所述的方法,其特征在于,所述第一差异信息包括下述信息中一项或多项:特征图差、像素颜色差、提取的特征描述子之差。The method according to claim 5, wherein the first difference information includes one or more of the following information: feature map difference, pixel color difference, and difference between extracted feature descriptors.
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-6, wherein the method further comprises:
    获取包括第一对象的多个第二输入图像，所述多个第二输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第三合成图像；所述多个第三合成图像是由所述第一对象的三维模型在多个第三位姿下渲染得到的；所述多个第三位姿不同；acquiring a plurality of second input images including the first object, where the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; and the plurality of third poses are different;
    将所述多个第二输入图像分别输入第二网络进行位姿识别,得到所述第二网络输出的每个所述第二输入图像中所述第一对象的第四位姿;所述第二网络用于识别图像中所述第一对象的位姿;The plurality of second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each of the second input images output by the second network is obtained; The second network is used to identify the pose of the first object in the image;
    根据第一对象的三维模型，可微渲染获取所述第一对象在每个所述第四位姿下的第四合成图像；获取一个第四位姿的第二输入图像与所述一个第四位姿渲染得到的第四合成图像相对应；obtaining, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object in each of the fourth poses; wherein the second input image from which one fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose;
    根据每个所述第四合成图像与其对应的第二输入图像的第二差异信息，构建第二损失函数；所述第二差异信息用于指示所述第四合成图像与其对应的所述第二输入图像的差异；constructing a second loss function according to second difference information between each fourth composite image and its corresponding second input image; wherein the second difference information is used to indicate a difference between the fourth composite image and its corresponding second input image;
    根据所述第二损失函数更新所述第二网络，得到第一网络；所述第一网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿的差异，小于所述第二网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿差异。updating the second network according to the second loss function to obtain a first network; wherein a difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in the image is smaller than a difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
  8. 根据权利要求7所述的方法,其特征在于,The method of claim 7, wherein:
    所述第二损失函数Loss_2满足如下表达式：
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    所述X大于或等于1，所述λ_i为权值，所述L_i用于表示所述第四合成图像与其对应的所述第二输入图像的第二差异信息的计算值。
    The second loss function Loss_2 satisfies the following expression:
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    where X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the second difference information between the fourth composite image and its corresponding second input image.
  9. 根据权利要求7或8所述的方法，其特征在于，所述第二差异信息包括下述内容中一项或多项：所述第四合成图像中所述第一对象的黑白图像与其对应的第二输入图像中所述第一对象的黑白图像的交并比IOU之差、获取所述第四合成图像的第四位姿与所述第四合成图像经过所述第一网络得到的位姿之差、所述第四合成图像与其对应的第二输入图像中与所述第四合成图像中的所述第一对象相同位置的区域图像的相似度。9. The method according to claim 7 or 8, wherein the second difference information includes one or more of the following: an intersection-over-union (IOU) difference between the black and white image of the first object in the fourth composite image and the black and white image of the first object in its corresponding second input image; a difference between the fourth pose from which the fourth composite image is obtained and the pose obtained by passing the fourth composite image through the first network; and a similarity between the fourth composite image and a region image, in its corresponding second input image, at the same position as the first object in the fourth composite image.
  10. 一种显示方法,其特征在于,所述方法包括:A display method, characterized in that the method comprises:
    获取第一图像;get the first image;
    若所述第一图像中包括一个或多个可识别对象，输出第一信息，所述第一信息用于提示检测到所述第一图像中包括可识别对象；通过所述第一图像包括的每个可识别对象对应的位姿检测网络，获取所述第一图像中每个可识别对象的位姿；按照每个所述可识别对象的位姿，显示每个所述可识别对象对应的虚拟内容；if the first image includes one or more recognizable objects, outputting first information, where the first information is used to prompt that a recognizable object is detected in the first image; obtaining the pose of each recognizable object in the first image through a pose detection network corresponding to each recognizable object included in the first image; and displaying, according to the pose of each recognizable object, virtual content corresponding to each recognizable object;
    若所述第一图像中未包括任一可识别对象，输出第二信息，所述第二信息用于提示未检测到可识别对象，调整视角以获取第二图像，所述第二图像与所述第一图像不同。if the first image does not include any recognizable object, outputting second information, where the second information is used to prompt that no recognizable object is detected and that a viewing angle is to be adjusted to acquire a second image, and the second image is different from the first image.
  11. 根据权利要求10所述的方法,其特征在于,所述方法还包括:The method of claim 10, wherein the method further comprises:
    提取所述第一图像中的特征信息,所述特征信息用于指示所述第一图像中可识别的特征;extracting feature information in the first image, the feature information being used to indicate recognizable features in the first image;
    判断特征库中是否存在与所述特征信息的匹配距离满足预设条件的特征信息;其中,所述特征库中存储了不同对象的一个或多个特征信息;Judging whether there is feature information whose matching distance with the feature information meets a preset condition in the feature library; wherein, the feature library stores one or more feature information of different objects;
    若所述特征库中存在与所述特征信息的匹配距离满足预设条件的特征信息,确定所述第一图像中包括一个或多个可识别对象;If there is feature information whose matching distance with the feature information satisfies a preset condition in the feature library, determine that the first image includes one or more identifiable objects;
    若所述特征库中不存在与所述特征信息的匹配距离满足预设条件的特征信息,确定所述第一图像未包括任一可识别对象。If there is no feature information whose matching distance with the feature information satisfies a preset condition in the feature library, it is determined that the first image does not include any recognizable object.
  12. 根据权利要求10所述的方法,其特征在于,所述方法还包括:The method of claim 10, wherein the method further comprises:
    获取所述第一图像中的一个或多个第一局部特征点，所述第一局部特征点的描述子与特征库中的局部特征点的描述子的匹配距离小于或等于第一阈值；所述特征库中存储了不同对象的局部特征点的描述子；acquiring one or more first local feature points in the first image, where a matching distance between a descriptor of the first local feature point and a descriptor of a local feature point in a feature library is less than or equal to a first threshold; and descriptors of local feature points of different objects are stored in the feature library;
    根据所述第一局部特征点,确定所述第一图像中一个或多个感兴趣区域ROI;一个ROI中包括一个对象;According to the first local feature point, one or more regions of interest ROIs in the first image are determined; one ROI includes one object;
    提取每个所述ROI中的全局特征;extracting global features in each of said ROIs;
    若每个所述ROI中的全局特征中存在一个或多个第一全局特征，确定所述第一图像中包括所述第一全局特征对应的可识别对象；其中，所述第一全局特征的描述子与所述特征库中的全局特征的描述子的匹配距离小于或等于第二阈值；所述特征库中还存储了不同对象的全局特征的描述子；if one or more first global features exist in the global features in each of the ROIs, determining that the first image includes a recognizable object corresponding to the first global feature; wherein a matching distance between a descriptor of the first global feature and a descriptor of a global feature in the feature library is less than or equal to a second threshold; and descriptors of global features of different objects are also stored in the feature library;
    若每个所述ROI中的全局特征中不存在所述第一全局特征,确定所述第一图像未包括任一可识别对象。If the first global feature does not exist in the global features in each of the ROIs, it is determined that the first image does not include any identifiable object.
  13. 根据权利要求10-12任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 10-12, wherein the method further comprises:
    获取包括第一对象的多个第一输入图像，所述多个第一输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第一合成图像；所述多个第一合成图像是由所述第一对象的三维模型在多个第一位姿下可微渲染得到的；所述多个第一位姿不同；acquiring a plurality of first input images including a first object, where the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are obtained by differentiable rendering of a three-dimensional model of the first object in a plurality of first poses; and the plurality of first poses are different;
    分别提取多个所述第一输入图像的特征信息,所述特征信息用于指示所述第一对象在其所处的第一输入图像中的特征;Extracting feature information of a plurality of the first input images respectively, the feature information is used to indicate the features of the first object in the first input image where it is located;
    将每个所述第一输入图像中提取的所述特征信息,与所述第一对象的标识相对应存储于特征库,进行所述第一对象的注册。The feature information extracted from each of the first input images is stored in a feature library corresponding to the identifier of the first object, and the registration of the first object is performed.
  14. 根据权利要求11或13所述的方法,其特征在于,所述特征信息包括局部特征点的描述子,以及全局特征的描述子。The method according to claim 11 or 13, wherein the feature information includes descriptors of local feature points and descriptors of global features.
  15. 根据权利要求10-14任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 10-14, wherein the method further comprises:
    获取包括第一对象的多个第二输入图像，所述多个第二输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第三合成图像；所述多个第三合成图像是由所述第一对象的三维模型在多个第三位姿下渲染得到的；所述多个第三位姿不同；acquiring a plurality of second input images including the first object, where the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; and the plurality of third poses are different;
    将所述多个第二输入图像分别输入第二网络进行位姿识别,得到所述第二网络输出的每个所述第二输入图像中所述第一对象的第四位姿;所述第二网络用于识别图像中所述第一对象的位姿;The plurality of second input images are respectively input into the second network for pose recognition, and the fourth pose of the first object in each of the second input images output by the second network is obtained; The second network is used to identify the pose of the first object in the image;
    根据第一对象的三维模型，可微渲染获取所述第一对象在每个所述第四位姿下的第四合成图像；获取一个第四位姿的第二输入图像与所述一个第四位姿渲染得到的第四合成图像相对应；obtaining, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object in each of the fourth poses; wherein the second input image from which one fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose;
    根据每个所述第四合成图像与其对应的第二输入图像的第二差异信息，构建第二损失函数；所述第二差异信息用于指示所述第四合成图像与其对应的所述第二输入图像的差异；constructing a second loss function according to second difference information between each fourth composite image and its corresponding second input image; wherein the second difference information is used to indicate a difference between the fourth composite image and its corresponding second input image;
    根据所述第二损失函数更新所述第二网络，得到第一网络；所述第一网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿的差异，小于所述第二网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿差异。updating the second network according to the second loss function to obtain a first network; wherein a difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in the image is smaller than a difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
  16. 根据权利要求15所述的方法,其特征在于,The method of claim 15, wherein:
    所述第二损失函数Loss_2满足如下表达式：
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    所述X大于或等于1，所述λ_i为权值，所述L_i用于表示所述第四合成图像与其对应的所述第二输入图像的第二差异信息的计算值。
    The second loss function Loss_2 satisfies the following expression:
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    where X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the second difference information between the fourth composite image and its corresponding second input image.
  17. 根据权利要求15或16所述的方法，其特征在于，所述第二差异信息包括下述内容中一项或多项：所述第四合成图像中所述第一对象的黑白图像与其对应的第二输入图像中所述第一对象的黑白图像的交并比IOU之差、获取所述第四合成图像的第四位姿与所述第四合成图像经过所述第一网络得到的位姿之差、所述第四合成图像与其对应的第二输入图像中与所述第四合成图像中的所述第一对象相同位置的区域图像的相似度。17. The method according to claim 15 or 16, wherein the second difference information includes one or more of the following: an intersection-over-union (IOU) difference between the black and white image of the first object in the fourth composite image and the black and white image of the first object in its corresponding second input image; a difference between the fourth pose from which the fourth composite image is obtained and the pose obtained by passing the fourth composite image through the first network; and a similarity between the fourth composite image and a region image, in its corresponding second input image, at the same position as the first object in the fourth composite image.
  18. 一种对象注册装置,其特征在于,所述装置包括:An object registration device, characterized in that the device comprises:
    第一获取单元，用于获取包括第一对象的多个第一输入图像，所述多个第一输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第一合成图像；所述多个第一合成图像是由所述第一对象的三维模型在多个第一位姿下可微渲染得到的；所述多个第一位姿不同；a first acquisition unit, configured to acquire a plurality of first input images including a first object, where the plurality of first input images include a real image of the first object and/or a plurality of first composite images of the first object; the plurality of first composite images are obtained by differentiable rendering of a three-dimensional model of the first object in a plurality of first poses; and the plurality of first poses are different;
    提取单元，用于分别提取多个所述第一输入图像的特征信息，所述特征信息用于指示所述第一对象在其所处的第一输入图像中的特征；an extraction unit, configured to respectively extract feature information of the plurality of first input images, where the feature information is used to indicate features of the first object in the first input image in which the first object is located;
    注册单元,用于将所述提取单元在每个所述第一输入图像中提取的所述特征信息,与所述第一对象的标识相对应,进行所述第一对象的注册。A registration unit, configured to perform the registration of the first object by corresponding the feature information extracted by the extraction unit in each of the first input images to the identifier of the first object.
  19. 根据权利要求18所述的装置,其特征在于,所述特征信息包括局部特征点的描述子,以及全局特征的描述子。The apparatus according to claim 18, wherein the feature information includes descriptors of local feature points and descriptors of global features.
  20. 根据权利要求18或19所述的装置,其特征在于,所述提取单元具体用于:The device according to claim 18 or 19, wherein the extraction unit is specifically used for:
    将多个所述第一输入图像分别输入第一网络进行位姿识别，得到每个所述第一输入图像中所述第一对象的位姿；其中，所述第一网络用于识别图像中所述第一对象的位姿；inputting the plurality of first input images into a first network respectively for pose recognition, to obtain the pose of the first object in each of the first input images; wherein the first network is used to recognize the pose of the first object in an image;
    按照获取的所述第一对象的位姿,将所述第一对象的三维模型分别投影至每个所述第一输入图像,得到每个所述第一输入图像中的投影区域;According to the acquired pose of the first object, project the three-dimensional model of the first object to each of the first input images, respectively, to obtain a projection area in each of the first input images;
    分别在每个所述第一输入图像中的投影区域,提取所述特征信息。The feature information is extracted in each of the projection regions in the first input image, respectively.
  21. 根据权利要求18或19所述的装置,其特征在于,所述提取单元具体用于:The device according to claim 18 or 19, wherein the extraction unit is specifically used for:
    将多个所述第一输入图像分别输入第一网络进行黑白图像提取，获取每个所述第一输入图像中所述第一对象的黑白图像；其中，所述第一网络用于提取图像中所述第一对象的黑白图像；inputting the plurality of first input images into a first network respectively for black and white image extraction, to obtain a black and white image of the first object in each of the first input images; wherein the first network is used to extract a black and white image of the first object in an image;
    分别在每个所述第一输入图像中所述第一对象的黑白图像内,提取所述特征信息。The feature information is extracted from the black and white images of the first object in each of the first input images, respectively.
  22. 根据权利要求18-21任一项所述的装置,其特征在于,所述装置还包括:The device according to any one of claims 18-21, wherein the device further comprises:
    处理单元,用于将所述第一对象的N个真实图像分别输入第一网络,得到所述第一网络输出的每个真实图像中所述第一对象的第二位姿;所述N大于或等于1;所述第一网络用于识别图像中所述第一对象的位姿;a processing unit, configured to input N real images of the first object into the first network respectively, to obtain the second pose of the first object in each real image output by the first network; the N is greater than or equal to 1; the first network is used to identify the pose of the first object in the image;
    可微渲染器，用于根据所述第一对象的三维模型，在每个所述第二位姿下，可微渲染得到N个第二合成图像；获取一个第二位姿的真实图像与所述一个第二位姿渲染得到的第二合成图像相对应；a differentiable renderer, configured to perform, according to the three-dimensional model of the first object, differentiable rendering in each of the second poses to obtain N second composite images; wherein the real image from which one second pose is obtained corresponds to the second composite image rendered in that second pose;
    截取单元，用于分别截取每个所述真实图像中，与真实对象对应的第二合成图像中所述第一对象相同位置的区域，作为每个所述真实图像的前景图像；an intercepting unit, configured to respectively crop, from each of the real images, a region at the same position as the first object in the second composite image corresponding to the real object, as a foreground image of each of the real images;
    构建单元，用于根据N个所述真实图像的前景图像与其对应的所述第二合成图像的第一差异信息，构建第一损失函数；其中，所述第一差异信息用于指示前景图像与其对应的第二合成图像的差异；a construction unit, configured to construct a first loss function according to first difference information between the foreground images of the N real images and their corresponding second composite images; wherein the first difference information is used to indicate a difference between a foreground image and its corresponding second composite image;
    更新单元,用于根据所述第一损失函数更新所述可微渲染器,使得所述可微渲染器的输出的合成图像逼近对象的真实图像。An update unit, configured to update the differentiable renderer according to the first loss function, so that the synthesized image output by the differentiable renderer approximates the real image of the object.
  23. 根据权利要求22所述的装置,其特征在于,所述第一差异信息包括下述信息中一项或多项:特征图差、像素颜色差、提取的特征描述子之差。The apparatus according to claim 22, wherein the first difference information includes one or more of the following information: difference in feature map, difference in pixel color, and difference between extracted feature descriptors.
  24. 根据权利要求18-23任一项所述的装置,其特征在于,所述装置还包括:The device according to any one of claims 18-23, wherein the device further comprises:
    第二获取单元，用于获取包括第一对象的多个第二输入图像，所述多个第二输入图像包括所述第一对象的真实图像和/或所述第一对象的多个第三合成图像；所述多个第三合成图像是由所述第一对象的三维模型在多个第三位姿下渲染得到的；所述多个第三位姿不同；a second acquisition unit, configured to acquire a plurality of second input images including a first object, where the plurality of second input images include a real image of the first object and/or a plurality of third composite images of the first object; the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses; and the plurality of third poses are different;
    处理单元，用于将所述多个第二输入图像分别输入第二网络进行位姿识别，得到所述第二网络输出的每个所述第二输入图像中所述第一对象的第四位姿；所述第二网络用于识别图像中所述第一对象的位姿；a processing unit, configured to input the plurality of second input images into a second network respectively for pose recognition, to obtain a fourth pose of the first object in each of the second input images output by the second network; wherein the second network is used to recognize the pose of the first object in an image;
    可微渲染器，用于根据第一对象的三维模型，可微渲染获取所述第一对象在每个所述第四位姿下的第四合成图像；获取一个第四位姿的第二输入图像与所述一个第四位姿渲染得到的第四合成图像相对应；a differentiable renderer, configured to obtain, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object in each of the fourth poses; wherein the second input image from which one fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose;
    构建单元，用于根据每个所述第四合成图像与其对应的第二输入图像的第二差异信息，构建第二损失函数；所述第二差异信息用于指示所述第四合成图像与其对应的所述第二输入图像的差异；a construction unit, configured to construct a second loss function according to second difference information between each fourth composite image and its corresponding second input image; wherein the second difference information is used to indicate a difference between the fourth composite image and its corresponding second input image;
    更新单元，用于根据所述第二损失函数更新所述第二网络，得到第一网络；所述第一网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿的差异，小于所述第二网络识别的图像中第一对象的位姿与图像中所述第一对象的真实位姿差异。an update unit, configured to update the second network according to the second loss function to obtain a first network; wherein a difference between the pose of the first object in an image recognized by the first network and the real pose of the first object in the image is smaller than a difference between the pose of the first object in the image recognized by the second network and the real pose of the first object in the image.
  25. 根据权利要求24所述的装置,其特征在于,The apparatus of claim 24, wherein:
    所述第二损失函数Loss_2满足如下表达式：
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    所述X大于或等于1，所述λ_i为权值，所述L_i用于表示所述第四合成图像与其对应的所述第二输入图像的第二差异信息的计算值。
    The second loss function Loss_2 satisfies the following expression:
    Loss_2 = \sum_{i=1}^{X} \lambda_i L_i
    where X is greater than or equal to 1, λ_i is a weight, and L_i represents the calculated value of the second difference information between the fourth composite image and its corresponding second input image.
  26. 根据权利要求24或25所述的装置，其特征在于，所述第二差异信息包括下述内容中一项或多项：所述第四合成图像中所述第一对象的黑白图像与其对应的第二输入图像中所述第一对象的黑白图像的交并比IOU之差、获取所述第四合成图像的第四位姿与所述第四合成图像经过所述第一网络得到的位姿之差、所述第四合成图像与其对应的第二输入图像中与所述第四合成图像中的所述第一对象相同位置的区域图像的相似度。26. The apparatus according to claim 24 or 25, wherein the second difference information includes one or more of the following: an intersection-over-union (IOU) difference between the black and white image of the first object in the fourth composite image and the black and white image of the first object in its corresponding second input image; a difference between the fourth pose from which the fourth composite image is obtained and the pose obtained by passing the fourth composite image through the first network; and a similarity between the fourth composite image and a region image, in its corresponding second input image, at the same position as the first object in the fourth composite image.
  27. A display apparatus, wherein the apparatus comprises a first acquisition unit, an output unit and a processing unit, wherein:
    the first acquisition unit is configured to acquire a first image;
    the output unit is configured to: if the first image includes one or more recognizable objects, output first information, the first information being used to prompt that a recognizable object has been detected in the first image; and if the first image does not include any recognizable object, output second information, the second information being used to prompt that no recognizable object has been detected and that the viewing angle should be adjusted so that the first acquisition unit acquires a second image, the second image being different from the first image;
    the processing unit is configured to: if the first image includes one or more recognizable objects, obtain the pose of each recognizable object in the first image through the pose detection network corresponding to each recognizable object included in the first image, and display, according to the pose of each recognizable object, the virtual content corresponding to that recognizable object.
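For illustration only (not part of the claims): a control-flow sketch of the display apparatus of claim 27. The callables detect, pose_networks, render_virtual and prompt are hypothetical stand-ins for the units described above.

```python
def handle_frame(first_image, detect, pose_networks, render_virtual, prompt):
    """Process one acquired frame: prompt, estimate per-object poses, draw virtual content."""
    objects = detect(first_image)           # recognizable objects in the image (may be empty)
    if not objects:
        prompt("No recognizable object detected; adjust the viewing angle.")
        return False                        # caller then acquires a different second image
    prompt("Recognizable object detected.")
    for obj in objects:
        pose = pose_networks[obj.object_id](first_image)   # per-object pose detection network
        render_virtual(obj.object_id, pose)                # show the virtual content at that pose
    return True
```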
  28. The apparatus according to claim 27, wherein the apparatus further comprises:
    an extraction unit, configured to extract feature information from the first image, the feature information being used to indicate recognizable features in the first image;
    a judging unit, configured to judge whether the feature library contains feature information whose matching distance to the extracted feature information satisfies a preset condition, wherein the feature library stores one or more pieces of feature information of different objects;
    a first determining unit, configured to: if the feature library contains feature information whose matching distance to the extracted feature information satisfies the preset condition, determine that the first image includes one or more recognizable objects; and if the feature library contains no such feature information, determine that the first image does not include any recognizable object.
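For illustration only: a minimal sketch of the judging and determining step, assuming descriptors are numpy vectors and the "preset condition" is a Euclidean-distance threshold (an assumption; the claim does not fix the metric or the threshold value).

```python
import numpy as np

def find_recognizable_object(query_desc, feature_library, max_distance=0.7):
    """Return (found, object_id): is any stored descriptor within the matching distance?

    feature_library: {object_id: [descriptor, ...]} holding numpy vectors.
    max_distance is an assumed instantiation of the preset condition.
    """
    for object_id, descriptors in feature_library.items():
        for stored in descriptors:
            if np.linalg.norm(query_desc - stored) <= max_distance:
                return True, object_id
    return False, None
```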
  29. The apparatus according to claim 27, wherein the apparatus further comprises:
    a second acquisition unit, configured to acquire one or more first local feature points in the first image, wherein the matching distance between the descriptor of a first local feature point and a descriptor of a local feature point in the feature library is less than or equal to a first threshold, and the feature library stores descriptors of local feature points of different objects;
    a second determining unit, configured to determine one or more regions of interest (ROIs) in the first image according to the first local feature points, wherein one ROI includes one object;
    an extraction unit, configured to extract a global feature from each ROI;
    a first determining unit, configured to: if one or more first global features exist among the global features of the ROIs, determine that the first image includes the recognizable objects corresponding to the first global features; and if no first global feature exists among the global features of the ROIs, determine that the first image does not include any recognizable object; wherein the matching distance between the descriptor of a first global feature and a descriptor of a global feature in the feature library is less than or equal to a second threshold, and the feature library further stores descriptors of global features of different objects.
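For illustration only: a sketch of the two-stage matching in claim 29 (local feature points gate the ROIs, global descriptors confirm the object). extract_local, cluster_rois and extract_global are hypothetical helpers, and Euclidean distance with thresholds t1/t2 is an assumed instantiation of the first and second thresholds.

```python
import numpy as np

def detect_recognizable_objects(image, local_lib, global_lib,
                                extract_local, cluster_rois, extract_global,
                                t1=0.8, t2=0.5):
    """Local-then-global matching; returns the ids of recognized objects (may be empty)."""
    # 1) First local feature points: descriptor within t1 of some library descriptor.
    keypoints, descriptors = extract_local(image)
    matched = [kp for kp, d in zip(keypoints, descriptors)
               if min(np.linalg.norm(d - ld) for ld in local_lib) <= t1]
    # 2) Group the matched points into regions of interest, one object per ROI.
    rois = cluster_rois(image, matched)
    # 3) Keep an ROI only if its global descriptor is within t2 of the library.
    found = []
    for roi in rois:
        g = extract_global(roi)
        obj_id, dist = min(((oid, np.linalg.norm(g - gd)) for oid, gd in global_lib.items()),
                           key=lambda item: item[1])
        if dist <= t2:
            found.append(obj_id)
    return found
```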
  30. The apparatus according to any one of claims 27-29, wherein the apparatus further comprises:
    a third acquisition unit, configured to acquire a plurality of first input images including a first object, wherein the plurality of first input images include real images of the first object and/or a plurality of first composite images of the first object, the plurality of first composite images are obtained by differentiable rendering of a three-dimensional model of the first object in a plurality of first poses, and the plurality of first poses are different from one another;
    an extraction unit, configured to extract feature information from each of the plurality of first input images, the feature information being used to indicate features of the first object in the first input image in which it appears;
    a registration unit, configured to store the feature information extracted from each first input image in a feature library in correspondence with an identifier of the first object, so as to register the first object.
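For illustration only: a sketch of the registration step, storing the extracted descriptors under the object's identifier. The library layout and the extract_local / extract_global helpers are assumptions, not structures defined by this application.

```python
feature_library = {}   # {object_id: {"local": [...], "global": [...]}}

def register_object(object_id, first_input_images, extract_local, extract_global):
    """Store feature information from every first input image under the object's identifier."""
    entry = feature_library.setdefault(object_id, {"local": [], "global": []})
    for image in first_input_images:      # real photos and/or differentiably rendered composites
        _, local_descriptors = extract_local(image)
        entry["local"].extend(local_descriptors)
        entry["global"].append(extract_global(image))
    return entry
```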
  31. The apparatus according to claim 28 or 30, wherein the feature information includes descriptors of local feature points and descriptors of global features.
  32. The apparatus according to any one of claims 27-31, wherein the apparatus further comprises:
    a fourth acquisition unit, configured to acquire a plurality of second input images including the first object, wherein the plurality of second input images include real images of the first object and/or a plurality of third composite images of the first object, the plurality of third composite images are obtained by rendering the three-dimensional model of the first object in a plurality of third poses, and the plurality of third poses are different from one another;
    a processing unit, configured to input the plurality of second input images into a second network for pose recognition, to obtain a fourth pose of the first object in each second input image output by the second network, wherein the second network is used to recognize the pose of the first object in an image;
    a differentiable renderer, configured to obtain, by differentiable rendering according to the three-dimensional model of the first object, a fourth composite image of the first object in each of the fourth poses, wherein the second input image from which a fourth pose is obtained corresponds to the fourth composite image rendered in that fourth pose;
    a construction unit, configured to construct a second loss function according to second difference information between each fourth composite image and its corresponding second input image, wherein the second difference information indicates the difference between the fourth composite image and its corresponding second input image;
    an update unit, configured to update the second network according to the second loss function to obtain a first network, wherein the difference between the pose of the first object recognized by the first network in an image and the real pose of the first object in the image is smaller than the difference between the pose of the first object recognized by the second network in the image and the real pose of the first object in the image.
  33. The apparatus according to claim 32, wherein the second loss function $\mathrm{Loss}_2$ satisfies the following expression:
    $\mathrm{Loss}_2 = \sum_{i=1}^{X} \lambda_i L_i$
    where X is greater than or equal to 1, $\lambda_i$ is a weight, and $L_i$ represents a calculated value of the second difference information between a fourth composite image and its corresponding second input image.
  34. The apparatus according to claim 32 or 33, wherein the second difference information comprises one or more of the following: a difference in intersection-over-union (IOU) between the black-and-white image of the first object in the fourth composite image and the black-and-white image of the first object in its corresponding second input image; a difference between the fourth pose from which the fourth composite image is obtained and the pose obtained by passing the fourth composite image through the first network; and a similarity between the fourth composite image and the region of its corresponding second input image that is at the same position as the first object in the fourth composite image.
  35. An electronic device, wherein the electronic device comprises a processor and a memory;
    the memory is connected to the processor; the memory is configured to store computer instructions, and when the processor executes the computer instructions, the electronic device is caused to perform the object registration method according to any one of claims 1-9, or to perform the display method according to any one of claims 10-17.
  36. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the object registration method according to any one of claims 1-9, or to perform the display method according to any one of claims 10-17.
  37. A computer program product that, when run on a computer, causes the computer to perform the object registration method according to any one of claims 1-9, or to perform the display method according to any one of claims 10-17.
PCT/CN2021/140241 2020-12-29 2021-12-21 Object registration method and apparatus WO2022143314A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011607387.9 2020-12-29
CN202011607387.9A CN114758334A (en) 2020-12-29 2020-12-29 Object registration method and device

Publications (1)

Publication Number Publication Date
WO2022143314A1 true WO2022143314A1 (en) 2022-07-07

Family

ID=82260190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140241 WO2022143314A1 (en) 2020-12-29 2021-12-21 Object registration method and apparatus

Country Status (2)

Country Link
CN (1) CN114758334A (en)
WO (1) WO2022143314A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168312B (en) * 2023-02-23 2023-09-08 兰州交通大学 AR auxiliary assembly three-dimensional registration method and system under complex scene end-to-end

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147642A1 (en) * 2017-11-15 2019-05-16 Google Llc Learning to reconstruct 3d shapes by rendering many 3d views
US20190304134A1 (en) * 2018-03-27 2019-10-03 J. William Mauchly Multiview Estimation of 6D Pose
EP3579196A1 (en) * 2018-06-05 2019-12-11 Cristian Sminchisescu Human clothing transfer method, system and device
CN110574366A (en) * 2017-05-24 2019-12-13 古野电气株式会社 Image generation device
CN111510701A (en) * 2020-04-22 2020-08-07 Oppo广东移动通信有限公司 Virtual content display method and device, electronic equipment and computer readable medium
CN111783986A (en) * 2020-07-02 2020-10-16 清华大学 Network training method and device and posture prediction method and device
CN111832360A (en) * 2019-04-19 2020-10-27 北京三星通信技术研究有限公司 Prompt message processing method and device, electronic equipment and readable storage medium
CN111950477A (en) * 2020-08-17 2020-11-17 南京大学 Single-image three-dimensional face reconstruction method based on video surveillance

Also Published As

Publication number Publication date
CN114758334A (en) 2022-07-15

Similar Documents

Publication Publication Date Title
KR102222642B1 (en) Neural network for object detection in images
Betancourt et al. The evolution of first person vision methods: A survey
CN111738122B (en) Image processing method and related device
CN111782879B (en) Model training method and device
CN110135336B (en) Training method, device and storage medium for pedestrian generation model
WO2021078001A1 (en) Image enhancement method and apparatus
US20220309836A1 (en) Ai-based face recognition method and apparatus, device, and medium
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN111930964B (en) Content processing method, device, equipment and storage medium
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
WO2021190078A1 (en) Method and apparatus for generating short video, and related device and medium
WO2022179604A1 (en) Method and apparatus for determining confidence of segmented image
WO2024021742A1 (en) Fixation point estimation method and related device
US11145088B2 (en) Electronic apparatus and method for controlling thereof
CN111753498A (en) Text processing method, device, equipment and storage medium
WO2022156473A1 (en) Video playing method and electronic device
WO2022143314A1 (en) Object registration method and apparatus
CN113449548A (en) Method and apparatus for updating object recognition model
CN113538227A (en) Image processing method based on semantic segmentation and related equipment
CN113642359B (en) Face image generation method and device, electronic equipment and storage medium
CN112037305A (en) Method, device and storage medium for reconstructing tree-like organization in image
CN115661941A (en) Gesture recognition method and electronic equipment
WO2022068522A1 (en) Target tracking method and electronic device
CN114399622A (en) Image processing method and related device
CN114970576A (en) Identification code identification method, related electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914053

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21914053

Country of ref document: EP

Kind code of ref document: A1