CN112085789A - Pose estimation method, device, equipment and medium


Info

Publication number
CN112085789A
Authority
CN
China
Prior art keywords
feature
target image
block
resolution
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010802444.2A
Other languages
Chinese (zh)
Inventor
张能波
王磊
程俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010802444.2A priority Critical patent/CN112085789A/en
Publication of CN112085789A publication Critical patent/CN112085789A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The application is suitable for the technical field of computers and provides a pose estimation method that comprises the following steps: acquiring a target image presenting a target object; inputting the target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points; determining attitude parameters of the target object based on the coordinate information of the control points, wherein the attitude parameters are used for conversion between a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system taking the center of the target object as the origin, and the second coordinate system is the coordinate system of the camera that collects the target image; and determining the pose of the target object according to the attitude parameters. Because the attitude parameters for conversion between the coordinate system of the target object and the coordinate system of the camera are determined from the coordinates of the control points, the relative position and the rotation angle of the target object in the coordinate system of the camera can be obtained, so that an accurate pose of the target object can be obtained.

Description

Pose estimation method, device, equipment and medium
Technical Field
The application belongs to the technical field of computers, and particularly relates to a pose estimation method, a pose estimation device, pose estimation equipment and a pose estimation medium.
Background
A smart device, such as a smart robot, generally needs to know the position of an object in its environment so that it can interact with that object based on its position. In the related art, the smart device typically uses a sensor installed on itself, such as a laser sensor, to determine the position of an object in the environment. However, such a sensor usually yields only the distance and angle of the object as a whole relative to the smart device and cannot provide the rotation angle of the object, so accurate interaction between the smart device and the object in the environment cannot be achieved.
Therefore, in the related art, in order to realize accurate interaction between the smart device and the object in the environment, a more accurate position of the object in the environment needs to be obtained.
Disclosure of Invention
The embodiments of the application provide a pose estimation method, a pose estimation device, pose estimation equipment and a pose estimation medium, aiming to solve the problem in the related art that the object position obtained by a smart device using a sensor installed on the device is not accurate enough.
In a first aspect, an embodiment of the present application provides a pose estimation method, where the method includes:
acquiring a target image presenting a target object;
inputting a target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points, wherein the plurality of control points are used for describing the outline of a target object, the processing of the target image by the control point prediction model comprises extracting multi-scale feature information from the target image and determining the coordinate information of the plurality of control points based on the multi-scale feature information, and the multi-scale feature information is information extracted based on a plurality of preset resolutions and used for describing the features of the target image;
determining attitude parameters of the target object based on coordinate information of the control points, wherein the attitude parameters are used for conversion between a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system taking the center of the target object as an origin, and the second coordinate system is a coordinate system of a camera for collecting a target image;
and determining the pose of the target object according to the pose parameters.
Further, extracting multi-scale feature information from the target image, including:
extracting N characteristic blocks with different resolutions from a target image;
and converting the N characteristic blocks with different resolutions into N characteristic blocks with the same resolution to obtain multi-scale characteristic information, wherein N is an integer greater than 1.
Further, N is 3, the extracted feature blocks are respectively a feature block of a first resolution, a feature block of a second resolution, and a feature block of a third resolution, the first resolution is greater than the second resolution, and the second resolution is greater than the third resolution; and
converting the N feature blocks with different resolutions into N feature blocks with the same resolution, comprising:
downsampling the feature block with the first resolution to obtain a first feature block;
performing 1 × 1 convolution on the feature block of the second resolution to obtain a second feature block;
up-sampling the feature block of the third resolution to obtain a third feature block;
the multi-scale feature information comprises a first feature block, a second feature block and a third feature block, and the resolutions of the first feature block, the second feature block and the third feature block are the same.
Further, the multi-scale feature information comprises N feature blocks with the same resolution; and
determining coordinate information of a plurality of control points based on the multi-scale feature information, including:
extracting the global maximum pooling value of each feature block in the N feature blocks with the same resolution;
and determining coordinate information of a plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution.
Further, N is 3, the N feature blocks with the same resolution are respectively a first feature block, a second feature block and a third feature block, the first feature block includes i feature maps, the second feature block includes j feature maps, the third feature block includes k feature maps, and i, j and k are integers greater than or equal to 1; and
extracting a global maximum pooling value of each feature block in the N feature blocks with the same resolution, comprising:
acquiring the maximum numerical value of each characteristic diagram in the i characteristic diagrams of the first characteristic block, and determining the average value of the maximum numerical values of the i characteristic diagrams as the global maximum pooling value of the first characteristic block;
acquiring the maximum numerical value of each characteristic diagram in the j characteristic diagrams of the second characteristic block, and determining the average value of the maximum numerical values of the j characteristic diagrams as the global maximum pooling value of the second characteristic block;
and acquiring the maximum numerical value of each characteristic diagram in the k characteristic diagrams of the third characteristic block, and determining the average value of the maximum numerical values of the k characteristic diagrams as the global maximum pooling value of the third characteristic block.
Further, determining coordinate information of a plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution, includes:
carrying out linear activation processing operation on a vector consisting of the global maximum pooling values of the feature blocks to obtain first result data;
performing full connection operation on the first result data to obtain second result data;
and obtaining coordinate information of the plurality of control points according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data.
Further, obtaining coordinate information of a plurality of control points according to the features of the target image described by the product of the N feature blocks with the same resolution and the second result data, includes:
determining the object type and the object center point coordinates of the target object according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data;
and determining the coordinates of a plurality of control points of the target object according to the object type, the coordinates of the center point of the object and a preset relative position relationship corresponding to the object type, wherein the relative position relationship is used for describing the position relationship between the coordinates of each control point of the target object and the coordinates of the center point of the object.
Further, the control point prediction model is obtained by training the following steps:
acquiring a training sample set, wherein training samples in the training sample set comprise sample images and labeled coordinates of control points of objects represented by the sample images;
selecting training samples from a set of training samples, and performing the following training steps: inputting a sample image of the selected training sample into an initial model to obtain actual output; inputting the marking coordinates of the control points of the object presented by the input sample image and the obtained actual output into a preset loss function to obtain a loss value; in response to the fact that the loss value is smaller than a preset loss threshold value, taking the initial model as a trained control point prediction model;
and responding to the loss value being larger than or equal to the preset loss threshold value, adjusting the parameters of the initial model to obtain an adjusted initial model, taking the adjusted initial model as the initial model, selecting unselected training samples from the training sample set, and continuing to execute the training step.
In a second aspect, an embodiment of the present application provides a pose estimation apparatus, including:
the image acquisition unit is used for acquiring a target image presenting a target object;
the coordinate prediction unit is used for inputting a target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points, wherein the plurality of control points are used for describing the outline of a target object, the processing of the target image by the control point prediction model comprises the steps of extracting multi-scale feature information from the target image and determining the coordinate information of the plurality of control points based on the multi-scale feature information, and the multi-scale feature information is information which is extracted based on a plurality of preset resolutions and used for describing the features of the target image;
the parameter determining unit is used for determining attitude parameters of the target object based on the coordinate information of the control points, wherein the attitude parameters are used for converting a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system taking the center of the target object as an origin, and the second coordinate system is a coordinate system of a camera for acquiring a target image;
and the pose determining unit is used for determining the pose of the target object according to the pose parameters.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the pose estimation method when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the pose estimation method.
In a fifth aspect, embodiments of the present application provide a computer program product which, when run on an electronic device, causes the electronic device to execute the pose estimation method described in the first aspect.
Compared with the related technology, the embodiment of the application has the beneficial effects that: and inputting the acquired target image into a pre-trained control point prediction model, and predicting the coordinates of a plurality of control points of the target object presented by the target image through the control point prediction model. Then, a pose parameter for conversion between the coordinate system of the target object and the coordinate system of the camera is determined by predicting the obtained coordinates of the plurality of control points, thereby determining the pose of the object based on the pose parameter. Because the distribution of each control point meets the distribution characteristics of the target object in the coordinate system, the posture parameters converted between the coordinate system of the target object and the coordinate system of the camera are determined through the coordinates of the control points, the relative position and the rotation angle of the target object in the coordinate system of the camera can be obtained, and the method is favorable for obtaining the more accurate posture of the target object. In addition, different features can be extracted from the target image based on different resolutions, and rich and accurate features can be extracted from the target image based on multiple resolutions, so that the control point prediction model predicts the coordinate information of the control point by adopting multi-scale feature information extracted based on multiple preset resolutions, the predicted coordinate information of the control point can be more accurate, and the accuracy of the pose prediction of the target object can be further improved.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic flowchart of a pose estimation method according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an effect of control points for a target image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a relationship between a coordinate system of an object and a coordinate system of a camera according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a process for extracting multi-scale feature information according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a pose estimation method according to another embodiment of the present application;
fig. 6 is a schematic flowchart of a pose estimation method according to another embodiment of the present application;
fig. 7 is a network structure diagram of a control point prediction model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a pose estimation apparatus provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
It should also be appreciated that reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The application provides a pose estimation method, which can extract abundant and accurate features from a target image by extracting the features from different resolutions, so that coordinate information of a control point obtained by prediction is more accurate, and the accuracy of pose prediction of a target object is improved.
In order to explain the technical means of the present application, the following examples are given below.
Example one
Referring to fig. 1, an embodiment of the present application provides a pose estimation method, including:
step 101, acquiring a target image presenting a target object.
The target object may be various objects. As an example, the target object may be a beverage bottle, a stone, a person, a dog, or the like.
In this embodiment, the execution subject of the pose estimation method may be an intelligent device, such as an intelligent robot, an unmanned vehicle, or the like. The execution subject may capture the target image through a camera installed on the smart device.
And 102, inputting the target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points.
Wherein the plurality of control points are used for describing the outline of the target object. Here, the control points are points that describe the outline of the target object, and there are usually several of them, for example 4 or 8. For example, if the target object is a rectangular parallelepiped, there may be 8 control points, namely the 8 vertices of the rectangular parallelepiped.
The control point prediction model may be configured to analyze a correspondence between the target image and coordinate information of a plurality of control points of the target object represented by the target image.
The processing of the target image by the control point prediction model includes extracting multi-scale feature information from the target image and determining coordinate information of a plurality of control points based on the multi-scale feature information. The multi-scale feature information is extracted based on multiple preset resolutions and is used for describing features of the target image.
In practical applications, after a convolutional layer performs convolution processing on an image, it can generally extract one or more feature maps with the same resolution that describe the features of the image, and the resolution of the feature maps extracted by a given convolutional layer is usually fixed. Therefore, the control point prediction model may extract the features of the target image using a convolutional layer whose output feature maps have the preset resolution. When there are multiple preset resolutions, for each preset resolution, a convolutional layer whose output feature maps have that resolution may be used to extract features of the target image.
In practice, when extracting features of a target image based on multiple preset resolutions, the feature maps with different resolutions output by each convolutional layer may be directly used as the multi-scale feature information, or after processing the feature maps with different resolutions output by each convolutional layer according to a preset processing rule, each processed feature map may be used as the multi-scale feature information.
The preset processing rule may be a preset rule for processing feature maps with different resolutions. As an example, the preset processing rule may be a rule for converting feature maps having different resolutions into feature maps having the same resolution. Note that, if the preset processing rule is such a rule, then in order to reduce the computing resources required for the conversion, the resolution of each feature map is usually converted to the resolution of one of the feature maps. For example, suppose the resolution of the feature map output by the first convolutional layer is 40 × 40, the resolution of the feature map output by the second convolutional layer is 20 × 20, and the resolution of the feature map output by the third convolutional layer is 10 × 10. In this case, the resolution of each feature map may be converted to 40 × 40, to 20 × 20, or to 10 × 10.
In practice, the plurality of convolutional layers in the control point prediction model that extract features of the target image at a plurality of preset resolutions usually form a convolutional network. Optionally, the convolutional network is a Darknet network. When the convolutional network is a Darknet network, since the resolution of the feature map output by each convolutional layer is known, the corresponding convolutional layers can be selected based on the preset resolutions. For example, if the preset resolutions are 52 × 52, 26 × 26, and 13 × 13, then the resolution of the feature map output by the 8th convolutional layer of the Darknet network is 52 × 52, the resolution of the feature map output by the 13th convolutional layer is 26 × 26, and the resolution of the feature map output by the 23rd convolutional layer is 13 × 13. In this case, the feature maps output by the 8th, 13th and 23rd convolutional layers of the Darknet network can be extracted.
It should be noted that, because the Darknet network generally has the advantages of high speed, a low background false-detection rate and high generality, using a Darknet network to extract the features of the target image helps to extract, more quickly, feature maps that accurately describe the features of the target object.
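Purely as an illustration, and not as part of the original disclosure, the following PyTorch-style sketch shows one way intermediate feature maps could be tapped from a convolutional backbone at the layer indices mentioned above (8, 13 and 23); the backbone itself, the class name and the indices are assumptions for the example.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Runs a backbone and collects feature maps at chosen layer indices.

    `backbone_layers` is assumed to be an iterable of nn.Module blocks
    (e.g. the convolutional stages of a Darknet-style network); the tap
    indices (8, 13, 23) follow the example in the text and are only
    illustrative.
    """
    def __init__(self, backbone_layers, tap_indices=(8, 13, 23)):
        super().__init__()
        self.layers = nn.ModuleList(backbone_layers)
        self.tap_indices = set(tap_indices)

    def forward(self, x):
        taps = []
        for idx, layer in enumerate(self.layers):
            x = layer(x)
            if idx in self.tap_indices:
                taps.append(x)  # e.g. the 52x52, 26x26 and 13x13 feature blocks
        return taps
```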
After extracting the multi-scale feature information of the target image, the control point prediction model may obtain the coordinate information of the plurality of control points of the target object by inputting the extracted multi-scale feature information into a pooling layer, a linear activation layer, a fully connected layer, and the like. There may be one or more pooling layers, one or more linear activation layers, and one or more fully connected layers.
It can be understood that the control point prediction model may be trained in advance by the intelligent device, or a file corresponding to the control point prediction model may be migrated to the intelligent device after being trained in advance by another device. That is, the execution subject for training the control point prediction model may be the same as or different from the execution subject for performing control point prediction using the control point prediction model. For example, when the initial model is trained by other equipment, after the initial model is trained by other equipment, the model parameters of the initial model are fixed to obtain a file corresponding to the control point prediction model. And then migrate the file to the smart device.
In this embodiment, when obtaining the target image, the executing entity may obtain coordinate information of the plurality of control points of the target object by inputting the target image into a control point prediction model trained in advance.
Fig. 2 is a schematic diagram illustrating the effect of the control points of the object represented by a certain target image obtained by the target image. As shown in fig. 2, the target image is an image including a beverage bottle B, P1-P8 in the figure are 8 control points of the beverage bottle predicted by the control point prediction model, and P0 is a center point of the beverage bottle B.
And 103, determining the attitude parameters of the target object based on the coordinate information of the plurality of control points.
The attitude parameters are used for conversion between a first coordinate system and a second coordinate system, where the first coordinate system is a coordinate system taking the center of the target object as the origin, and the second coordinate system is the coordinate system of the camera that collects the target image.
the pose parameters typically include a rotation matrix and an offset vector.
The coordinate system of the target object is usually a coordinate system with the center of the target object as an origin. The coordinate system of the camera is usually a coordinate system with the center of the camera as the origin.
Fig. 3 shows a schematic view of the relation between the coordinate system of the object and the coordinate system of the camera. As shown in fig. 3, for any point on the object, for example a vertex of the rectangular parallelepiped, if the center of the object is taken as the origin of coordinates, the coordinate of the point in the coordinate system of the object can be denoted P_object; if the center of the camera is taken as the origin of coordinates, the coordinate of the point in the coordinate system of the camera can be denoted P_camera. The coordinates of the same point differ in different coordinate systems, and the same point generally satisfies the following coordinate transformation relationship between them:
P_camera = R · P_object + t
where R is a rotation matrix from the coordinate system of the object to the coordinate system of the camera, and t is an offset vector from the coordinate system of the object to the coordinate system of the camera.
In this embodiment, after obtaining the coordinates of the plurality of control points of the target object, the executing body may determine the attitude parameters of the target object by using a related technology or a technology developed in the future. As an example, the executing subject may first scan the target object with the laser radar sensor to obtain the point cloud data of the target object. And then, determining points corresponding to the control points of the target object in the point cloud data. And finally, substituting the coordinates of the point corresponding to the control point of the target object in the point cloud data and the coordinates of the control point into the coordinate conversion relation, and calculating to obtain the attitude parameter of the target object. It should be noted that the point cloud data generally refers to a set of three-dimensional coordinate points in the coordinate system of the camera, each corresponding to a point on the outer surface of the object. The set of three-dimensional coordinate points may be used to represent the shape of the external surface of an object.
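The text leaves the concrete solver open ("a related technology or a technology developed in the future"). As one hedged possibility only, the following sketch estimates the rotation matrix R and offset vector t from matched control points expressed in the object coordinate system and in the camera coordinate system (for example, the corresponding points found in the point cloud), using the classical SVD-based rigid alignment; it is not claimed to be the method of the filing.

```python
import numpy as np

def estimate_pose(points_object, points_camera):
    """Least-squares rigid alignment: returns R, t such that
    points_camera ≈ R @ points_object + t.

    Both inputs are (m, 3) arrays of matched control points, one set in
    the object frame and one in the camera frame (e.g. taken from the
    point cloud). This is the standard Kabsch/Umeyama solution, shown
    only as an illustrative substitute for the unspecified solver.
    """
    p_obj = np.asarray(points_object, dtype=float)
    p_cam = np.asarray(points_camera, dtype=float)
    mu_obj, mu_cam = p_obj.mean(axis=0), p_cam.mean(axis=0)
    H = (p_obj - mu_obj).T @ (p_cam - mu_cam)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                             # proper rotation (det = +1)
    t = mu_cam - R @ mu_obj
    return R, t
```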
And step 104, determining the pose of the target object according to the pose parameters.
In this embodiment, after obtaining the pose parameters of the target object, the executing entity may directly substitute the coordinates of each control point into the coordinate transformation relationship, and calculate the pose of the target object. The pose of the target object generally refers to three-dimensional coordinates and a rotation angle of the target object in a coordinate system of the camera.
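Continuing the illustrative sketch above (again an assumption, not part of the filing), once R and t are known, the pose can be read out as the position of the object origin in the camera frame plus rotation angles obtained from R, for instance Euler angles in a ZYX convention:

```python
import numpy as np

def pose_from_parameters(R, t):
    """Illustrative readout of the pose: the translation t gives the
    position of the object origin in the camera frame, and R is
    converted to (roll, pitch, yaw) Euler angles (ZYX convention).
    The angle convention is an assumption made for this example only."""
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:
        roll = np.arctan2(R[2, 1], R[2, 2])
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = np.arctan2(R[1, 0], R[0, 0])
    else:  # near gimbal lock
        roll = np.arctan2(-R[1, 2], R[1, 1])
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = 0.0
    return t, (roll, pitch, yaw)
```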
In this embodiment, the executing subject may input the acquired target image into a control point prediction model trained in advance, and predict coordinates of a plurality of control points of the target object represented by the target image through the control point prediction model. Then, a pose parameter for conversion between the coordinate system of the target object and the coordinate system of the camera is determined by predicting the obtained coordinates of the plurality of control points, thereby determining the pose of the object based on the pose parameter. Because the distribution of each control point meets the distribution characteristics of the target object in the coordinate system, the posture parameters converted between the coordinate system of the target object and the coordinate system of the camera are determined through the coordinates of the control points, the relative position and the rotation angle of the target object in the coordinate system of the camera can be obtained, and the method is favorable for obtaining the more accurate posture of the target object. In addition, different features can be extracted from the target image based on different resolutions, and rich and accurate features can be extracted from the target image based on multiple resolutions, so that the control point prediction model predicts the coordinate information of the control point by adopting multi-scale feature information extracted based on multiple preset resolutions, the predicted coordinate information of the control point can be more accurate, and the accuracy of the pose prediction of the target object can be further improved.
In some optional implementations of the embodiment, the extracting multi-scale feature information from the target image includes: n feature blocks with different resolutions are extracted from a target image. And converting the N characteristic blocks with different resolutions into N characteristic blocks with the same resolution to obtain multi-scale characteristic information, wherein N is an integer greater than 1.
The feature block is usually a set of a plurality of feature maps having the same resolution. Here, the multi-scale feature information is N feature blocks obtained by conversion and having the same resolution.
Fig. 4 shows a schematic diagram of a process of extracting multi-scale feature information. As shown in fig. 4, a plurality of feature blocks with different resolutions, i.e., a feature block 401A, a feature block 402A, and a feature block 403A, can be extracted from the target image 400. Then, the feature blocks 401A, 402A, and 403A with different resolutions may be converted into feature blocks with the same resolution, so as to obtain a plurality of feature blocks with the same resolution, which are the feature block 401B, 402B, and 403B, respectively. In this case, the obtained feature blocks 401B, 402B, and 403B are the multi-scale feature information.
It is to be noted that, in order to facilitate the resolution conversion calculation, the resolutions of the N feature blocks extracted from the target image generally have an integer multiple relationship. As an example, if N is 2, 2 feature blocks may be extracted from the target image, and if the resolution of one of the feature blocks is 10 × 10, the resolution of the other feature block may be 20 × 20.
According to the implementation mode, the characteristic blocks with different resolutions are extracted, so that richer characteristics of the target image can be extracted. And then, the feature blocks with different resolutions are converted into the feature blocks with the same resolution, so that the comprehensive analysis of the features of the feature blocks with various resolutions can be realized, and the more accurate feature extraction of the target image is facilitated.
In some optional implementations of this embodiment, if N is 3, the extracted feature blocks are a feature block of a first resolution, a feature block of a second resolution, and a feature block of a third resolution, respectively, where the first resolution is greater than the second resolution, and the second resolution is greater than the third resolution. In this case, the converting the N feature blocks with different resolutions into N feature blocks with the same resolution includes:
and downsampling the feature block with the first resolution to obtain a first feature block.
And performing 1 × 1 convolution on the feature block of the second resolution to obtain a second feature block.
And upsampling the feature block of the third resolution to obtain a third feature block.
The multi-scale feature information comprises a first feature block, a second feature block and a third feature block, and the resolutions of the first feature block, the second feature block and the third feature block are the same.
For example, if the first resolution is 52 × 52, the second resolution is 26 × 26, and the third resolution is 13 × 13, the feature block of the first resolution may be downsampled by a factor of 2 to obtain a first feature block with a resolution of 26 × 26; the feature block of the second resolution may be processed with a 1 × 1 convolution to obtain a second feature block with a resolution of 26 × 26; and the feature block of the third resolution may be upsampled by a factor of 2 to obtain a third feature block with a resolution of 26 × 26.
In this implementation, the resolution of each feature block is converted to be the same as the resolution of one of the feature blocks by downsampling the feature block with a high resolution, upsampling the feature block with a low resolution, and performing 1 × 1 convolution on the feature block with a medium resolution, so that the computational resources consumed for converting the resolution can be reduced. In addition, the feature blocks with different resolutions are converted into the feature blocks with the same resolution by adopting a plurality of sampling modes, so that the loss of features caused by resolution conversion can be reduced, and the obtained multi-scale feature information can more accurately describe the features of the target image.
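For concreteness only, a minimal PyTorch sketch of this conversion is given below, assuming the 52 × 52 / 26 × 26 / 13 × 13 resolutions of the example and placeholder channel counts; whether the downsampling is done by a strided convolution or by pooling is not fixed by the text, so the strided convolution used here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAlign(nn.Module):
    """Aligns three feature blocks to the middle (26x26) resolution:
    2x downsampling for the 52x52 block, a 1x1 convolution for the
    26x26 block, and 2x upsampling for the 13x13 block. The channel
    counts (256 and 512) are illustrative assumptions."""
    def __init__(self, c_high=256, c_mid=512):
        super().__init__()
        # strided convolution as one possible 2x downsampling operator
        self.down = nn.Conv2d(c_high, c_high, kernel_size=3, stride=2, padding=1)
        self.same = nn.Conv2d(c_mid, c_mid, kernel_size=1)  # the 1x1 convolution

    def forward(self, f52, f26, f13):
        b1 = self.down(f52)                                      # 52x52 -> 26x26
        b2 = self.same(f26)                                      # 26x26 -> 26x26
        b3 = F.interpolate(f13, scale_factor=2, mode="nearest")  # 13x13 -> 26x26
        return b1, b2, b3
```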
In some optional implementations of this embodiment, if the multi-scale feature information includes N feature blocks with the same resolution. In this case, the determining coordinate information of the plurality of control points based on the multi-scale feature information includes:
first, a global maximum pooling value of each of N feature blocks having the same resolution is extracted.
The global maximum pooling value is a data value obtained based on the maximum numerical values of the feature maps included in the feature block. As an example, the global maximum pooling value may be the largest of the maximum values of the respective feature maps included in the feature block. As a further example, if the feature block includes 2 feature maps, the maximum value in one feature map is 20 and the maximum value in the other feature map is 26, then the global maximum pooling value may be 26. As another example, the global maximum pooling value may be the second largest of the maximum values of the feature maps included in the feature block.
And then, determining coordinate information of a plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution.
Here, after extracting the global maximum pooling value of each of the N feature blocks having the same resolution, the N global maximum pooling values may be obtained. As an example, the control point prediction model may obtain the coordinate information of the plurality of control points of the target object by inputting at least one of the N global maximum pooling values and the N feature blocks into a linear activation layer, a full connection layer, or the like. Wherein, the linear activation layer can be one or more, and the full connection layer can be one or more.
In this implementation, the more obvious the features of the target image are extracted due to the points in the feature map having the larger corresponding numerical values. Therefore, the maximum numerical value of the feature map is extracted, and the point in the feature map from which the salient feature of the target image is extracted can be obtained, thereby realizing extraction of the salient feature of the target image. The method is beneficial to improving the accuracy of predicting the coordinate information of the control point of the target object.
In some optional implementations of this embodiment, if N is 3, the N feature blocks with the same resolution are respectively a first feature block, a second feature block, and a third feature block, where the first feature block includes i feature maps, the second feature block includes j feature maps, and the third feature block includes k feature maps, and i, j, and k are integers greater than or equal to 1. In this case, extracting the global maximum pooling value of each of the N feature blocks having the same resolution includes:
firstly, the maximum numerical value of each feature map in the i feature maps of the first feature block is obtained, and the average value of the maximum numerical values of the i feature maps is determined as the global maximum pooling value of the first feature block.
For example, if the value of i is 2, the first feature block includes two feature maps. If the maximum value in one feature map is 20 and the maximum value in the other feature map is 26, the global maximum pooling value is 23, since 23 = (20 + 26) ÷ 2.
Then, the maximum numerical value of each feature map in the j feature maps of the second feature block is obtained, and the average value of the maximum numerical values of the j feature maps is determined as the global maximum pooling value of the second feature block.
And finally, acquiring the maximum numerical value of each characteristic diagram in the k characteristic diagrams of the third characteristic block, and determining the average value of the maximum numerical values of the k characteristic diagrams as the global maximum pooling value of the third characteristic block.
In this implementation, for each feature block of N feature blocks with the same resolution, the average of the maximum numerical values of the feature maps included in the feature block is used as the global maximum pooling value of the feature block, so that the features described by the feature maps in the feature block can be considered comprehensively, which is beneficial to extracting stable and reliable significant features of a target image, and further improves the accuracy of predicting the coordinate information of the control point of the target object.
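A minimal sketch of this computation, assuming each feature block is stored as a (channels, height, width) tensor, is given below; it is only illustrative.

```python
import torch

def global_max_pooling_value(feature_block):
    """Global maximum pooling value of one feature block as described
    above: take the maximum of each feature map (channel) and average
    those maxima. `feature_block` is assumed to have shape
    (channels, height, width)."""
    per_map_max = feature_block.flatten(1).max(dim=1).values  # one max per feature map
    return per_map_max.mean()

# Example: two 1x2 feature maps with maxima 20 and 26 give the value 23.
block = torch.tensor([[[20.0, 1.0]], [[26.0, 3.0]]])
assert global_max_pooling_value(block).item() == 23.0
```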
In some optional implementations of this embodiment, the determining, according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution, coordinate information of the plurality of control points includes:
first, a linear activation processing operation is performed on a vector composed of the global maximum pooling values of the feature blocks to obtain first result data.
Here, for convenience of description, a vector composed of the global maximum pooling values of the respective feature blocks may be referred to as a pooling vector.
And then, carrying out full connection operation on the first result data to obtain second result data.
Here, inputting the pooling vector into the linear activation layer typically implements a linear activation processing operation on the pooling vector to obtain first result data, and inputting the first result data into the full-connection layer implements a full-connection operation on the first result data to obtain second result data.
In practice, the control point prediction model may obtain the second result data by:
u = E(x) = σ(W_u · δ(W_d · x))
where u is the second result data, x is the pooling vector, E(x) denotes applying the excitation function to x, W_d is the weight matrix of the linear activation layer, W_u is the weight matrix of the fully connected layer, δ is the Leaky-ReLU activation function, and σ is the Sigmoid activation function.
And finally, obtaining coordinate information of a plurality of control points according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data.
In practice, the control point prediction model may calculate the product of the N feature blocks with the same resolution and the second result data as:
x̃ = F_scale(x_d, u) = x_d · u
where x̃ is the product of the N feature blocks of the same resolution and the second result data, F_scale(x_d, u) denotes performing the product operation on x_d and u, x_d denotes the N feature blocks with the same resolution, and u is the second result data.
Here, since each element in the pooled vector may be used to describe a certain salient feature of the target image, the second result data obtained by sequentially processing the pooled vector through the linear activation layer and the full connection layer may also be used to describe the salient feature of the target image. Here, multiplying the N feature blocks with the same resolution by the second result data can realize that attention is paid to the salient features of the target image on the basis of the features of the target image described by the N feature blocks with the same resolution, thereby further improving the accuracy of the extracted features. The method is favorable for further improving the accuracy of predicting the coordinate information of the control point of the target object.
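The following sketch, under the assumption of PyTorch linear layers standing in for W_d and W_u and a pooling vector with one entry per feature block, illustrates the excitation and rescaling steps described above; the hidden dimension is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Excitation(nn.Module):
    """Sketch of the excitation step: a linear layer with Leaky-ReLU
    activation followed by a fully connected layer with Sigmoid applied
    to the pooling vector x, and the result u used to rescale the N
    same-resolution feature blocks. Dimensions are assumptions."""
    def __init__(self, n_blocks=3, hidden=8):
        super().__init__()
        self.fc_d = nn.Linear(n_blocks, hidden)   # stands in for W_d
        self.fc_u = nn.Linear(hidden, n_blocks)   # stands in for W_u

    def forward(self, pooling_vector, feature_blocks):
        # first result data: linear activation of the pooling vector
        first = F.leaky_relu(self.fc_d(pooling_vector))
        # second result data: full connection followed by Sigmoid
        u = torch.sigmoid(self.fc_u(first))
        # scale each of the N same-resolution feature blocks by its weight
        return [u[i] * block for i, block in enumerate(feature_blocks)]
```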
In some optional implementations of this embodiment, obtaining the coordinate information of the plurality of control points according to the feature of the target image described by the product of the N feature blocks with the same resolution and the second result data includes:
firstly, the object class and the object center point coordinate of the target object are determined according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data.
Here, the N feature blocks with the same resolution are multiplied by the second result data to obtain a data set describing the features of the object represented by the target image, and the data in the data set is used for calculation to obtain information such as the area where the target object is located, the contour of the target object, the object type, and the coordinates of the center point of the object.
Then, the coordinates of the plurality of control points of the target object are determined according to the object type, the coordinates of the center point of the object, and a preset relative position relationship corresponding to the object type.
The relative position relationship is used for describing the position relationship between the coordinates of each control point of the target object and the coordinates of the center point of the object.
Here, after the object type and the object center point coordinates are calculated, the control point prediction model may search for the relative positional relationship corresponding to the current object type from a pre-stored object type-relative positional relationship table. The pre-stored object type-relative position relationship table may be a pre-established correspondence table in which a plurality of object types and relative position relationships are stored.
After the relative position relationship corresponding to the current object type is found, the control point prediction model can directly calculate the coordinates of the control point by adopting the found relative position relationship. The control point prediction model can also scale the searched relative position relation so as to enable the relative position relation after the scale scaling to be matched with the size of the region where the object in the target image is located, and then the coordinates of the control point are calculated by adopting the relative position relation after the scale scaling.
In this implementation, there are usually a plurality of control points, such as 8 control points. In the stage of training the control point prediction model, if the coordinates of the control points are regressed directly, the model is difficult to converge and training takes a long time. Therefore, this implementation first trains the model to predict the object center point and then calculates the coordinates of each control point on the basis of that center point, which reduces the training difficulty and saves training time while still obtaining accurate control point coordinates.
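A toy sketch of the lookup-and-offset computation described above is given below; the table contents, the class name "box" and the scale handling are purely hypothetical.

```python
import numpy as np

# Hypothetical object type -> relative position table: for a "box" class
# the 8 control points are the corners of a unit cube around the centre.
RELATIVE_POSITIONS = {
    "box": np.array([[dx, dy, dz]
                     for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)],
                    dtype=float),
}

def control_points(object_class, center, scale=1.0):
    """Control point coordinates = centre + (optionally scaled) relative
    offsets looked up for the object class. `scale` stands in for the
    scaling to the size of the object region in the target image and is
    an illustrative assumption."""
    offsets = RELATIVE_POSITIONS[object_class]
    return np.asarray(center, dtype=float) + scale * offsets
```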
In some optional implementations of this embodiment, the control point prediction model may be obtained by training as follows:
the method comprises the steps of firstly, obtaining a training sample set, wherein training samples in the training sample set comprise sample images and labeled coordinates of control points of objects represented by the sample images.
Step two, selecting training samples from the training sample set, and executing the following training steps: firstly, inputting a sample image of a selected training sample into an initial model to obtain actual output. Then, the annotation coordinates of the control points of the object represented by the input sample image and the obtained actual output are input into a preset loss function, and a loss value is obtained. And in response to the loss value being smaller than the preset loss threshold value, taking the initial model as a trained control point prediction model.
The above-mentioned loss function is a function for describing a degree of inconsistency between an actual output and a desired output in general. The preset loss threshold may be a preset value. In practice, the predetermined loss threshold is usually very small, e.g., 0.01.
And step three, responding to the loss value being larger than or equal to the preset loss threshold value, adjusting parameters of the initial model to obtain an adjusted initial model, taking the adjusted initial model as the initial model, selecting unselected training samples from the training sample set, and continuing to execute the training step.
In this implementation, a control point prediction model for predicting coordinates of a control point of an object represented in a target image can be obtained by training the initial model.
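The following sketch illustrates the training loop described above under several assumptions (an SGD optimiser, a learning rate, a step cap and a dataset yielding batched tensors), none of which are specified in the text.

```python
import torch

def train_control_point_model(model, dataset, loss_fn, loss_threshold=0.01,
                              lr=1e-3, max_steps=100_000):
    """Sketch of the training procedure: repeatedly pick a training
    sample, compare the actual output with the labelled control point
    coordinates, and stop once the loss drops below the preset
    threshold; otherwise adjust the parameters and continue. `dataset`
    is assumed to yield (sample_image, labelled_coords) tensor pairs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for step, (sample_image, labelled_coords) in enumerate(dataset):
        if step >= max_steps:
            break
        actual_output = model(sample_image)
        loss = loss_fn(actual_output, labelled_coords)
        if loss.item() < loss_threshold:
            return model          # trained control point prediction model
        optimizer.zero_grad()
        loss.backward()           # adjust the parameters of the initial model
        optimizer.step()
    return model
```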
Alternatively, the loss function may include any one or more of the following (1), (2), and (3).
(1) A loss L_pt for the control points (the formula itself appears only as an image in the original filing), which measures the deviation between the computed and the pre-labeled coordinates of the control points, where m is the number of control points, x_i and y_i are the computed abscissa and ordinate of the i-th control point, and the pre-labeled abscissa and ordinate of the i-th control point serve as the regression targets.
(2) L_d = − Σ_{c=1..M} q_c · log(p_c)
where L_d is the loss value for object classification, M is the total number of classes, q_c is the pre-labeled class label and takes the value 0 or 1 (q_c = 0 means the object presented by the sample image does not belong to class c, and q_c = 1 means it does), log() is the logarithmic function, and p_c is the predicted probability that the object presented by the sample image belongs to class c.
(3) L_conf = Σ_{j=1..n} | l_j − σ(s_j) |
where L_conf is the confidence loss for object classification, n is the number of feature data, l_j is the pre-labeled confidence label and takes the value 0 or 1 (l_j = 0 means the confidence of the j-th feature data is 0, and l_j = 1 means it is 1), σ() is the Sigmoid function, s_j is the computed confidence of the j-th feature data, and | · | is the absolute value operator.
In this implementation, the loss functions (1), (2), and (3) compute loss values of the control point prediction model from different perspectives, and any one or more of them may be used as the loss function of the control point prediction model, so that training is completed once the loss value converges (falls below the preset loss threshold). It should be noted that if all of the loss functions (1), (2), and (3) are used together as the loss function of the control point prediction model, the trained control point prediction model tends to be more stable.
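As a hedged illustration of how the three losses could be written in PyTorch: the classification loss (2) and the confidence loss (3) follow directly from the variable definitions above, while the control point loss (1) is given only as an image in the filing, so the squared-error form used here is an assumption.

```python
import torch

def control_point_loss(pred_xy, label_xy):
    """L_pt: deviation between computed and pre-labelled control point
    coordinates; the squared-error sum is an illustrative stand-in for
    the unreproduced formula. pred_xy and label_xy are (m, 2) tensors."""
    return ((pred_xy - label_xy) ** 2).sum()

def classification_loss(probs, one_hot_labels):
    """L_d: cross-entropy over the M object classes, following the
    definitions of q_c and p_c above."""
    return -(one_hot_labels * torch.log(probs)).sum()

def confidence_loss(scores, labels):
    """L_conf: absolute difference between the pre-labelled confidence
    l_j (0 or 1) and the sigmoid of the computed confidence s_j, summed
    over the n feature data; reconstructed from the variable definitions
    and therefore an assumption."""
    return (labels - torch.sigmoid(scores)).abs().sum()
```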
Example two
The embodiment of the present application provides a pose estimation method that further elaborates on the first embodiment; where it is the same as or similar to the first embodiment, reference may be made to the related description of the first embodiment, and details are not repeated here. Referring to fig. 5, the pose estimation method in this embodiment includes:
step 501, collecting a target image presenting a target object.
Step 502, inputting the target image into a pre-trained control point prediction model for processing, and obtaining coordinate information of a plurality of control points.
The control point prediction model is used for processing the target image, wherein the control points are used for describing the outline of the target object, the processing of the target image by the control point prediction model comprises the steps of extracting multi-scale feature information from the target image and determining coordinate information of the control points based on the multi-scale feature information, and the multi-scale feature information is information which is extracted based on multiple preset resolutions and is used for describing the features of the target image.
Step 503, determining the attitude parameters of the target object based on the coordinate information of the plurality of control points.
The attitude parameters are used for converting a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system taking the center of the target object as an origin, and the second coordinate system is a coordinate system of a camera for collecting the target image.
And step 504, determining the pose of the target object according to the pose parameters.
In the present embodiment, the specific operations of steps 501-504 are substantially the same as the operations of steps 101-104 in the embodiment shown in fig. 1, and are not described herein again.
And 505, generating and executing an instruction for controlling the mechanical claw to grab the target object according to the pose of the target object.
The gripper is generally a component of the smart device for gripping an object.
In the present embodiment, the execution subject that executes the above-described pose estimation method is typically an intelligent robot.
In this embodiment, after obtaining the pose of the target object, the execution subject may generate an instruction for controlling the mechanical gripper to grab the target object, and then control the mechanical gripper by executing that instruction.
Here, the execution subject may generate the instruction in various ways. As an example, it may assemble the instruction from a combination of data describing the pose of the target object, or it may first encode the data describing the pose and then combine the encoded data to generate the instruction.
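As a small illustration of the second option, the sketch below encodes pose parameters into a hypothetical JSON grab command; the actual instruction format consumed by the gripper controller is not specified in this document.

```python
import json
import numpy as np

def grab_instruction(R, t, command="GRAB"):
    # Hypothetical encoding: pack the pose parameters into a JSON command that a
    # gripper controller could consume. Field names and format are illustrative only.
    return json.dumps({
        "command": command,
        "rotation": np.asarray(R).reshape(3, 3).tolist(),
        "translation": np.asarray(t).reshape(3).tolist(),
    })
```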
In this embodiment, since an accurate pose of the target object can be obtained through steps 501-504, a more accurate instruction can be generated from that pose, so that the target object can be grabbed precisely.
Example three
This embodiment of the present application provides a pose estimation method that further elaborates on the first embodiment. For content that is the same as or similar to the first embodiment, refer to the related description of the first embodiment; it is not repeated here. Referring to fig. 6, the pose estimation method in this embodiment includes:
step 601, collecting a target image presenting a target object.
Step 602, inputting the target image into a control point prediction model trained in advance for processing, and obtaining coordinate information of a plurality of control points.
The plurality of control points are used for describing the contour of the target object. The processing of the target image by the control point prediction model includes extracting multi-scale feature information from the target image and determining the coordinate information of the plurality of control points based on the multi-scale feature information, where the multi-scale feature information is information extracted at multiple preset resolutions and used for describing the features of the target image.
Step 603, determining the attitude parameters of the target object based on the coordinate information of the plurality of control points.
The attitude parameters are used for conversion between a first coordinate system and a second coordinate system, where the first coordinate system takes the center of the target object as its origin and the second coordinate system is the coordinate system of the camera that collects the target image.
And step 604, determining the pose of the target object according to the pose parameters.
In the present embodiment, the specific operation of steps 601-604 is substantially the same as the operation of steps 101-104 in the embodiment shown in fig. 1, and will not be described herein again.
And step 605, generating and executing, according to the pose of the target object, an instruction for controlling the unmanned vehicle to travel so as to avoid the target object.
In the present embodiment, the execution subject that executes the above-described pose estimation method is typically an unmanned vehicle.
In this embodiment, after obtaining the pose of the target object, the execution subject may generate an instruction for controlling the unmanned vehicle to travel so as to avoid the target object, and then control the unmanned vehicle by executing that instruction.
Here, the execution subject may generate the instruction in various ways. As an example, it may assemble the instruction from a combination of data describing the pose of the target object, or it may first encode the data describing the pose and then combine the encoded data to generate the instruction.
In this embodiment, since an accurate pose of the target object can be obtained through steps 601-604, a more accurate instruction can be generated from that pose, so that the unmanned vehicle can be controlled to travel precisely.
Example four
Fig. 7 shows a network structure diagram of the control point prediction model. As shown in fig. 7, the processing of the input target image by the control point prediction model may include the following specific steps:
First, multi-scale feature information is extracted from the target image using a convolutional network.
Here, the convolutional network may be a darknet network having 24 convolutional layers. The control point prediction model extracts 3 feature blocks of different resolutions from the target image by taking the feature blocks output by the 8th, 13th and 23rd layers of the darknet network, where each feature block includes a plurality of feature maps. The control point prediction model then converts the 3 feature blocks of different resolutions into 3 feature blocks of the same resolution; these 3 same-resolution feature blocks constitute the multi-scale feature information.
The feature maps output by the 8th convolutional layer of the darknet network have a resolution of 52 × 52, those output by the 13th convolutional layer have a resolution of 26 × 26, and those output by the 23rd convolutional layer have a resolution of 13 × 13. These resolutions are integer multiples of one another, so selecting the outputs of the 8th, 13th and 23rd convolutional layers reduces the computational resources consumed when converting the resolutions. In addition, since the darknet network generally offers high speed, a low background false-detection rate and strong generality, using it to extract features from the target image helps to quickly obtain feature maps that accurately describe the features of the target object.
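The exact darknet configuration is not reproduced here; the following sketch (PyTorch is an assumption) only illustrates the idea of tapping the outputs of the 8th, 13th and 23rd convolutional layers of a backbone to obtain three feature blocks of different resolutions.

```python
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    # Illustrative backbone tap: run the layers in sequence and keep the feature
    # blocks produced at three depths (52x52, 26x26 and 13x13 in the text above).
    def __init__(self, backbone_layers, tap_indices=(8, 13, 23)):
        super().__init__()
        self.layers = nn.ModuleList(backbone_layers)
        self.tap_indices = set(tap_indices)

    def forward(self, x):
        taps = []
        for idx, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if idx in self.tap_indices:
                taps.append(x)
        return taps  # three feature blocks at different resolutions
```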
Second, the extracted multi-scale feature information is input into a pooling layer to obtain the global maximum pooling value of each feature block in the multi-scale feature information.
Here, the global maximum pooling value is a data value obtained based on the maximum numerical value of a plurality of feature maps included in the feature block.
Specifically, the average of the maximum values of the feature maps included in a feature block may be used as the global maximum pooling value of that feature block. For example, if i is 2, the first feature block includes two feature maps; if the maximum value in one feature map is 20 and the maximum value in the other is 26, the global maximum pooling value is 23, since 23 = (20 + 26) ÷ 2.
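A minimal sketch of this pooling rule, assuming PyTorch tensors:

```python
import torch

def global_max_pooling_value(block):
    # block: feature block of shape (C, H, W); take the maximum of each of the
    # C feature maps, then average those maxima into one scalar for the block.
    return block.amax(dim=(1, 2)).mean()

# A block whose two feature maps have maxima 20 and 26 yields (20 + 26) / 2 = 23.
```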
Third, a vector composed of the global maximum pooling values of the feature blocks is input into a linear activation layer to obtain first result data.
Here, the activation function of the linear activation layer is the Leaky ReLU function.
Fourth, the first result data is input into the fully connected layer to obtain second result data.
Here, the activation function of the fully connected layer is the sigmoid function.
Fifth, the coordinate information of the control points is obtained from the features of the target image described by the product of the multi-scale feature information and the second result data.
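Steps two to five resemble a gating (attention-like) mechanism. The sketch below assumes one weight per feature block and a hidden width of 16; the document does not specify the dimensionality of the first and second result data, so these shapes are illustrative only.

```python
import torch
import torch.nn as nn

class PoolingGate(nn.Module):
    # Pool each same-resolution feature block to a scalar, pass the resulting
    # vector through a LeakyReLU linear layer and a sigmoid fully connected
    # layer, then rescale the feature blocks by the resulting weights.
    def __init__(self, n_blocks=3, hidden=16):
        super().__init__()
        self.linear = nn.Sequential(nn.Linear(n_blocks, hidden), nn.LeakyReLU(0.1))
        self.fc = nn.Sequential(nn.Linear(hidden, n_blocks), nn.Sigmoid())

    def forward(self, blocks):
        # blocks: list of N tensors of shape (C, H, W) with identical resolution
        pooled = torch.stack([b.amax(dim=(1, 2)).mean() for b in blocks])
        weights = self.fc(self.linear(pooled))           # "second result data"
        return [w * b for w, b in zip(weights, blocks)]  # weighted feature blocks
```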
In this embodiment, extracting feature blocks at different resolutions allows richer features of the target image to be captured. Converting these feature blocks into feature blocks of the same resolution then makes it possible to analyze the features obtained at the various resolutions jointly, so that more accurate features of the target image are extracted and the coordinate information of the control points of the target object can be predicted accurately. In addition, points with larger values in a feature map correspond to more salient features of the target image. Using the average of the maximum values of the feature maps in a feature block as the global maximum pooling value of that block therefore takes the features described by all feature maps in the block into account, extracts stable and reliable salient features of the target image, and further improves the accuracy of predicting the coordinate information of the control points of the target object.
Example five
Fig. 8 shows a block diagram of a pose estimation apparatus 800 provided in an embodiment of the present application, corresponding to the pose estimation method in the above embodiments; for convenience of explanation, only the parts related to this embodiment of the present application are shown.
Referring to fig. 8, the apparatus includes:
an image acquisition unit 801 for acquiring a target image representing a target object;
a coordinate prediction unit 802, configured to input a target image into a pre-trained control point prediction model for processing, so as to obtain coordinate information of a plurality of control points, where the plurality of control points are used to describe a contour of a target object, and the processing of the target image by the control point prediction model includes extracting multi-scale feature information from the target image, and determining coordinate information of the plurality of control points based on the multi-scale feature information, where the multi-scale feature information is information extracted based on multiple preset resolutions and used to describe features of the target image;
a parameter determining unit 803, configured to determine a pose parameter of the target object based on the coordinate information of the plurality of control points, where the pose parameter is used for conversion between a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system with a center of the target object as an origin, and the second coordinate system is a coordinate system of a camera that acquires the target image;
and a pose determining unit 804, configured to determine a pose of the target object according to the pose parameters.
In one embodiment, extracting multi-scale feature information from a target image comprises:
extracting N characteristic blocks with different resolutions from a target image;
and converting the N characteristic blocks with different resolutions into N characteristic blocks with the same resolution to obtain multi-scale characteristic information, wherein N is an integer greater than 1.
In one embodiment, N is 3, the extracted feature blocks are respectively a feature block of a first resolution, a feature block of a second resolution and a feature block of a third resolution, the first resolution is greater than the second resolution, and the second resolution is greater than the third resolution; and
converting the N feature blocks with different resolutions into N feature blocks with the same resolution, comprising:
downsampling the feature block with the first resolution to obtain a first feature block;
performing 1 × 1 convolution on the feature block of the second resolution to obtain a second feature block;
up-sampling the feature block of the third resolution to obtain a third feature block;
the multi-scale feature information includes the first feature block, the second feature block and the third feature block, all of which have the same resolution; a sketch of this conversion is given below.
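The text names only the 1 × 1 convolution explicitly; the down- and up-sampling operators are not specified, so the sketch below uses average pooling and nearest-neighbour interpolation as placeholder choices and assumes the channel counts are left unchanged.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAligner(nn.Module):
    # Downsample the first-resolution block, apply a 1x1 convolution to the
    # second-resolution block and upsample the third-resolution block so that
    # all three share the middle resolution.
    def __init__(self, c_mid):
        super().__init__()
        self.conv1x1 = nn.Conv2d(c_mid, c_mid, kernel_size=1)

    def forward(self, f_high, f_mid, f_low):
        first = F.avg_pool2d(f_high, kernel_size=2)                   # e.g. 52x52 -> 26x26
        second = self.conv1x1(f_mid)                                  # stays 26x26
        third = F.interpolate(f_low, scale_factor=2, mode="nearest")  # 13x13 -> 26x26
        return first, second, third
```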
In one embodiment, the multi-scale feature information includes N feature blocks of the same resolution; and
determining coordinate information of a plurality of control points based on the multi-scale feature information, including:
extracting the global maximum pooling value of each feature block in the N feature blocks with the same resolution;
and determining coordinate information of a plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution.
In one embodiment, N is 3, the N feature blocks with the same resolution are respectively a first feature block, a second feature block and a third feature block, the first feature block includes i feature maps, the second feature block includes j feature maps, the third feature block includes k feature maps, and i, j and k are integers greater than or equal to 1; and
extracting a global maximum pooling value of each feature block in the N feature blocks with the same resolution, comprising:
acquiring the maximum numerical value of each characteristic diagram in the i characteristic diagrams of the first characteristic block, and determining the average value of the maximum numerical values of the i characteristic diagrams as the global maximum pooling value of the first characteristic block;
acquiring the maximum numerical value of each characteristic diagram in the j characteristic diagrams of the second characteristic block, and determining the average value of the maximum numerical values of the j characteristic diagrams as the global maximum pooling value of the second characteristic block;
and acquiring the maximum numerical value of each characteristic diagram in the k characteristic diagrams of the third characteristic block, and determining the average value of the maximum numerical values of the k characteristic diagrams as the global maximum pooling value of the third characteristic block.
In one embodiment, determining coordinate information of a plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution comprises:
carrying out linear activation processing operation on a vector consisting of the global maximum pooling values of the feature blocks to obtain first result data;
performing full connection operation on the first result data to obtain second result data;
and obtaining coordinate information of the plurality of control points according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data.
In one embodiment, obtaining coordinate information of a plurality of control points according to the features of the target image described by the product of the N feature blocks with the same resolution and the second result data comprises:
determining the object type and the object center point coordinates of the target object according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data;
and determining the coordinates of the plurality of control points of the target object according to the object type, the object center point coordinates and a preset relative position relationship corresponding to the object type, where the relative position relationship describes the positional relationship between the coordinates of each control point of the target object and the coordinates of the object center point; a sketch of this step is given below.
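A minimal sketch of this placement rule; the offset table and class names below are hypothetical, since the document does not give concrete relative position relationships.

```python
from typing import Dict, List, Tuple

# Hypothetical per-class offsets (dx, dy) of each control point relative to the
# object center; real values would come from the preset relative position relationship.
RELATIVE_OFFSETS: Dict[str, List[Tuple[float, float]]] = {
    "box": [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)],
}

def control_points(obj_class: str, center: Tuple[float, float],
                   scale: float = 1.0) -> List[Tuple[float, float]]:
    # Place each control point at the object center plus its class-specific offset.
    cx, cy = center
    return [(cx + scale * dx, cy + scale * dy) for dx, dy in RELATIVE_OFFSETS[obj_class]]
```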
In one embodiment, the control point prediction model is trained by:
acquiring a training sample set, wherein training samples in the training sample set comprise sample images and labeled coordinates of control points of objects represented by the sample images;
selecting training samples from a set of training samples, and performing the following training steps: inputting a sample image of the selected training sample into an initial model to obtain actual output; inputting the marking coordinates of the control points of the object presented by the input sample image and the obtained actual output into a preset loss function to obtain a loss value; in response to the fact that the loss value is smaller than a preset loss threshold value, taking the initial model as a trained control point prediction model;
and in response to the loss value being greater than or equal to the preset loss threshold, adjusting the parameters of the initial model to obtain an adjusted initial model, taking the adjusted initial model as the initial model, selecting unselected training samples from the training sample set, and continuing to execute the training step; a sketch of this loop is given below.
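The sketch below mirrors this training loop; the optimizer, loss function and threshold are placeholders, and gradient-based parameter adjustment (assumed here) is only one possible way to adjust the initial model.

```python
import random

def train(model, samples, loss_fn, optimizer, loss_threshold=1e-3):
    # samples: iterable of (sample_image, labeled_control_point_coordinates)
    pool = list(samples)
    random.shuffle(pool)                            # select training samples
    for image, labeled_coords in pool:
        predicted = model(image)                    # actual output
        loss = loss_fn(predicted, labeled_coords)   # preset loss function
        if loss.item() < loss_threshold:
            return model                            # trained control point prediction model
        optimizer.zero_grad()                       # otherwise adjust the parameters of the
        loss.backward()                             # initial model and continue with an
        optimizer.step()                            # unselected training sample
    return model
```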
The apparatus provided in this embodiment inputs the acquired target image into a control point prediction model trained in advance, and predicts the coordinates of a plurality of control points of the target object presented by the target image through the control point prediction model. It then determines, from the predicted control point coordinates, the pose parameters used for conversion between the coordinate system of the target object and the coordinate system of the camera, and determines the pose of the object from those pose parameters. Because the distribution of the control points follows the distribution of the target object in its coordinate system, determining the pose parameters from the control point coordinates yields the relative position and rotation angle of the target object in the camera coordinate system, which helps to obtain a more accurate pose of the target object. In addition, different features can be extracted from the target image at different resolutions, and extracting features at multiple resolutions yields rich and accurate features. Since the control point prediction model predicts the coordinate information of the control points from multi-scale feature information extracted at multiple preset resolutions, the predicted coordinate information is more accurate, which further improves the accuracy of the pose prediction of the target object.
It should be noted that the information interaction between the above devices/units, their execution processes, and their specific functions and technical effects are based on the same concept as the method embodiments of the present application; for details, refer to the method embodiments, which are not repeated here.
Example six
Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 of this embodiment includes: at least one processor 901 (only one processor is shown in fig. 9), a memory 902, and a computer program 903, such as a pose estimation program, stored in the memory 902 and executable on the at least one processor 901. When the processor 901 executes the computer program 903, the steps in any of the method embodiments described above (including the pose estimation method embodiments) are implemented, as are the functions of each module/unit in the device embodiments described above, for example, the functions of the units 801 to 804 shown in fig. 8.
Illustratively, the computer program 903 may be divided into one or more modules/units, which are stored in the memory 902 and executed by the processor 901 to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of computer program 903 in electronic device 900. For example, the computer program 903 may be divided into an image acquisition unit, a coordinate prediction unit, a parameter determination unit, and a pose determination unit, and specific functions of each unit are described in the foregoing embodiments, and are not described herein again.
The electronic device 900 may be a server, a desktop computer, a tablet computer, a cloud server, a mobile terminal, and other computing devices. The electronic device 900 may include, but is not limited to, a processor 901, a memory 902. Those skilled in the art will appreciate that fig. 9 is merely an example of an electronic device 900 and does not constitute a limitation of the electronic device 900 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The Processor 901 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 902 may be an internal storage unit of the electronic device 900, such as a hard disk or memory of the electronic device 900. The memory 902 may also be an external storage device of the electronic device 900, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the electronic device 900. Further, the memory 902 may include both an internal storage unit and an external storage device of the electronic device 900. The memory 902 is used for storing computer programs and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the apparatus/electronic device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other division manners are possible in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above may be implemented by a computer program, which is stored in a computer readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A pose estimation method, wherein the method comprises:
acquiring a target image presenting a target object;
inputting the target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points, wherein the plurality of control points are used for describing the contour of the target object, the processing of the target image by the control point prediction model comprises extracting multi-scale feature information from the target image and determining the coordinate information of the plurality of control points based on the multi-scale feature information, and the multi-scale feature information is information extracted based on a plurality of preset resolutions and used for describing features of the target image;
determining attitude parameters of the target object based on the coordinate information of the control points, wherein the attitude parameters are used for converting a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system taking the center of the target object as an origin, and the second coordinate system is a coordinate system of a camera for acquiring the target image;
and determining the pose of the target object according to the pose parameters.
2. The method of claim 1, wherein the extracting multi-scale feature information from the target image comprises:
extracting N characteristic blocks with different resolutions from the target image;
and converting the N characteristic blocks with different resolutions into N characteristic blocks with the same resolution to obtain the multi-scale characteristic information, wherein N is an integer greater than 1.
3. The method according to claim 2, wherein N is 3, the extracted feature blocks are respectively a feature block of a first resolution, a feature block of a second resolution, and a feature block of a third resolution, the first resolution is greater than the second resolution, and the second resolution is greater than the third resolution; and
the converting the N feature blocks with different resolutions into N feature blocks with the same resolution includes:
performing downsampling on the feature block with the first resolution to obtain a first feature block;
performing 1 × 1 convolution on the feature block of the second resolution to obtain a second feature block;
up-sampling the feature block of the third resolution to obtain a third feature block;
wherein the multi-scale feature information includes the first feature block, the second feature block, and the third feature block, and resolutions of the first feature block, the second feature block, and the third feature block are the same.
4. The method according to one of claims 1-3, wherein the multi-scale feature information comprises N feature blocks of the same resolution; and
the determining coordinate information of the plurality of control points based on the multi-scale feature information comprises:
extracting the global maximum pooling value of each feature block in the N feature blocks with the same resolution;
and determining the coordinate information of the plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution.
5. The method according to claim 4, wherein N is 3, the N feature blocks with the same resolution are respectively a first feature block, a second feature block and a third feature block, the first feature block comprises i feature maps, the second feature block comprises j feature maps, the third feature block comprises k feature maps, and i, j and k are integers greater than or equal to 1; and
the extracting the global maximum pooling value of each feature block in the N feature blocks with the same resolution includes:
acquiring the maximum numerical value of each characteristic diagram in the i characteristic diagrams of the first characteristic block, and determining the average value of the maximum numerical values of the i characteristic diagrams as the global maximum pooling value of the first characteristic block;
acquiring the maximum numerical value of each characteristic diagram in the j characteristic diagrams of the second characteristic block, and determining the average value of the maximum numerical values of the j characteristic diagrams as the global maximum pooling value of the second characteristic block;
and acquiring the maximum numerical value of each characteristic diagram in the k characteristic diagrams of the third characteristic block, and determining the average value of the maximum numerical values of the k characteristic diagrams as the global maximum pooling value of the third characteristic block.
6. The method of claim 4, wherein the determining the coordinate information of the plurality of control points according to the global maximum pooling value of each feature block and the N feature blocks with the same resolution comprises:
carrying out linear activation processing operation on a vector consisting of the global maximum pooling values of the feature blocks to obtain first result data;
performing full connection operation on the first result data to obtain second result data;
and obtaining coordinate information of the plurality of control points according to the characteristics of the target image described by the product of the N characteristic blocks with the same resolution and the second result data.
7. The method according to claim 6, wherein the obtaining the coordinate information of the plurality of control points according to the feature of the target image described by the product of the N feature blocks with the same resolution and the second result data comprises:
determining the object class and the object center point coordinate of the target object according to the feature of the target image described by the product of the N feature blocks with the same resolution and the second result data;
and determining the coordinates of a plurality of control points of the target object according to the object type, the coordinates of the object center point and a preset relative position relation corresponding to the object type, wherein the relative position relation is used for describing the position relation between the coordinates of each control point of the target object and the coordinates of the object center point.
8. The method of claim 1, wherein the control point prediction model is trained by:
acquiring a training sample set, wherein training samples in the training sample set comprise sample images and labeled coordinates of control points of objects represented by the sample images;
selecting training samples from the set of training samples, and performing the following training steps: inputting a sample image of the selected training sample into an initial model to obtain actual output; inputting the marking coordinates of the control points of the object presented by the input sample image and the obtained actual output into a preset loss function to obtain a loss value; in response to the fact that the loss value is smaller than a preset loss threshold value, taking the initial model as a trained control point prediction model;
and responding to the loss value being larger than or equal to the preset loss threshold value, adjusting parameters of the initial model to obtain an adjusted initial model, taking the adjusted initial model as the initial model, selecting unselected training samples from the training sample set, and continuing to execute the training step.
9. A pose estimation apparatus, wherein the apparatus comprises:
the image acquisition unit is used for acquiring a target image presenting a target object;
the coordinate prediction unit is used for inputting the target image into a pre-trained control point prediction model for processing to obtain coordinate information of a plurality of control points, wherein the plurality of control points are used for describing the outline of the target object, the processing of the target image by the control point prediction model comprises extracting multi-scale feature information from the target image and determining the coordinate information of the plurality of control points based on the multi-scale feature information, and the multi-scale feature information is information which is extracted based on a plurality of preset resolutions and used for describing the features of the target image;
a parameter determination unit, configured to determine a posture parameter of the target object based on coordinate information of the plurality of control points, where the posture parameter is used for conversion between a first coordinate system and a second coordinate system, the first coordinate system is a coordinate system with a center of the target object as an origin, and the second coordinate system is a coordinate system of a camera that acquires the target image;
and the pose determining unit is used for determining the pose of the target object according to the pose parameters.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 8.
CN202010802444.2A 2020-08-11 2020-08-11 Pose estimation method, device, equipment and medium Pending CN112085789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010802444.2A CN112085789A (en) 2020-08-11 2020-08-11 Pose estimation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112085789A (zh) 2020-12-15

Family

ID=73735712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010802444.2A Pending CN112085789A (en) 2020-08-11 2020-08-11 Pose estimation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112085789A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190244050A1 (en) * 2017-07-07 2019-08-08 Tencent Technology (Shenzhen) Company Limited Method, device and storage medium for determining camera posture information
CN110119148A (en) * 2019-05-14 2019-08-13 深圳大学 A kind of six-degree-of-freedom posture estimation method, device and computer readable storage medium
CN110322510A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of 6D position and orientation estimation method using profile information
CN111161349A (en) * 2019-12-12 2020-05-15 中国科学院深圳先进技术研究院 Object attitude estimation method, device and equipment
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN111445523A (en) * 2020-03-25 2020-07-24 中国农业科学院农业信息研究所 Fruit pose calculation method and device, computer equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022142214A1 (en) * 2020-12-28 2022-07-07 北京市商汤科技开发有限公司 Vehicle pose determination method and apparatus, vehicle control method and apparatus, vehicle, and storage medium
CN112561990A (en) * 2021-01-21 2021-03-26 禾多科技(北京)有限公司 Positioning information generation method, device, equipment and computer readable medium
CN113689484A (en) * 2021-08-25 2021-11-23 北京三快在线科技有限公司 Method and device for determining depth information, terminal and storage medium
CN113689484B (en) * 2021-08-25 2022-07-15 北京三快在线科技有限公司 Method and device for determining depth information, terminal and storage medium
CN113553876A (en) * 2021-09-22 2021-10-26 长沙海信智能系统研究院有限公司 Bar code identification method, device, equipment and storage medium
CN114963025A (en) * 2022-04-19 2022-08-30 深圳市城市公共安全技术研究院有限公司 Leakage point positioning method and device, electronic equipment and readable storage medium
CN114963025B (en) * 2022-04-19 2024-03-26 深圳市城市公共安全技术研究院有限公司 Leakage point positioning method and device, electronic equipment and readable storage medium
CN115414117A (en) * 2022-08-31 2022-12-02 北京长木谷医疗科技有限公司 Method and device for determining position coordinates of execution tail end of orthopedic surgery robot

Similar Documents

Publication Publication Date Title
CN112085789A (en) Pose estimation method, device, equipment and medium
CN111328396B (en) Pose estimation and model retrieval for objects in images
CN110363058B (en) Three-dimensional object localization for obstacle avoidance using one-shot convolutional neural networks
CN110458095B (en) Effective gesture recognition method, control method and device and electronic equipment
CN112287860B (en) Training method and device of object recognition model, and object recognition method and system
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
CN111523414A (en) Face recognition method and device, computer equipment and storage medium
CN110796686A (en) Target tracking method and device and storage device
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
US20220392201A1 (en) Image feature matching method and related apparatus, device and storage medium
CN114556268B (en) Gesture recognition method and device and storage medium
US20220277581A1 (en) Hand pose estimation method, device and storage medium
CN113807361B (en) Neural network, target detection method, neural network training method and related products
JP2023549036A (en) Efficient 3D object detection from point clouds
CN112287859A (en) Object recognition method, device and system, computer readable storage medium
CN113762003B (en) Target object detection method, device, equipment and storage medium
CN115457492A (en) Target detection method and device, computer equipment and storage medium
CN114387513A (en) Robot grabbing method and device, electronic equipment and storage medium
CN113795867A (en) Object posture detection method and device, computer equipment and storage medium
CN111914756A (en) Video data processing method and device
CN112395962A (en) Data augmentation method and device, and object identification method and system
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
CN115082498A (en) Robot grabbing pose estimation method, device, equipment and storage medium
CN114091515A (en) Obstacle detection method, obstacle detection device, electronic apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination