WO2023015409A1 - Object pose detection method and apparatus, computer device, and storage medium - Google Patents

Object pose detection method and apparatus, computer device, and storage medium

Info

Publication number
WO2023015409A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
dimensional
image data
bounding box
coordinate system
Prior art date
Application number
PCT/CN2021/111502
Other languages
English (en)
French (fr)
Inventor
井雪
陈德健
陈建强
蔡佳然
项伟
Original Assignee
百果园技术(新加坡)有限公司
井雪
Priority date
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司 and 井雪
Priority to CN202180002185.8A (CN113795867A)
Priority to PCT/CN2021/111502 (WO2023015409A1)
Priority to EP21953048.2A (EP4365841A1)
Publication of WO2023015409A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • The embodiments of the present application relate to the technical field of computer vision, for example, to an object pose detection method and apparatus, a computer device, and a storage medium.
  • In business scenarios such as short video, live streaming, autonomous driving, AR (Augmented Reality), and robotics, 3D (three-dimensional) object detection is usually performed: the target object is detected in three-dimensional space, and the result is used for business processing such as adding special effects, route planning, and motion trajectory planning.
  • Embodiments of the present application provide a method, device, computer equipment, and storage medium for detecting an object posture.
  • An embodiment of the present application provides an object pose detection method, including:
  • acquiring image data, the image data containing a target object;
  • An embodiment of the present application also provides an object pose detection apparatus, including:
  • an image data acquisition module, configured to acquire image data, the image data containing a target object;
  • a first pose information detection module, configured to input the image data into a two-dimensional detection model and detect two-dimensional first pose information of a three-dimensional bounding box projected onto the image data, the bounding box being used to detect the target object;
  • a second pose information mapping module, configured to map the first pose information to three-dimensional second pose information;
  • a third pose information detection module, configured to detect third pose information of the target object according to the second pose information.
  • the embodiment of the present application also provides a computer device, the computer device comprising:
  • a memory configured to store at least one program
  • the at least one processor is configured to execute the at least one program to implement the object posture detection method as described in the first aspect.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the object pose detection method described in the first aspect is implemented.
  • FIG. 1 is a flow chart of a method for detecting an object posture provided by an embodiment of the present application
  • FIG. 2 is an example diagram of detecting the posture of a target object provided by an embodiment of the present application
  • FIG. 3 is an example diagram of a one-stage network provided by another embodiment of the present application.
  • FIG. 4 is an example diagram of a two-stage network provided by another embodiment of the present application.
  • FIG. 5 is a flowchart of a method for detecting an object posture provided by another embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an object posture detection device provided by another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device provided by another embodiment of the present application.
  • FIG. 8 is a flow chart of detecting two-dimensional first pose information when a three-dimensional bounding box is projected onto image data in an exemplary embodiment of the present application;
  • FIG. 9 is a flow chart of detecting two-dimensional first pose information when a three-dimensional bounding box is projected onto image data in another exemplary embodiment of the present application.
  • Fig. 10 is a flow chart of mapping the first pose information into three-dimensional second pose information in another exemplary embodiment of the present application.
  • Fig. 11 is a flow chart of detecting third pose information of a target object according to second pose information in another exemplary embodiment of the present application.
  • 3D object detection methods can be mainly divided into the following four categories according to different input forms:
  • The first category, monocular image, inputs a single frame of image data captured by a single camera.
  • The second category, binocular image, inputs two frames of image data captured by a binocular camera from two directions.
  • The third category, point cloud, uses the data of points in space collected by lidar.
  • The fourth category combines point cloud and monocular image, that is, a frame of image data captured by a single camera and the point data of space collected by lidar are input at the same time.
  • For mobile terminals, binocular cameras and lidar are structurally complex, difficult to port to the mobile terminal, and costly, so monocular images are usually used.
  • 3D object detection based on monocular images is mostly improved from CenterNet (central network), where the information of the object is directly estimated end-to-end by the network.
  • However, this type of method is sensitive to rotation estimation; even a slight rotation error of 0.01 produces relatively large deviations in the object information, resulting in poor stability and accuracy.
  • the embodiment of the present application discloses a method, device, computer equipment and storage medium for detecting the attitude of an object, so as to improve the stability and accuracy of 3D object detection.
  • Fig. 1 is a flow chart of a method for detecting an object pose provided by an embodiment of the present application.
  • In this embodiment, the 2D (two-dimensional) pose of the bounding box is mapped to a 3D pose, and the 3D pose of the object is detected.
  • The object pose detection method described in the embodiments of the present application can be executed by an object pose detection apparatus. The apparatus can be implemented by software and/or hardware and configured in a computer device such as a mobile terminal. Such computer devices include, for example, mobile phones, tablet computers, and smart wearable devices; smart wearable devices include, for example, smart glasses and smart watches.
  • Step 101 acquire image data.
  • Operating systems such as Android, iOS, and HarmonyOS can be installed on the computer device, and users can install the applications they need in these operating systems, such as live streaming applications, short video applications, beauty applications, conferencing applications, and so on.
  • A computer device may be equipped with one or more cameras, which can be installed on the front of the device (front cameras) or on the back of the device (rear cameras).
  • These applications can use image data from the local gallery or a network gallery of the computer device as the image data to be processed, or can call the camera to capture image data, and so on.
  • the target object can be set according to the requirements of the business scene, for example, a cup 201, a notebook, a pen, a display screen, etc. as shown in FIG. 2 .
  • these applications call the camera to collect video data facing the target object.
  • the video data contains multiple frames of image data, and the target object is tracked in the multiple frames of image data by methods such as Kalman filtering and optical flow method.
  • Step 102 Input image data into a two-dimensional detection model, and detect two-dimensional first pose information when a three-dimensional bounding box is projected onto the image data.
  • The target object exists in real three-dimensional space, and a three-dimensional bounding box can be used to describe the pose of the target object in that space. As shown in FIG. 2, the shape of the three-dimensional bounding box may include a cuboid 202, a cylinder, a sphere, and so on; the three-dimensional bounding box surrounds the target object 201 and can be used to detect the target object 201.
  • When the camera captures the target object into image data, the target object is represented as two-dimensional pixels, and the three-dimensional bounding box, following the target object, is also projected into the image data and represented as two-dimensional pixels. At this point, the pose presented by the three-dimensional bounding box in the two-dimensional image data can be calculated; it is recorded as the first pose information.
  • a model for detecting the first pose information of the bounding box of the target object may be pre-trained, which is denoted as a two-dimensional detection model, for example, mobilenetV2, shufflenetV2, and so on.
  • For video data, every frame of image data can be input into the two-dimensional detection model for detection, or image data can be input into the model at fixed frame intervals, with the predictions for the intermediate frames obtained by tracking.
  • Passing every frame through the model yields a result for each frame but is time-consuming and causes serious latency. Alternatively, only selected frames are passed through the model (for example, frame 0 and frame 5), and the result for an intermediate frame (for example, frame 1) is obtained by tracking from the result of frame 0, so that a result is still available for every frame (a minimal scheduling sketch is given below).
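  • The following Python sketch illustrates this frame-interval strategy. The detection and tracking functions are supplied by the caller (for example, the two-dimensional detection model and a Kalman-filter or optical-flow tracker); their names and signatures are illustrative assumptions, not APIs defined in this document.

```python
def process_video(frames, detect_fn, track_fn, interval=5):
    """Run the 2D detection model only every `interval` frames and obtain the
    results for intermediate frames by tracking from the previous result."""
    results = []
    for i, frame in enumerate(frames):
        if i % interval == 0:
            results.append(detect_fn(frame))        # full model inference
        else:
            # propagate the previous result to the current frame by tracking
            results.append(track_fn(results[-1], frames[i - 1], frame))
    return results
```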
  • Step 103 mapping the first pose information into 3D second pose information.
  • Given the first pose information of the target object in the camera (image) coordinate system, solving for the second pose information in the world coordinate system can be regarded as a PnP (Perspective-n-Point) problem.
  • The first pose information is mapped to the three-dimensional second pose information in the world coordinate system through pose estimation algorithms such as PnP, DLT (Direct Linear Transform), EPnP (Efficient PnP), or UPnP.
  • the first pose information includes a center point, a vertex, and a depth
  • the vertex of the target object in the image coordinate system can be mapped to a vertex in the world coordinate system through a pose estimation algorithm.
  • depth can refer to the distance of an object from the camera when the camera takes the picture.
  • the vertices may refer to 8 vertices of the cuboid.
  • The EPnP algorithm is well suited to solving for the camera pose from 3D-2D point correspondences.
  • In this embodiment, the EPnP algorithm is used to map the 2D points (such as the vertices) in the image to 3D points (such as the vertices) in the camera coordinate system.
  • The depth of the center point predicted by the model is divided by the depth estimated by EPnP to obtain a ratio; each vertex is multiplied by this ratio to obtain the 3D point (such as a vertex) in the camera coordinate system at the real depth, and this 3D point is then multiplied by the camera extrinsics to obtain the 3D point (such as a vertex) in the world coordinate system, as sketched below.
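  • A minimal numpy sketch of this rescaling and coordinate transfer is given below, assuming the EPnP output, the predicted center depth, and a 4x4 camera-to-world extrinsic matrix are already available; all function and variable names are illustrative.

```python
import numpy as np

def rescale_and_transform(vertices_cam, center_depth_pred, center_depth_epnp, extrinsic):
    """Rescale EPnP camera-frame points to the predicted depth and move them to
    the world frame.

    vertices_cam      : (N, 3) vertices in the camera frame from EPnP (up to scale)
    center_depth_pred : depth of the center point predicted by the model
    center_depth_epnp : depth of the center point estimated by EPnP
    extrinsic         : (4, 4) camera-to-world transform (camera extrinsics)
    """
    ratio = center_depth_pred / center_depth_epnp        # recover the real scale
    vertices_cam = vertices_cam * ratio                   # vertices at the real depth
    ones = np.ones((vertices_cam.shape[0], 1))
    vertices_h = np.hstack([vertices_cam, ones])          # homogeneous coordinates (N, 4)
    vertices_world = (extrinsic @ vertices_h.T).T[:, :3]  # world-frame vertices (N, 3)
    return vertices_world
```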
  • Step 104 Detect third posture information of the target object according to the second posture information.
  • Since the bounding box surrounds the target object, the pose of the target object in the world coordinate system can be detected from the bounding box; it is recorded as the three-dimensional third pose information.
  • the second pose information of the bounding box in the world coordinate system includes multiple vertices, and the position and orientation of the target object in the world coordinate system can be calculated according to the multiple vertices as the third pose information.
  • the first attitude information is two-dimensional attitude information, which corresponds to a bounding box and is in the image coordinate system.
  • the second attitude information is three-dimensional attitude information, which corresponds to a bounding box and is in the camera coordinate system.
  • the third attitude information is three-dimensional attitude information, which corresponds to the target object and is in the world coordinate system.
  • In other words, the bounding box in the 2D image is mapped back to the bounding box of the 3D object.
  • In this embodiment, image data containing a target object is acquired and input into a two-dimensional detection model; the two-dimensional first pose information of a three-dimensional bounding box projected onto the image data is detected, the bounding box being used to detect the target object; the first pose information is mapped to three-dimensional second pose information; and the third pose information of the target object is detected according to the second pose information.
  • By predicting the projection of the bounding box on the image data and restoring the 3D pose information from that projection, the jitter caused by slight errors in a predicted rotation angle is avoided; compared with directly predicting 3D pose information, this embodiment achieves higher accuracy and a more stable effect.
  • the two-dimensional detection model is an independent and complete network, that is, a one-stage network.
  • The two-dimensional detection model includes an encoder 310, a decoder 320, and a prediction network 330. In this exemplary embodiment, as shown in FIG. 8, step 102 may include the following steps 1021, 1022, and 1023.
  • Step 1021 Encode the image data in an encoder to obtain a first image feature.
  • an encoder may read the entire source data (ie image data) as a fixed-length code (ie first image feature).
  • The encoder 310 includes a convolutional layer (Conv Layer) 311, a first residual network 312, a second residual network 313, a third residual network 314, a fourth residual network 315, and a fifth residual network 316, where the first residual network 312, the second residual network 313, the third residual network 314, the fourth residual network 315, and the fifth residual network 316 each include one or more bottleneck residual blocks; the number of output channels of a bottleneck residual block is n times the number of input channels, where n is a positive integer, such as 4.
  • The image data is convolved in the convolution layer 311 to obtain first-level features; the first-level features are processed in the first residual network 312 to obtain second-level features; the second-level features are processed in the second residual network 313 to obtain third-level features; the third-level features are processed in the third residual network 314 to obtain fourth-level features; the fourth-level features are processed in the fourth residual network 315 to obtain fifth-level features; and the fifth-level features are processed in the fifth residual network 316 to obtain sixth-level features.
  • When a residual network contains multiple layers of bottleneck residual blocks, the output of the bottleneck residual block in one layer is the input of the bottleneck residual block in the next layer.
  • the first-level features, the second-level features, the third-level features, the fourth-level features, the fifth-level features, and the sixth-level features are all first image features.
  • The number of bottleneck residual blocks in the first residual network 312 is smaller than that in the second residual network 313 (for example, 1 layer of bottleneck residual blocks in the first residual network 312 and 2 layers in the second residual network 313); the number in the second residual network 313 is smaller than that in the third residual network 314; the number in the third residual network 314 is smaller than that in the fourth residual network 315; and the number in the fourth residual network 315 is equal to that in the fifth residual network 316.
  • The dimensions of the second-level features are higher than those of the third-level features, the dimensions of the third-level features are higher than those of the fourth-level features, the dimensions of the fourth-level features are higher than those of the fifth-level features, and the dimensions of the fifth-level features are higher than those of the sixth-level features.
  • the dimension of the second-level feature is 320 ⁇ 240 ⁇ 16
  • the dimension of the third-level feature is 160 ⁇ 120 ⁇ 24
  • the dimension of the fourth-level feature is 80 ⁇ 60 ⁇ 32
  • the dimension of the fifth-level feature is 40 ⁇ 30 ⁇ 64
  • the dimensions of the sixth-level features are 20 ⁇ 15 ⁇ 128.
  • The low-resolution information obtained after multiple downsamplings of the image data provides semantic information about the context of the target object in the entire image data; this semantic information reflects the relationship between the target object and its environment. Therefore, the first image features are helpful for target object detection. (A minimal sketch of a bottleneck residual block is given below.)
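  • The bottleneck residual blocks mentioned above are not defined in detail here; one plausible reading is a MobileNetV2-style inverted residual block, as in the PyTorch sketch below, assuming that n is the internal expansion factor. Class and argument names are illustrative.

```python
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """MobileNetV2-style inverted residual block: 1x1 expansion, 3x3 depthwise
    convolution, 1x1 projection, with a skip connection when shapes allow."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=4):
        super().__init__()
        hidden = in_ch * expansion                       # n-times channel expansion
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),     # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),        # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),    # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```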
  • Step 1022 Decode the first image feature in the decoder to obtain the second image feature.
  • the decoder may decode the code (ie, the first image feature) to output target data (ie, the second image feature).
  • The decoder 320 includes a transposed convolution layer (Transposed Convolution Layer) 321 and a sixth residual network 322, where the sixth residual network 322 includes a plurality of bottleneck residual blocks, for example, 2 layers of bottleneck residual blocks.
  • The first image features include multiple features, such as the first-level through sixth-level features.
  • At least one of these features can be selected for upsampling, combining high-level semantic information with low-level semantic information; this increases the richness of the semantic information, improves the stability and accuracy of the two-dimensional detection model, and reduces missed and false detections.
  • In this exemplary embodiment, the sixth-level features are transposed-convolved in the transposed convolution layer 321 to obtain seventh-level features; the seventh-level features obtained by this upsampling are spliced with the fifth-level features to obtain eighth-level features, so that high-level and low-level semantic information are combined; the eighth-level features are then processed in the sixth residual network 322 to obtain the second image features (a minimal sketch is given after this discussion).
  • the output of the bottleneck residual block in this layer is the input of the bottleneck residual block in the next layer.
  • the dimension of the second image feature is higher than the dimension of the sixth level feature, for example, the dimension of the second image feature is 40 ⁇ 30 ⁇ 64, and the dimension of the sixth level feature is 20 ⁇ 15 ⁇ 128.
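  • The decoder step can be sketched in PyTorch as follows, using the example dimensions from the text (sixth-level features of 20x15x128, fifth-level features of 40x30x64); the refinement stage is simplified to plain convolutions and all other details are assumptions.

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    """Upsample the sixth-level features with a transposed convolution, splice
    them with the fifth-level features, and refine to the second image features."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)    # 20x15 -> 40x30
        self.refine = nn.Sequential(                                       # stands in for the
            nn.Conv2d(64 + 64, 64, 3, padding=1), nn.ReLU(inplace=True),   # sixth residual network
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, level5, level6):
        level7 = self.up(level6)                      # seventh-level features (upsampled)
        level8 = torch.cat([level7, level5], dim=1)   # eighth-level features (spliced)
        return self.refine(level8)                    # second image features, 40x30x64
```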
  • Step 1023 Map the second image feature into the two-dimensional first pose information of the bounding box in the prediction network.
  • the first pose information is two-dimensional pose information corresponding to the bounding box.
  • The two-dimensional detection model has multiple prediction networks; each prediction network is a branch network that focuses on one item of the first pose information and can therefore be implemented with a smaller structure.
  • The prediction network 330 includes a first prediction network 331, a second prediction network 332, a third prediction network 333, and a fourth prediction network 334, each of which includes a plurality of bottleneck residual blocks, for example, 2 layers of bottleneck residual blocks.
  • The first pose information includes the center point, depth, size, and vertices.
  • The second image features are input into the first prediction network 331, the second prediction network 332, the third prediction network 333, and the fourth prediction network 334, respectively.
  • size may refer to the length, width and height of a real object.
  • The second image features are processed in the first prediction network 331 to obtain a Gaussian distribution map (center heatmap) of the bounding box, and the center point is found in the Gaussian distribution map; this center point has a depth.
  • the second image feature is processed in the second prediction network 332 to obtain the depth of the bounding box.
  • the second image feature is processed in the third prediction network 333 to obtain the scale of the bounding box.
  • In the fourth prediction network 334, the second image features are processed to obtain the offset distances (vertexes) of the vertices of the bounding box relative to the center point; based on the coordinates of the center point and the offset distances, the coordinates of the multiple vertices can be obtained (see the decoding sketch below).
  • For bounding boxes of different shapes, the number of vertices and their relative positions in the bounding box also differ. For example, if the bounding box is a cuboid, it has 8 vertices, which are the corner points of its faces; if the bounding box is a cylinder, it has 8 vertices, which are the meeting points of the circumscribed circles of the bottom and top faces, and so on.
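  • The following sketch shows one way to turn the four prediction-network outputs into the first pose information; the tensor layouts (single image, channel-first maps, 8 vertices) are assumptions for illustration.

```python
import torch

def decode_first_pose(center_heatmap, depth_map, size_map, offset_map):
    """Decode center point, depth, size, and vertices from the four heads.

    center_heatmap : (1, H, W) Gaussian distribution map of the bounding box
    depth_map      : (1, H, W) per-location depth
    size_map       : (3, H, W) per-location length, width, height
    offset_map     : (16, H, W) per-location (dx, dy) offsets of 8 vertices
    """
    h, w = center_heatmap.shape[-2:]
    idx = int(torch.argmax(center_heatmap.reshape(-1)))   # heatmap peak
    cy, cx = divmod(idx, w)                                # center point location

    depth = depth_map[0, cy, cx]                           # depth of the bounding box
    size = size_map[:, cy, cx]                             # (3,) size of the bounding box
    offsets = offset_map[:, cy, cx].reshape(8, 2)          # vertex offsets from the center

    center = torch.tensor([cx, cy], dtype=offsets.dtype)
    vertices = center + offsets                            # 2D vertex coordinates
    return center, depth, size, vertices
```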
  • the two-dimensional detection model has a small number of levels and a simple structure, uses few computing resources, and consumes low calculation time, which can ensure real-time performance.
  • the two-dimensional detection model includes two mutually independent models, that is, a two-stage network.
  • The two-dimensional detection model includes a target detection model 410 and an encoding model 420; the target detection model 410 is cascaded with the encoding model 420, that is, the output of the target detection model 410 is the input of the encoding model 420. The structure of the two-stage network is more complex, which avoids the situation where the prediction results of a small model collapse into a cluster, making the two-dimensional detection model more stable.
  • step 102 may include the following steps 1021 ′, 1022 ′ and 1023 ′.
  • Step 1021' in the target detection model, detect the two-dimensional first pose information of the part of the bounding box in the image data, and the area where the target object is located in the image data.
  • The target detection model can be one-stage or two-stage. One-stage models include SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), and so on; two-stage models include the R-CNN (Region with CNN features) series, such as R-CNN, Fast R-CNN, and Faster R-CNN.
  • the image data is input into the target detection model, and the target detection model can detect the two-dimensional first pose information of the part of the bounding box in the image data and the area where the target object is located in the image data.
  • the depth and size of the bounding box in the image data may be detected as the first pose information.
  • YOLOv5 is divided into three parts: backbone network, feature pyramid network, and branch network.
  • The backbone network refers to a convolutional neural network that aggregates image features at different granularities;
  • The feature pyramid network refers to a series of network layers that mix and combine image features and then pass them to the prediction layer; the prediction layer is generally FPN (Feature Pyramid Networks) or PANet (Path Aggregation Network);
  • The branch network predicts on the image features, generating bounding boxes and predicting the category of the target object as well as the depth and size of the target object. Therefore, the output of YOLOv5 is nc+5+3+1 (the field layout is sketched after the variable list below).
  • nc is the category number of the object.
  • the number 5 indicates that there are 5 variables, including classification confidence (c), center point (x, y) of the bounding box, width and height (w, h) of the bounding box, a total of 5 variables.
  • the number 3 indicates that there are 3 variables, including the size (length, width, and height) of the target object in the 3-dimensional space.
  • the number 1 indicates that there is one variable, that is, the depth of the target object in the camera coordinate system, that is, the distance of the object from the camera when shooting.
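  • The sketch below splits one such prediction vector into the fields described above; the exact ordering of the fields is an assumption for illustration.

```python
def split_yolo_output(pred, nc):
    """Split a prediction vector of length nc + 5 + 3 + 1 into its parts.

    pred : 1-D array-like of length nc + 5 + 3 + 1
    nc   : number of object categories
    """
    cls_scores = pred[:nc]                        # nc category scores
    conf = pred[nc]                               # classification confidence c
    cx, cy = pred[nc + 1], pred[nc + 2]           # 2D center point of the bounding box
    w, h = pred[nc + 3], pred[nc + 4]             # 2D width and height of the bounding box
    length, width, height = pred[nc + 5:nc + 8]   # 3D size of the target object
    depth = pred[nc + 8]                          # depth in the camera coordinate system
    return cls_scores, conf, (cx, cy, w, h), (length, width, height), depth
```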
  • Step 1022' extract data in the region from the image data as region data.
  • The region where the target object is located in the image data 430 is a two-dimensional bounding box; the image data is cropped based on this region, and the data (pixels) inside the region are extracted, thereby scaling down the image data; the result is recorded as region data 431.
  • Step 1023' Encode the area data in the encoding model to obtain the first two-dimensional pose information of the part of the bounding box.
  • the region data 431 is input into the encoding model 420 for encoding, and the remaining two-dimensional first pose information of the bounding box can be output.
  • the area data is encoded in the encoding model to obtain the center point and vertices of the bounding box as the first pose information.
  • the combination of a part of the first pose information detected by the target detection module and another part of the first pose information generated by the encoding module can be identified as the first pose information detected by the two-dimensional detection model.
  • the encoding model usually chooses a model with a simple structure and less calculation.
  • EfficientNet-lite0 includes multiple 1×1 convolutional layers, multiple depthwise separable convolutional layers, multiple residual connection layers, and multiple fully connected layers; the last fully connected layer estimates the center point and vertices of the target object.
  • An appropriate two-dimensional detection model can be selected according to the requirements of the business scenario. For scenarios where users upload video data, the one-stage network is faster, while the two-stage network is more accurate.
  • In addition, the two-stage network can add tracking to the 2D detector (tracking, for example, infers the likely position in the next frame from the position in the current frame), so that it is not necessary to run detection on every frame of image data, making the speed faster and the accuracy higher.
  • Step 103 may include the following steps 1031 to 1035.
  • Step 1031 query the control points in the world coordinate system and the camera coordinate system respectively.
  • The EPnP algorithm introduces control points; any reference point (such as a vertex or the center point) can be expressed as a linear combination of four control points.
  • The control points can in principle be selected arbitrarily, but in this embodiment, points that work well can be selected as control points through experiments in advance, their coordinates recorded, and the control points loaded as hyperparameters at run time.
  • Step 1032 express the center point and vertices as the weighted sum of the control points in the world coordinate system and the camera coordinate system respectively.
  • Let the superscript w denote the world coordinate system and the superscript c denote the camera coordinate system.
  • A reference point in the world coordinate system can be represented by the four control points as $p_i^w = \sum_{j=1}^{4} a_{ij} c_j^w$;
  • $a_{ij}$ are the homogeneous barycentric coordinates configured for the control points, also called weights, and are hyperparameters.
  • Likewise, a reference point in the camera coordinate system can be represented by the four control points as $p_i^c = \sum_{j=1}^{4} a_{ij} c_j^c$;
  • the weights $a_{ij}$ in the world coordinate system are the same as those in the camera coordinate system, and are hyperparameters.
  • Step 1033 construct the constraint relationship between the depth, the center point and the vertices between the world coordinate system and the camera coordinate system.
  • the constraint relationship mentioned here may be the constraint relationship between the depth, center point and vertex in the world coordinate system, and the depth, center point and vertex in the camera coordinate system.
  • the constraint relationship between the reference point coordinates (such as vertex, center point) in the world coordinate system and the reference point (such as vertex, center point) in the camera coordinate system is obtained.
  • $w_i$ is the depth of the reference point (vertex or center point);
  • $u_i$ and $v_i$ are the x- and y-coordinates of the reference point (vertex or center point) in the camera (image) coordinate system;
  • A is a hyperparameter;
  • $f_u$, $f_v$, $u_c$, $v_c$ are internal parameters of the camera;
  • $x_i^w$, $y_i^w$, $z_i^w$ are the x-, y-, and z-coordinates of the reference point (vertex or center point) in the world coordinate system; in total there are 12 unknown variables, which are substituted in and solved for.
  • Step 1034 concatenating constraint relations to obtain a linear equation.
  • Concatenating the constraint relationships means stacking the constraint relationships of the nine reference points into a matrix, row by row.
  • Step 1035 solving the linear equation to map the vertices to the three-dimensional space.
  • With n a positive integer, such as 9, the constraints of the n reference points can be concatenated to obtain the homogeneous linear equation $Mx = 0$.
  • x represents the coordinates (X, Y, Z) of the control point in the camera coordinate system, which is a 12-dimensional vector, and the four control points have a total of 12 unknown variables, and M is a 2n ⁇ 12 matrix.
  • x belongs to the right null space of M; $v_i$ denotes a right singular vector of M whose corresponding singular value is 0, and the solution can be written as the linear combination $x = \sum_{i=1}^{N} \beta_i v_i$, obtained from the null space of $M^{T}M$.
  • The solution method is to compute the eigenvalues and eigenvectors of $M^{T}M$; the eigenvectors with eigenvalue 0 are the $v_i$.
  • The size of $M^{T}M$ is 12×12.
  • The complexity of computing $M^{T}M$ is O(n); therefore, the overall complexity of the algorithm is O(n).
  • N is related to the number of reference points, the control points, the camera focal length, and the noise.
  • The $\beta_i$ are the coefficients of the linear combination; once N is set, they can be solved directly by optimization or by an approximate solution to obtain a definite solution (a minimal numerical sketch of this linear step is given below).
  • step 104 may include the following steps 1041-1047.
  • Step 1041 in the world coordinate system and the camera coordinate system, calculate a new center point based on the vertices.
  • the coordinate system is generally established based on the center point, then the center point is the origin of the bounding box.
  • For example, the average of the vertices in the camera coordinate system is calculated as the new center point, recorded as $\bar{p}^{\,c} = \frac{1}{N}\sum_{i=1}^{N} p_i^{c}$ (and analogously $\bar{p}^{\,w}$ in the world coordinate system), where N is the number of vertices.
  • Step 1042: In the world coordinate system and the camera coordinate system, subtract the new center point from the vertices.
  • Step 1043: After subtracting the new center point from the vertices, calculate the self-conjugate matrix.
  • The self-conjugate matrix H can then be calculated as the product of the transpose of the centered vertices in the camera coordinate system and the centered vertices in the world coordinate system, i.e. $H = \sum_{i=1}^{N} \tilde{p}_i^{\,c} (\tilde{p}_i^{\,w})^{T}$, where N is the number of vertices, $\tilde{p}_i^{\,c}$ is a vertex after removing the center in the camera coordinate system, $\tilde{p}_i^{\,w}$ is a vertex after removing the center in the world coordinate system, and T denotes the transpose.
  • Step 1044 performing singular value decomposition on the self-conjugate matrix to obtain the product of the first orthogonal matrix, the diagonal matrix, and the transposed matrix of the second orthogonal matrix.
  • Since the coordinates of the vertices are known in both coordinate systems, that is, in the world coordinate system and in the camera coordinate system, the pose transformation between the two coordinate systems can be obtained using the idea of SVD (Singular Value Decomposition); the SVD of the self-conjugate matrix H can be expressed as $H = U \Sigma V^{T}$;
  • U is the first orthogonal matrix, $\Sigma$ is the diagonal matrix, V is the second orthogonal matrix, and T denotes the transpose.
  • Step 1045: Calculate the product of the second orthogonal matrix and the transpose of the first orthogonal matrix, $R = V U^{T}$, and use it as the direction of the target object in the world coordinate system;
  • R is the direction of the target object in the world coordinate system.
  • Step 1046 Rotate the new center point in the world coordinate system according to the direction to obtain the projection point.
  • Step 1047 Subtract the projection point from the new center point in the camera coordinate system to obtain the position of the target object in the world coordinate system.
  • That is, $t = \bar{p}^{\,c} - R\,\bar{p}^{\,w}$, where t is the position of the target object in the world coordinate system, R is the direction of the target object in the world coordinate system, $\bar{p}^{\,w}$ is the new center point in the world coordinate system, and $\bar{p}^{\,c}$ is the new center point in the camera coordinate system (a compact sketch of steps 1041 to 1047 is given below).
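  • The sketch below implements steps 1041 to 1047 with the standard Kabsch/SVD alignment; it recovers R and t such that the camera-frame vertices are approximately R times the world-frame vertices plus t. The exact ordering conventions used in the embodiment may differ, so this is an illustration rather than the definitive formulation.

```python
import numpy as np

def object_pose_from_vertices(verts_world, verts_cam):
    """Recover direction R and position t from corresponding (N, 3) vertex sets."""
    c_w = verts_world.mean(axis=0)          # new center point, world frame
    c_c = verts_cam.mean(axis=0)            # new center point, camera frame
    pw = verts_world - c_w                  # vertices with the new center removed
    pc = verts_cam - c_c
    H = pw.T @ pc                           # self-conjugate (cross-covariance) matrix
    U, _, Vt = np.linalg.svd(H)             # H = U diag(S) V^T
    R = Vt.T @ U.T                          # direction (rotation)
    if np.linalg.det(R) < 0:                # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_c - R @ c_w                       # position: center minus rotated projection point
    return R, t
```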
  • Fig. 5 is a flow chart of a method for detecting the attitude of an object provided in another embodiment of the present application. This embodiment is based on the foregoing embodiments and adds special effect processing operations. The method includes the following steps:
  • Step 501 acquire image data.
  • Step 502 Input the image data into the 2D detection model, and detect the 2D first pose information when the 3D bounding box is projected onto the image data.
  • the bounding box is used to detect the target object.
  • Step 503 mapping the first pose information into 3D second pose information.
  • Step 504 Detect third posture information of the target object according to the second posture information.
  • Step 505. Determine the three-dimensional material that is suitable for the target object.
  • A server may pre-collect three-dimensional materials adapted to the type of the target object according to the requirements of the business scenario.
  • The mobile terminal can download these materials from the server in advance according to certain rules (such as selecting basic materials or popular materials), or download specified materials from the server according to an operation triggered by the user at run time; alternatively, the user can locally select a three-dimensional material suitable for the target object on the mobile terminal, or part of the data corresponding to the target object can be extracted and converted to three dimensions to serve as a material, and so on.
  • the material may be text data, image data, animation data, and the like.
  • For example, if the target object 201 is a beverage of a certain brand, the LOGO 203 of that brand can be used as a material.
  • If the target object is a ball (such as a football, basketball, volleyball, badminton shuttlecock, or table tennis ball), a special-effect animation adapted to the ball (such as feathers, lightning, or flames) can be used as a material.
  • If the target object is a container for holding water, aquatic animals and plants (such as water plants, fish, and shrimp) may be used as materials.
  • Step 506 configure fourth pose information for the material.
  • the fourth posture information is adapted to the first posture information and/or the third posture information.
  • Step 507. Display the material in the image data according to the fourth posture information.
  • A special effect generator can be preset; the first pose information and/or the third pose information is input into the special effect generator to generate the fourth pose information for the material, and the material is rendered in the image data according to the fourth pose information, so that the material conforms to the situation of the target object and the resulting special effect looks more natural.
  • a part of the first pose information includes the size of the bounding box, and the third pose information includes the direction and position of the target object.
  • the position of the target object can be shifted outward by a specified distance, for example, the front of the target object is used as a reference plane to shift 10 cm, and the shifted position is used as the position of the material.
  • the fourth pose information may include the location of the material.
  • Reduce the size of the bounding box to a specified ratio (such as 10%), and use the reduced size as the size of the material.
  • the fourth pose information may include the size of the material.
  • the fourth gesture information may include the orientation of the material.
  • the above fourth posture information is only an example.
  • Other fourth pose information can be set according to the actual situation; for example, the size of the bounding box can be enlarged by a specified ratio (such as 1.5 times) and the enlarged size used as the size of the material, or the direction of the target object can be rotated by a specified angle (such as 90°) and the rotated direction used as the direction of the material, and so on (see the sketch below).
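  • The following sketch derives fourth pose information for a material from the detected pose and bounding-box size; the offset distance and scaling ratio follow the examples above, while the choice of offset axis and the matrix representation of the direction are assumptions.

```python
import numpy as np

def make_material_pose(position, direction, box_size, offset_m=0.10, size_ratio=0.10):
    """Configure position, direction, and size for the material.

    position  : (3,) position of the target object in the world frame
    direction : (3, 3) rotation matrix giving the target object's direction
    box_size  : (3,) length, width, height of the bounding box
    """
    # shift the position outward along the object's (assumed) forward axis
    material_position = position + direction[:, 0] * offset_m
    # shrink the bounding-box size to the specified ratio
    material_size = np.asarray(box_size) * size_ratio
    # reuse the target object's direction for the material
    material_direction = direction
    return material_position, material_direction, material_size
```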
  • those skilled in the art may also use other fourth attitude information according to actual needs.
  • After the material is displayed in the image data, the adapted data, such as a short video or live broadcast data, can be published.
  • The embodiment above is described as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present application are not limited to the described order of actions, because according to the embodiments some steps can be performed in other orders or simultaneously.
  • Fig. 6 is a structural block diagram of an object posture detection device provided in another embodiment of the present application, and the object posture detection device may include the following modules:
  • the image data acquisition module 601 is configured to acquire image data, and the image data has a target object;
  • the first posture information detection module 602 is configured to input the image data into the two-dimensional detection model, and detect the two-dimensional first pose information when the three-dimensional bounding box is projected onto the image data, the bounding box being used to detect the target object;
  • the second posture information mapping module 603 is configured to map the first posture information into three-dimensional second posture information
  • the third posture information detection module 604 is configured to detect third posture information of the target object according to the second posture information.
  • the two-dimensional detection model includes an encoder, a decoder, and a prediction network
  • the first posture information detection module 602 includes:
  • An encoding module configured to encode the image data in the encoder to obtain a first image feature
  • a decoding module configured to decode the first image feature in the decoder to obtain a second image feature
  • the mapping module is configured to map the second image feature into two-dimensional first pose information of a bounding box in the prediction network.
  • the encoder includes a convolutional layer, a first residual network, a second residual network, a third residual network, a fourth residual network, and a fifth residual network, the The first residual network, the second residual network, the third residual network, the fourth residual network and the fifth residual network respectively include at least one bottleneck residual block;
  • the encoding module is also set to:
  • the fifth-level features are processed in the fifth residual network to obtain sixth-level features.
  • the number of the bottleneck residual blocks in the first residual network is smaller than the number of the bottleneck residual blocks in the second residual network, the number of the bottleneck residual blocks in the second residual network is smaller than the number of the bottleneck residual blocks in the third residual network, the number of the bottleneck residual blocks in the third residual network is smaller than the number of the bottleneck residual blocks in the fourth residual network, and the number of the bottleneck residual blocks in the fourth residual network is equal to the number of the bottleneck residual blocks in the fifth residual network;
  • the dimensions of the second-level features are higher than the dimensions of the third-level features, the dimensions of the third-level features are higher than the dimensions of the fourth-level features, and the dimensions of the fourth-level features are higher than the dimensions of the The dimension of the fifth-level feature, the dimension of the fifth-level feature is higher than the dimension of the sixth-level feature.
  • the decoder includes a transposed convolutional layer, a sixth residual network, and the sixth residual network includes a plurality of bottleneck residual blocks;
  • the decoding module is also set to:
  • the eighth-level features are processed in the sixth residual network to obtain second image features.
  • the dimension of the second image feature is higher than the dimension of the sixth level feature.
  • the prediction network includes a first prediction network, a second prediction network, a third prediction network, and a fourth prediction network, the first prediction network, the second prediction network, the The third prediction network and the fourth prediction network respectively include a plurality of bottleneck residual blocks;
  • the mapping module is also set to:
  • the second image feature is processed in the fourth prediction network to obtain the offset distance of the vertices in the bounding box relative to the central point.
  • the two-dimensional detection model includes a target detection model and an encoding model, and the target detection model and the encoding model are cascaded;
  • the first posture information detection module 602 includes:
  • the target detection module is configured to, in the target detection model, detect the two-dimensional first pose information of the part of the bounding box in the image data, and the area where the target object is located in the image data;
  • an area data extraction module configured to extract data in the area from the image data as area data
  • the region data encoding module is configured to encode the region data in the encoding model to obtain the two-dimensional first pose information of the part of the bounding box.
  • the target detection module is also set to:
  • the regional data coding module is also set to:
  • the region data is coded to obtain the center point and vertices of the bounding box.
  • the first pose information includes a center point, a vertex, and a depth
  • the second posture information mapping module 603 includes:
  • the control point query module is set to query the control points in the world coordinate system and the camera coordinate system respectively;
  • the point representation module is configured to represent the center point and the vertex as the weighted sum of the control points in the world coordinate system and the camera coordinate system respectively;
  • a constraint relationship building module configured to build the constraint relationship between the depth, the center point, and the vertex between the world coordinate system and the camera coordinate system;
  • a linear equation generation module configured to connect the constraints in series to obtain a linear equation
  • a linear equation solving module configured to solve the linear equation to map the vertices to a three-dimensional space.
  • the third posture information detection module 604 includes:
  • the center point calculation module is set to calculate a new center point based on the vertices in the world coordinate system and the camera coordinate system respectively;
  • the central point removal module is configured to remove the new central point from the vertex in the world coordinate system and the camera coordinate system respectively;
  • a self-conjugate matrix calculation module configured to calculate a self-conjugate matrix, the self-conjugate matrix being the product between the vertex in the camera coordinate system and the transpose matrix of the vertex in the world coordinate system;
  • a singular value decomposition module, configured to perform singular value decomposition on the self-conjugate matrix to obtain the product of a first orthogonal matrix, a diagonal matrix, and the transpose of a second orthogonal matrix;
  • a direction calculation module configured to calculate the product of the second orthogonal matrix and the transpose matrix of the first orthogonal matrix, as the direction of the target object in the world coordinate system;
  • a projection point calculation module configured to rotate the new center point in the world coordinate system according to the direction to obtain a projection point
  • the position calculation module is configured to subtract the projected point from the new center point in the camera coordinate system to obtain the position of the target object in the world coordinate system.
  • the apparatus also includes:
  • a material determining module configured to determine a three-dimensional material suitable for the target object
  • the fourth posture information configuration module is configured to configure fourth posture information for the material, and the fourth posture information is posture information adapted to the first posture information and the third posture information;
  • a material display module configured to display the material in the image data according to the fourth pose information.
  • the first pose information includes the size of the bounding box, and the third pose information includes the direction and position of the target object;
  • the fourth posture information configuration module includes:
  • a position offset module configured to offset the position of the target object by a specified distance, and use the offset position as the position of the material
  • a size reduction module configured to reduce the size of the bounding box to a specified ratio, and use the reduced size as the size of the material
  • the direction configuration module is configured to configure the direction of the target object as the direction of the material.
  • the object posture detection device provided in the embodiment of the present application can execute the object posture detection method provided in any embodiment of the present application, and has corresponding functional modules and beneficial effects for executing the method.
  • FIG. 7 is a schematic structural diagram of a computer device provided by another embodiment of the present application.
  • FIG. 7 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present application.
  • the computer device 12 shown in FIG. 7 is only one example.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of the computer device 12 may include one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components (including the system memory 28 and the processing unit 16).
  • System memory 28 may also be referred to as memory.
  • Bus 18 represents one or more of a variety of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures.
  • Such bus structures include, for example, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12 and include both volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory, RAM) 30 and/or cache memory 32 .
  • Computer device 12 may include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • For example, the storage system 34 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 7). Although not shown in FIG. 7, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a floppy disk) and an optical disc drive for reading and writing a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media) may also be provided.
  • each drive may be connected to bus 18 via one or more data media interfaces.
  • System memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present application.
  • A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the system memory 28; such program modules 42 include an operating system, one or more application programs, other program modules, and program data, and each or some combination of these examples may include an implementation of a network environment.
  • the program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card or modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 22.
  • The computer device 12 can also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, through the network adapter 20. As shown in FIG. 7, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18.
  • Other hardware and/or software modules can also be used in conjunction with computer device 12, and other hardware and/or software modules include: microcode, device drivers, redundant processing units, external disk drive arrays, disk arrays (Redundant Arrays of Independent Disks, RAID ) systems, tape drives, and data backup storage systems.
  • the processing unit 16 executes a variety of functional applications and data processing by running the programs stored in the system memory 28 , such as implementing the detection method of the object posture provided by the embodiment of the present application.
  • Another embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the multiple processes of the above object pose detection method are implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.
  • The computer-readable storage medium may include, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • The computer-readable storage medium is, for example, an electrical connection having one or more conductors, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide an object pose detection method and apparatus, a computer device and a storage medium. The method includes: acquiring image data containing a target object; inputting the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object; mapping the first pose information to three-dimensional second pose information; and detecting third pose information of the target object according to the second pose information.

Description

Object pose detection method and apparatus, computer device and storage medium. Technical Field
The embodiments of the present application relate to the technical field of computer vision, for example, to an object pose detection method and apparatus, a computer device and a storage medium.
Background
In business scenarios such as short video, livestreaming, autonomous driving, AR (Augmented Reality) and robotics, 3D (3-dimension) object detection is usually performed: the information of a target object in three-dimensional space is detected so that business processing such as adding special effects, route planning and motion trajectory planning can be carried out.
Summary
Embodiments of the present application provide an object pose detection method and apparatus, a computer device and a storage medium.
In a first aspect, an embodiment of the present application provides an object pose detection method, including:
acquiring image data, the image data containing a target object;
inputting the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object;
mapping the first pose information to three-dimensional second pose information;
detecting third pose information of the target object according to the second pose information.
In a second aspect, an embodiment of the present application further provides an object pose detection apparatus, including:
an image data acquisition module configured to acquire image data, the image data containing a target object;
a first pose information detection module configured to input the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object;
a second pose information mapping module configured to map the first pose information to three-dimensional second pose information;
a third pose information detection module configured to detect third pose information of the target object according to the second pose information.
In a third aspect, an embodiment of the present application further provides a computer device, including:
at least one processor; and
a memory configured to store at least one program,
wherein the at least one processor is configured to execute the at least one program to implement the object pose detection method according to the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the object pose detection method according to the first aspect.
Brief Description of the Drawings
FIG. 1 is a flowchart of an object pose detection method provided by an embodiment of the present application;
FIG. 2 is an example diagram of detecting the pose of a target object provided by an embodiment of the present application;
FIG. 3 is an example diagram of a one-stage network provided by another embodiment of the present application;
FIG. 4 is an example diagram of a two-stage network provided by another embodiment of the present application;
FIG. 5 is a flowchart of an object pose detection method provided by another embodiment of the present application;
FIG. 6 is a structural schematic diagram of an object pose detection apparatus provided by another embodiment of the present application;
FIG. 7 is a structural schematic diagram of a computer device provided by another embodiment of the present application;
FIG. 8 is a flowchart of detecting two-dimensional first pose information of a three-dimensional bounding box projected onto image data in an exemplary embodiment of the present application;
FIG. 9 is a flowchart of detecting two-dimensional first pose information of a three-dimensional bounding box projected onto image data in another exemplary embodiment of the present application;
FIG. 10 is a flowchart of mapping the first pose information to three-dimensional second pose information in another exemplary embodiment of the present application;
FIG. 11 is a flowchart of detecting third pose information of the target object according to the second pose information in another exemplary embodiment of the present application.
Detailed Description
In the related art, 3D object detection methods can be divided into the following four categories according to their input form:
First category: monocular image, i.e., a single frame of image data captured by a single camera.
Second category: binocular images, i.e., two frames of image data captured by a binocular camera from two directions.
Third category: point cloud, i.e., data of spatial points collected by a LiDAR.
Fourth category: point cloud combined with a monocular image, i.e., a frame of image data captured by a single camera together with the spatial point data collected by a LiDAR.
For mobile terminals, binocular cameras and LiDAR have relatively complex structures, are difficult to port to mobile terminals and are costly, so monocular images are usually used.
In the related art, most monocular 3D object detection methods are improvements based on CenterNet and estimate the object information end to end directly with the network. Such methods, however, are sensitive to the rotation estimate: even a rotation error as small as 0.01 causes a relatively large deviation in the object information, resulting in poor stability and accuracy.
To deal with the above situation, embodiments of the present application disclose an object pose detection method and apparatus, a computer device and a storage medium, which improve the stability and accuracy of 3D object detection.
The present application is described below with reference to the drawings and embodiments.
One Embodiment
FIG. 1 is a flowchart of an object pose detection method provided by an embodiment of the present application. In this embodiment, during object detection, the 2D (2-dimension) pose of the bounding box is mapped to a 3D pose to detect the 3D pose of the object. The object pose detection method described in this embodiment may be executed by an object pose detection apparatus, which may be implemented by software and/or hardware and may be configured in a computer device serving as a mobile terminal. The computer device includes, for example, a mobile phone, a tablet computer or a smart wearable device; the smart wearable device includes, for example, smart glasses, a smart watch, and so on.
The embodiment of the present application includes the following steps:
Step 101: acquire image data.
An operating system such as Android, iOS or HarmonyOS may be installed on the computer device, and the user may install the applications the user needs on the operating system, for example, a livestreaming application, a short-video application, a beauty application, a conferencing application, and so on.
The computer device may be equipped with one or more cameras. These cameras may be mounted on the front of the computer device (front cameras) or on the back of the computer device (rear cameras).
These applications may use image data from the local gallery of the computer device or from a network gallery as the image data to be used, or may call the camera to capture image data, and so on.
The image data contains an object serving as the detection target, denoted as the target object. The target object may be set according to the requirements of the business scenario, for example, the cup 201 shown in FIG. 2, a notebook, a pen, a display screen, and so on.
Exemplarily, these applications call the camera to capture video data facing the target object; the video data contains multiple frames of image data, and the target object is tracked across the multiple frames of image data by methods such as Kalman filtering and optical flow.
Step 102: input the image data into a two-dimensional detection model and detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data.
The target object is located in the real three-dimensional space, and a three-dimensional bounding box may be used to describe the pose of the target object in the three-dimensional space. As shown in FIG. 2, the shape of the three-dimensional bounding box may include a cuboid 202, a cylinder, a sphere, and so on; the three-dimensional bounding box is a box circumscribing the target object 201 and can be used to detect the target object 201.
In the image data, the target object is represented by two-dimensional pixels, and the three-dimensional bounding box is also recorded in the image data by projection following the target object, i.e., the three-dimensional bounding box is represented by two-dimensional pixels. At this point, the pose presented by the three-dimensional bounding box in the two-dimensional image data can be calculated and is denoted as first pose information.
In an embodiment, a model for detecting the first pose information of the bounding box of the target object, denoted as a two-dimensional detection model, may be trained in advance, for example, mobilenetV2, shufflenetV2, and so on.
For video data, all frames of image data may be input into the two-dimensional detection model for detection, or image data may be input into the two-dimensional detection model at certain time intervals, and the prediction results within the intervals may be replaced by tracking.
Replacing the prediction results within the intervals by tracking means, for example, that in order to obtain a result for every frame, every frame could be passed through the model; with this setting every frame incurs the inference time and the latency is severe. Alternatively, not every frame has to pass through the model: for example, frame 0 and frame 5 pass through the model, yet a result is still obtained for every frame, because the result for frame 1 can be obtained by tracking the result of frame 0.
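The frame-skipping strategy described above can be sketched as follows (illustrative Python only; `detect_pose` and `track_forward` are hypothetical placeholders for the two-dimensional detection model and the Kalman-filter / optical-flow tracker, not functions defined in this application):

```python
# Illustrative sketch: run the 2D detection model only every MODEL_INTERVAL-th
# frame and propagate its result with a lightweight tracker in between.
MODEL_INTERVAL = 5  # e.g. the model runs on frames 0, 5, 10, ...

def poses_for_video(frames, detect_pose, track_forward):
    results = []
    last = None
    for idx, frame in enumerate(frames):
        if idx % MODEL_INTERVAL == 0 or last is None:
            last = detect_pose(frame)            # full model inference
        else:
            last = track_forward(last, frame)    # e.g. Kalman filter / optical flow update
        results.append(last)
    return results
```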
Step 103: map the first pose information to three-dimensional second pose information.
For example, given the first pose information in the camera coordinate system, solving for the second pose information in the world coordinate system can be regarded as a PnP (Perspective-n-Point) problem, and the portion of the first pose information of the target object in the camera coordinate system is mapped to three-dimensional second pose information in the world coordinate system by pose estimation algorithms such as PnP, DLT (Direct Linear Transform), EPnP (Efficient PnP) or UPnP.
In an embodiment of the present application, the first pose information includes a center point, vertices and a depth, and the vertices of the target object in the image coordinate system may be mapped to vertices in the world coordinate system by the pose estimation algorithm.
For example, the depth may refer to the distance between the object and the camera when the camera captures the picture.
For example, if the object detection box is a cuboid, the vertices may refer to the eight vertices of the cuboid.
Taking the EPnP algorithm as an example of the pose estimation algorithm, EPnP handles well the problem of solving the camera pose from 3D-2D point correspondences. In this embodiment, the EPnP algorithm is used to map the 2D points (e.g., vertices) in the camera coordinate system to 3D points (e.g., vertices) in the camera coordinate system; the depth of the center point predicted by the model is divided by the depth estimated by EPnP to obtain a ratio, each vertex is multiplied by this ratio to obtain 3D points (e.g., vertices) in the camera coordinate system with the true depth, and these 3D points (e.g., vertices) are then multiplied by the camera extrinsics to obtain 3D points (e.g., vertices) in the world coordinate system.
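A minimal sketch of this mapping, using OpenCV's EPnP solver for the 3D-2D correspondence step; `box_corners_obj` (a canonical box built from the predicted size), `corners_2d`, `center_depth`, the intrinsic matrix `K` and the extrinsic matrix `T_cam_to_world` are illustrative names and assumptions, not identifiers from this application:

```python
import cv2
import numpy as np

def lift_box_to_world(box_corners_obj, corners_2d, center_depth, K, T_cam_to_world):
    # EPnP: camera pose from the 3D-2D correspondences of the box corners
    ok, rvec, tvec = cv2.solvePnP(box_corners_obj.astype(np.float32),
                                  corners_2d.astype(np.float32),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    corners_cam = (R @ box_corners_obj.T + tvec).T       # camera-frame corners at EPnP scale
    epnp_center_depth = corners_cam.mean(axis=0)[2]      # depth of the box center as seen by EPnP
    corners_cam *= center_depth / epnp_center_depth      # rescale every vertex by the depth ratio
    corners_h = np.hstack([corners_cam, np.ones((len(corners_cam), 1))])
    return (T_cam_to_world @ corners_h.T).T[:, :3]       # camera extrinsics -> world coordinates
```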
Step 104: detect third pose information of the target object according to the second pose information.
If the second pose information of the bounding box in the world coordinate system is determined, the pose of the target object inside the bounding box in the world coordinate system can be detected and is denoted as three-dimensional third pose information.
In addition, the second pose information of the bounding box in the world coordinate system includes multiple vertices, and the position and orientation of the target object in the world coordinate system can be calculated from the multiple vertices as the third pose information.
The first pose information is two-dimensional pose information, corresponds to the bounding box, and is in the image coordinate system.
The second pose information is three-dimensional pose information, corresponds to the bounding box, and is in the camera coordinate system.
The third pose information is three-dimensional pose information, corresponds to the target object, and is in the world coordinate system.
In this embodiment, the 3D bounding box is mapped onto the 2D image, and the bounding box of the 3D object is recovered by back-mapping from the bounding box in the 2D image.
In this embodiment, image data containing a target object is acquired; the image data is input into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object; the first pose information is mapped to three-dimensional second pose information; and third pose information of the target object is detected according to the second pose information. By predicting the projection mapping of the bounding box on the image data and recovering the 3D pose information from this projection mapping, jitter caused by small errors in the predicted rotation angle is avoided. Compared with directly predicting 3D pose information, this embodiment achieves higher accuracy and a more stable effect.
In an exemplary embodiment of the present application, the two-dimensional detection model is a single, complete network, i.e., a one-stage network. As shown in FIG. 3, the two-dimensional detection model includes an encoder 310, a decoder 320 and prediction networks 330. In this exemplary embodiment, as shown in FIG. 8, step 102 may include the following step 1021, step 1022 and step 1023.
Step 1021: encode the image data in the encoder to obtain a first image feature.
For example, the encoder may read the entire source data (i.e., the image data) into a fixed-length code (i.e., the first image feature).
Exemplarily, as shown in FIG. 3, the encoder 310 includes a convolutional layer (Conv Layer) 311, a first residual network 312, a second residual network 313, a third residual network 314, a fourth residual network 315 and a fifth residual network 316, wherein each of the first residual network 312, the second residual network 313, the third residual network 314, the fourth residual network 315 and the fifth residual network 316 includes one or more bottleneck residual blocks. The number of output channels of a bottleneck residual block is n times its number of input channels, where n is a positive integer such as 4.
In this example, the image data is convolved in the convolutional layer 311 to obtain a first-level feature; the first-level feature is processed in the first residual network 312 to obtain a second-level feature; the second-level feature is processed in the second residual network 313 to obtain a third-level feature; the third-level feature is processed in the third residual network 314 to obtain a fourth-level feature; the fourth-level feature is processed in the fourth residual network 315 to obtain a fifth-level feature; and the fifth-level feature is processed in the fifth residual network 316 to obtain a sixth-level feature.
In the first residual network 312, the second residual network 313, the third residual network 314, the fourth residual network 315 and the fifth residual network 316, the output of the bottleneck residual block at the current layer is the input of the bottleneck residual block at the next layer.
In this example, the first-level feature, the second-level feature, the third-level feature, the fourth-level feature, the fifth-level feature and the sixth-level feature are all first image features.
In an embodiment, as shown in FIG. 3, the number of bottleneck residual blocks in the first residual network 312 is smaller than that in the second residual network 313; for example, the first residual network 312 has 1 layer of bottleneck residual blocks and the second residual network 313 has 2. The number of bottleneck residual blocks in the second residual network 313 is smaller than that in the third residual network 314, the number in the third residual network 314 is smaller than that in the fourth residual network 315, and the number in the fourth residual network 315 equals that in the fifth residual network 316. For example, the second residual network 313 has 2 layers of bottleneck residual blocks, the third residual network 314 has 3, the fourth residual network 315 has 4, and the fifth residual network 316 has 4.
In addition, the dimension of the second-level feature is higher than that of the third-level feature, the dimension of the third-level feature is higher than that of the fourth-level feature, the dimension of the fourth-level feature is higher than that of the fifth-level feature, and the dimension of the fifth-level feature is higher than that of the sixth-level feature. For example, the dimension of the second-level feature is 320×240×16, the third-level feature 160×120×24, the fourth-level feature 80×60×32, the fifth-level feature 40×30×64, and the sixth-level feature 20×15×128.
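A sketch of such an encoder in PyTorch, assuming the block counts (1/2/3/4/4) and example channel sizes (16/24/32/64/128) given above; the exact block design (here a depthwise-separable bottleneck with an expansion factor of 4) is an illustrative choice rather than the application's definition:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, c_in, c_out, stride=1, expansion=4):
        super().__init__()
        mid = c_in * expansion  # inner channels = n times the input channels
        self.block = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))
        self.use_skip = stride == 1 and c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y

def stage(c_in, c_out, n_blocks, stride=2):
    layers = [Bottleneck(c_in, c_out, stride)]
    layers += [Bottleneck(c_out, c_out) for _ in range(n_blocks - 1)]
    return nn.Sequential(*layers)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 16, 3, stride=2, padding=1)  # first-level feature
        self.s1 = stage(16, 16, 1, stride=1)  # second-level, e.g. 320x240x16
        self.s2 = stage(16, 24, 2)            # third-level,  e.g. 160x120x24
        self.s3 = stage(24, 32, 3)            # fourth-level, e.g. 80x60x32
        self.s4 = stage(32, 64, 4)            # fifth-level,  e.g. 40x30x64
        self.s5 = stage(64, 128, 4)           # sixth-level,  e.g. 20x15x128

    def forward(self, x):
        f1 = self.stem(x)
        f2 = self.s1(f1); f3 = self.s2(f2); f4 = self.s3(f3)
        f5 = self.s4(f4); f6 = self.s5(f5)
        return f2, f3, f4, f5, f6
```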
The low-resolution information obtained after the image data has been downsampled multiple times provides contextual semantic information of the target object within the whole image data; this semantic information reflects features of the relationship between the target object and its environment, so the first image feature helps to detect the target object.
Step 1022: decode the first image feature in the decoder to obtain a second image feature.
For example, the decoder may decode the code (i.e., the first image feature) to output the target data (i.e., the second image feature).
Exemplarily, as shown in FIG. 3, the decoder 320 includes a transposed convolution layer (Transposed Convolution Layer) 321 and a sixth residual network 322, where the sixth residual network 322 includes multiple bottleneck residual blocks, for example, 2 layers of bottleneck residual blocks.
If the first image feature includes multiple features, such as the first-level to sixth-level features, at least one of them may be selected for upsampling, and the high-level semantic information is combined with the low-level semantic information, which enriches the semantic information, increases the stability and accuracy of the two-dimensional detection model, and reduces missed and false detections.
In this example, as shown in FIG. 3, the sixth-level feature is convolved in the transposed convolution layer 321 to obtain a seventh-level feature; the upsampled fifth-level feature and the seventh-level feature are concatenated into an eighth-level feature, combining the high-level semantic information with the low-level semantic information; and the eighth-level feature is processed in the sixth residual network 322 to obtain the second image feature.
In the sixth residual network 322, the output of the bottleneck residual block at the current layer is the input of the bottleneck residual block at the next layer.
In an embodiment, the dimension of the second image feature is higher than that of the sixth-level feature; for example, the dimension of the second image feature is 40×30×64 and that of the sixth-level feature is 20×15×128.
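A matching decoder sketch (PyTorch), reusing the `Bottleneck` block from the encoder sketch above; channel sizes are illustrative:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # transposed convolution: sixth-level feature (e.g. 20x15x128) -> seventh-level (40x30x64)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        # "sixth residual network": two bottleneck blocks fusing the concatenated features
        self.fuse = nn.Sequential(Bottleneck(64 + 64, 64), Bottleneck(64, 64))

    def forward(self, f5, f6):
        f7 = self.up(f6)                    # seventh-level feature
        f8 = torch.cat([f5, f7], dim=1)     # eighth-level feature: high-level + low-level semantics
        return self.fuse(f8)                # second image feature, e.g. 40x30x64
```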
Step 1023: map the second image feature to the two-dimensional first pose information of the bounding box in the prediction networks.
The first pose information is two-dimensional pose information corresponding to the bounding box.
In general, the two-dimensional detection model has multiple prediction networks; a prediction network is a branch network that focuses on one item of the first pose information and can therefore be implemented as a relatively small structure.
Exemplarily, as shown in FIG. 3, the prediction networks 330 include a first prediction network 331, a second prediction network 332, a third prediction network 333 and a fourth prediction network 334, each of which includes multiple bottleneck residual blocks, e.g., 2 layers of bottleneck residual blocks.
In this example, the first pose information includes the center point, the depth, the size and the vertices, and the second image feature is input into the first prediction network 331, the second prediction network 332, the third prediction network 333 and the fourth prediction network 334 respectively.
For example, the size may refer to the length, width and height of the real object.
The second image feature is processed in the first prediction network 331 to obtain a Gaussian distribution map (center heatmap) of the bounding box, and the center point, which has a depth, is found in the Gaussian distribution map.
The second image feature is processed in the second prediction network 332 to obtain the depth of the bounding box.
The second image feature is processed in the third prediction network 333 to obtain the size (scale) of the bounding box.
The second image feature is processed in the fourth prediction network 334 to obtain the offset distances (vertexes) of the vertices of the bounding box relative to the center point; adding these offsets to the coordinates of the center point yields the coordinates of the multiple vertices.
For bounding boxes of different shapes, the number of vertices and their relative positions on the bounding box differ. For example, if the bounding box is a cuboid, it has 8 vertices, which are the corner points of each face; if the bounding box is a cylinder, it has 8 vertices, which are the intersection points on the circumscribed circles of the bottom and top faces, and so on.
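The four branches can be sketched as follows (PyTorch); reading the depth, size and vertex offsets out at the heatmap peak is one simple decoding choice, and the `Bottleneck` block is again the one sketched for the encoder:

```python
import torch
import torch.nn as nn

def branch(c_in, c_out):
    # each branch: two bottleneck residual blocks followed by a 1x1 output convolution
    return nn.Sequential(Bottleneck(c_in, c_in), Bottleneck(c_in, c_in), nn.Conv2d(c_in, c_out, 1))

class PredictionHeads(nn.Module):
    def __init__(self, c_in=64, num_vertices=8):
        super().__init__()
        self.center = branch(c_in, 1)                  # center heatmap
        self.depth = branch(c_in, 1)                   # depth of the bounding box
        self.scale = branch(c_in, 3)                   # 3D size (length, width, height)
        self.offsets = branch(c_in, 2 * num_vertices)  # per-vertex (dx, dy) offsets from the center

    def forward(self, feat):
        heat = self.center(feat).sigmoid()
        b, _, h, w = heat.shape
        idx = heat.view(b, -1).argmax(dim=1)           # peak of the heatmap = center point
        cy, cx = torch.div(idx, w, rounding_mode="floor"), idx % w
        bi = torch.arange(b)
        depth = self.depth(feat)[bi, 0, cy, cx]
        scale = self.scale(feat)[bi, :, cy, cx]
        off = self.offsets(feat)[bi, :, cy, cx].view(b, -1, 2)
        center = torch.stack([cx, cy], dim=1).float()
        vertices = center.unsqueeze(1) + off           # center coordinates plus the offsets
        return center, depth, scale, vertices
```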
In this embodiment, the two-dimensional detection model has few layers and a simple structure, uses few computing resources and has a low computation time, so real-time performance can be guaranteed.
In another exemplary embodiment of the present application, the two-dimensional detection model includes two mutually independent models, i.e., a two-stage network. As shown in FIG. 4, the two-dimensional detection model includes an object detection model 410 and an encoding model 420 in cascade, i.e., the output of the object detection model 410 is the input of the encoding model 420. The structure of the two-stage network is more complex, which avoids the situation where the predictions of a small model collapse together and makes the two-dimensional detection model more stable.
In this exemplary embodiment, as shown in FIG. 9, step 102 may include the following step 1021', step 1022' and step 1023'.
Step 1021': in the object detection model, detect part of the two-dimensional first pose information of the bounding box in the image data and the region where the target object is located in the image data.
The object detection model may be one-stage or two-stage; one-stage models include SSD (Single Shot MultiBox Detector), YOLO (you only look once) and so on, and two-stage models include the R-CNN (Region with CNN features) series, for example R-CNN, fast-RCNN and faster-RCNN.
The image data is input into the object detection model, and the object detection model can detect part of the two-dimensional first pose information of the bounding box in the image data, as well as the region where the target object is located in the image data.
In an embodiment, in the object detection model, the depth and size of the bounding box in the image data may be detected as the first pose information.
Taking YOLOv5 as an example of the object detection model, YOLOv5 is divided into three parts: a backbone network, a feature pyramid network and branch networks. The backbone network is a convolutional neural network that aggregates image features at different granularities; the feature pyramid network is a series of network layers that mix and combine image features and then pass them to the prediction layer, the prediction layer generally being an FPN (Feature Pyramid Networks) or a PANet (Path Aggregation Network); the branch networks make predictions from the image features, generating the bounding box and predicting the class of the target object, the depth of the target object and its size. Therefore, the output of YOLOv5 is nc+5+3+1.
Here, nc is the number of object classes.
The number 5 indicates 5 variables: the classification confidence (c), the center point of the bounding box (x, y), and the width and height of the bounding box (w, h).
The number 3 indicates 3 variables: the size of the target object in 3D space (length, width and height).
The number 1 indicates 1 variable: the depth of the target object in the camera coordinate system, i.e., the distance between the object and the camera at capture time.
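One possible layout of this per-box output vector is sketched below; the field ordering is an assumption for illustration, not the actual YOLOv5 tensor layout:

```python
import numpy as np

def split_prediction(vec, nc):
    """Split one nc+5+3+1 output vector (illustrative field order)."""
    vec = np.asarray(vec, dtype=np.float32)
    conf = vec[0]                      # classification confidence c
    box_2d = vec[1:5]                  # 2D box: center (x, y) and width/height (w, h)
    cls_scores = vec[5:5 + nc]         # nc per-class scores
    size_3d = vec[5 + nc:8 + nc]       # object size in 3D space (length, width, height)
    depth = vec[8 + nc]                # distance from the camera at capture time
    return conf, box_2d, cls_scores, size_3d, depth
```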
Step 1022': extract the data in the region from the image data as region data.
As shown in FIG. 4, the region where the target object is located in the image data 430 belongs to a two-dimensional bounding box; the image data is cropped based on this region and the data (pixels) in the region is taken out, realizing scaling of the image data, which is denoted as region data 431.
Step 1023': encode the region data in the encoding model to obtain part of the two-dimensional first pose information of the bounding box.
As shown in FIG. 4, the region data 431 is input into the encoding model 420 for encoding, and the remaining two-dimensional first pose information of the bounding box can be output.
In an embodiment, the region data is encoded in the encoding model to obtain the center point and the vertices of the bounding box as the first pose information.
The union of the part of the first pose information detected by the object detection model and the other part of the first pose information generated by the encoding model may be regarded as the first pose information detected by the two-dimensional detection model.
Considering that mobile terminals have limited computing resources, the encoding model is usually a model with a simple structure and a small amount of computation. Taking efficientnet-lite0 as an example of the encoding model, efficientnet-lite0 includes multiple 1×1 convolutional layers, multiple depthwise separable convolutional layers, multiple residual connection layers and multiple fully connected layers, and the last fully connected layer estimates the center point and the vertices of the target object.
Besides efficientnet-lite0, a lightweight network structure with fewer parameters or stronger expressive power may also be used as the encoding model.
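A minimal sketch of the second stage, assuming the first stage provides a 2D box `(x1, y1, x2, y2)` and that `lite_encoder` is a stand-in for a lightweight network such as efficientnet-lite0 that regresses the normalized center and eight vertices from the crop:

```python
import cv2
import numpy as np

def second_stage(image, box_2d, lite_encoder, input_size=224):
    x1, y1, x2, y2 = [int(v) for v in box_2d]
    crop = image[y1:y2, x1:x2]                               # region data
    crop = cv2.resize(crop, (input_size, input_size))        # scaling of the image data
    points = np.asarray(lite_encoder(crop)).reshape(-1, 2)   # (9, 2): center + 8 vertices, in [0, 1]
    points[:, 0] = points[:, 0] * (x2 - x1) + x1             # map back to full-image coordinates
    points[:, 1] = points[:, 1] * (y2 - y1) + y1
    return points
```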
In this embodiment, a suitable two-dimensional detection model can be selected according to the requirements of the business scenario. For videos uploaded by users, the one-stage network is faster and the two-stage network is more accurate; for real-time shooting by users, the two-stage network can add tracking for the 2D detector (tracking means, for example, capturing the possible position information of subsequent frames from the position information of the first frame), so that not every frame of image data needs to be detected, which makes both the speed and the accuracy higher.
In another exemplary embodiment of the present application, as shown in FIG. 10, step 103 may include the following steps 1031 to 1035.
Step 1031: query control points in the world coordinate system and the camera coordinate system respectively.
The EPnP algorithm introduces control points: any reference point, for example a vertex or the center point, can be expressed as a linear combination of four control points.
In general, the control points can be chosen arbitrarily; in this embodiment, points with better experimental performance may be selected in advance as control points, their coordinates recorded, and the control points loaded as hyperparameters at run time.
Step 1032: in the world coordinate system and the camera coordinate system respectively, express the center point and the vertices as weighted sums of the control points.
Let the superscript $w$ denote the world coordinate system and the superscript $c$ the camera coordinate system; let $p_i^w$ be the coordinates of the $i$-th reference point (vertex or center point) in the world coordinate system, $p_i^c$ the coordinates of the $i$-th reference point projected into the camera coordinate system, $c_j^w$ ($j = 1, \dots, 4$) the coordinates of the four control points in the world coordinate system, and $c_j^c$ the coordinates of the four control points projected into the camera coordinate system.
A reference point in the world coordinate system can be expressed with the four control points:
$$p_i^w = \sum_{j=1}^{4} a_{ij} \, c_j^w$$
where $a_{ij}$ are the homogeneous barycentric coordinates (also called weights) assigned to the control points, which are hyperparameters.
A reference point in the camera coordinate system can be expressed with the four control points:
$$p_i^c = \sum_{j=1}^{4} a_{ij} \, c_j^c$$
where $a_{ij}$ are the homogeneous barycentric coordinates (weights) assigned to the control points.
For the same reference point, the weights in the world coordinate system are the same as those in the camera coordinate system and are hyperparameters.
Step 1033: construct the constraint relationship of the depth, the center point and the vertices between the world coordinate system and the camera coordinate system.
The constraint relationship mentioned here may be the constraint between the depth, center point and vertices in the world coordinate system and the depth, center point and vertices in the camera coordinate system.
For example, according to the projection equation, the constraint relationship between the coordinates of the reference points (e.g., vertices and center point) in the world coordinate system and the reference points (e.g., vertices and center point) in the camera coordinate system is obtained.
The projection equation is as follows:
$$w_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = A \, p_i^c = \begin{bmatrix} f_u & 0 & u_c \\ 0 & f_v & v_c \\ 0 & 0 & 1 \end{bmatrix} \sum_{j=1}^{4} a_{ij} \begin{bmatrix} x_j^c \\ y_j^c \\ z_j^c \end{bmatrix}$$
where $w_i$ is the depth of the reference point (vertex or center point), $u_i$ and $v_i$ are the x- and y-coordinates of the projected reference point (vertex or center point), $A$ is a hyperparameter, $f_u$, $f_v$, $u_c$, $v_c$ are the camera intrinsics, and $(x_j^c, y_j^c, z_j^c)$ are the coordinates of the control points in the camera coordinate system, 12 unknown variables in total, which are substituted for solving.
Substituting the fact that the control-point weights sum to 1, i.e., $\sum_{j} a_{ij} = 1$, the constraint relationship of each reference point (vertex or center point) can be transformed into:
$$\sum_{j=1}^{4} a_{ij} f_u \, x_j^c + a_{ij} (u_c - u_i) \, z_j^c = 0$$
$$\sum_{j=1}^{4} a_{ij} f_v \, y_j^c + a_{ij} (v_c - v_i) \, z_j^c = 0$$
Step 1034: concatenate the constraint relationships to obtain a linear equation system.
For example, each reference point has two constraint relationships, and concatenating the constraint relationships can be represented as composing the constraint relationships of the 9 reference points into a matrix, concatenated row by row.
Step 1035: solve the linear equation system to map the vertices into three-dimensional space.
When there are $n$ reference points (vertices and center point), $n$ being a positive integer such as 9, concatenating the constraint relationships of the $n$ reference points yields the following homogeneous linear system:
$$M x = 0$$
where $x$ denotes the coordinates $(X, Y, Z)$ of the control points in the camera coordinate system, a 12-dimensional vector with 12 unknown variables for the four control points, and $M$ is a $2n \times 12$ matrix.
Therefore, $x$ belongs to the right null space of $M$; let $v_i$ be the right singular vectors of $M$ whose corresponding singular values are 0, which can be obtained by solving for the null-space eigenvectors of $M^T M$:
$$x = \sum_{i=1}^{N} \beta_i v_i$$
The solution method is to compute the eigenvalues and eigenvectors of $M^T M$; the eigenvectors with eigenvalue 0 are the $v_i$. Regardless of the number of reference points, the size of $M^T M$ is always 12×12, and the complexity of computing $M^T M$ is $O(n)$, so the overall complexity of the algorithm is $O(n)$.
$N$ is related to the number of reference points, the control points, the camera focal length and the noise; the $\beta_i$ form a linear combination that can be solved either by direct optimization when the value of $N$ is set, or by an approximate solution, to obtain a definite solution.
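The linear solve can be sketched as follows (NumPy), building $M$ from the two constraints per reference point and taking the eigenvector of $M^T M$ with the smallest eigenvalue; the $\beta_i$ refinement for $N > 1$ is omitted, so this assumes the simplest case:

```python
import numpy as np

def solve_control_points(alphas, uv, fu, fv, uc, vc):
    """alphas: (n, 4) barycentric weights; uv: (n, 2) image points of the n reference points."""
    n = uv.shape[0]
    M = np.zeros((2 * n, 12))
    for i in range(n):
        u, v = uv[i]
        for j in range(4):
            a = alphas[i, j]
            M[2 * i,     3 * j:3 * j + 3] = [a * fu, 0.0, a * (uc - u)]   # first constraint of point i
            M[2 * i + 1, 3 * j:3 * j + 3] = [0.0, a * fv, a * (vc - v)]   # second constraint of point i
    eigvals, eigvecs = np.linalg.eigh(M.T @ M)   # M^T M is always 12x12
    x = eigvecs[:, 0]                             # eigenvector of the smallest eigenvalue (null space)
    return x.reshape(4, 3)                        # four control points in the camera frame (up to scale)
```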
In another exemplary embodiment of the present application, the third pose information can be obtained by performing singular value decomposition on the second pose information. In this exemplary embodiment, as shown in FIG. 11, step 104 may include the following steps 1041 to 1047.
Step 1041: in the world coordinate system and the camera coordinate system respectively, compute a new center point based on the vertices.
For bounding boxes of shapes such as cuboids and cylinders, the coordinate system is generally established at the center point, so the center point is the origin of the bounding box.
Exemplarily, the mean of the vertices is computed in the camera coordinate system as the new center point:
$$\bar{p}^c = \frac{1}{N} \sum_{i=1}^{N} p_i^c$$
where $\bar{p}^c$ is the new center point in the camera coordinate system, $p_i^c$ are the vertices in the camera coordinate system, $N$ is the number of vertices, and $i$ is an integer.
The mean of the vertices is computed in the world coordinate system as the new center point:
$$\bar{p}^w = \frac{1}{N} \sum_{i=1}^{N} p_i^w$$
where $\bar{p}^w$ is the new center point in the world coordinate system, $p_i^w$ are the vertices in the world coordinate system, and $N$ is the number of vertices.
Step 1042: in the world coordinate system and the camera coordinate system respectively, remove the new center point from the vertices.
In the camera coordinate system, the new center point is subtracted from each of the vertices to achieve de-centering, i.e., removing the new center point:
$$q_i^c = p_i^c - \bar{p}^c$$
where $q_i^c$ are the de-centered vertices in the camera coordinate system, $p_i^c$ are the vertices in the camera coordinate system, and $\bar{p}^c$ is the new center point in the camera coordinate system.
In the world coordinate system, the new center point is subtracted from each of the vertices to achieve de-centering, i.e., removing the new center point:
$$q_i^w = p_i^w - \bar{p}^w$$
where $q_i^w$ are the de-centered vertices in the world coordinate system, $p_i^w$ are the vertices in the world coordinate system, and $\bar{p}^w$ is the new center point in the world coordinate system.
Step 1043: after removing the new center point from the vertices, compute the self-conjugate matrix.
After de-centering, the self-conjugate matrix $H$ can be computed, where $H$ is the product of the vertices in the camera coordinate system and the transpose of the vertices in the world coordinate system:
$$H = \sum_{i=1}^{N} q_i^c \left( q_i^w \right)^T$$
where $N$ is the number of vertices, $q_i^c$ are the de-centered vertices in the camera coordinate system, $q_i^w$ are the de-centered vertices in the world coordinate system, and $T$ denotes the transpose.
Step 1044: perform singular value decomposition on the self-conjugate matrix to obtain the product of a first orthogonal matrix, a diagonal matrix, and the transpose of a second orthogonal matrix.
In this embodiment, the coordinates in both coordinate systems are known, i.e., the coordinates of the vertices in the world coordinate system and those in the camera coordinate system, so the pose transformation between the two coordinate systems can be obtained following the idea of SVD (Singular Value Decomposition); that is, SVD is performed on the self-conjugate matrix $H$, which can be expressed as:
$$H = U \Lambda V^T$$
where $U$ is the first orthogonal matrix, $\Lambda$ is the diagonal matrix, $V$ is the second orthogonal matrix, and $T$ denotes the transpose.
Step 1045: compute the product of the second orthogonal matrix and the transpose of the first orthogonal matrix as the orientation of the target object in the world coordinate system.
Compute $X = V U^T$, where $U$ is the first orthogonal matrix, $V$ is the second orthogonal matrix, and $T$ denotes the transpose.
In most cases, $R = X$, and $R$ is the orientation of the target object in the world coordinate system.
Step 1046: rotate the new center point in the world coordinate system according to the orientation to obtain a projection point.
Step 1047: subtract the projection point from the new center point in the camera coordinate system to obtain the position of the target object in the world coordinate system.
Subtracting the new center point in the world coordinate system rotated according to this orientation from the new center point in the camera coordinate system gives the position of the target object in the world coordinate system, expressed as follows:
$$t = \bar{p}^c - R \, \bar{p}^w$$
where $t$ is the position of the target object in the world coordinate system, $\bar{p}^c$ is the new center point in the camera coordinate system, $R$ is the orientation of the target object in the world coordinate system, and $\bar{p}^w$ is the new center point in the world coordinate system.
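Steps 1041 to 1047 can be sketched with NumPy as a standard least-squares rigid alignment; the cross matrix is formed here as $\sum_i q_i^w (q_i^c)^T$ so that the identities $R = V U^T$ and $t = \bar{p}^c - R \bar{p}^w$ hold, and a determinant check is added to guard against a reflection, which the description above does not spell out:

```python
import numpy as np

def object_pose(verts_cam, verts_world):
    """verts_cam, verts_world: (N, 3) arrays of corresponding box vertices."""
    c_cam = verts_cam.mean(axis=0)            # new center point, camera frame (step 1041)
    c_world = verts_world.mean(axis=0)        # new center point, world frame
    q_cam = verts_cam - c_cam                 # de-centered vertices (step 1042)
    q_world = verts_world - c_world
    H = q_world.T @ q_cam                     # cross matrix (step 1043)
    U, S, Vt = np.linalg.svd(H)               # H = U diag(S) Vt (step 1044)
    R = Vt.T @ U.T                            # orientation of the target object (step 1045)
    if np.linalg.det(R) < 0:                  # reflection guard (not part of the original text)
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_cam - R @ c_world                   # position of the target object (steps 1046-1047)
    return R, t
```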
Another Embodiment
FIG. 5 is a flowchart of an object pose detection method provided by another embodiment of the present application. This embodiment is based on the foregoing embodiment and adds special-effects processing. The method includes the following steps:
Step 501: acquire image data.
The image data contains a target object.
Step 502: input the image data into a two-dimensional detection model and detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data.
The bounding box is used to detect the target object.
Step 503: map the first pose information to three-dimensional second pose information.
Step 504: detect third pose information of the target object according to the second pose information.
Step 505: determine a three-dimensional material adapted to the target object.
In this embodiment, the server side, for example a server, may collect in advance three-dimensional materials adapted to the type of the target object according to the requirements of the business scenario. The mobile terminal may download these materials from the server side to the local device in advance according to certain rules (e.g., selecting basic materials or popular materials), or may download specified materials from the server side to the local device at use time according to an operation triggered by the user; alternatively, the user may locally select a three-dimensional material adapted to the target object on the mobile terminal, or part of the data may be extracted from the data corresponding to the target object and converted to three dimensions as the material, and so on.
For example, the material may be text data, image data, animation data, and so on.
For example, as shown in FIG. 2, if the target object 201 is a beverage of a certain brand, the LOGO 203 of this brand may be used as the material.
As another example, if the target object is a ball (such as a football, basketball, volleyball, badminton or table tennis ball), a special-effect animation adapted to the ball (such as feathers, lightning or flames) may be used as the material.
As another example, if the target object is a container holding water, aquatic plants and animals (such as waterweed, fish or shrimp) may be used as the material.
Step 506: configure fourth pose information for the material.
The fourth pose information is adapted to the first pose information and/or the third pose information.
Step 507: display the material in the image data according to the fourth pose information.
In this embodiment, a special-effect generator may be preset, and the first pose information and/or the third pose information may be input into the special-effect generator to generate the fourth pose information for the material; the material is rendered in the image data according to this fourth pose information so that the material matches the situation of the target object, generating a more natural special effect.
Exemplarily, part of the first pose information includes the size of the bounding box, and the third pose information includes the orientation and position of the target object.
In this example, the position of the target object may be offset outward by a specified distance, for example, offset by 10 centimeters with the front face of the target object as the reference plane, and the offset position is taken as the position of the material.
The fourth pose information may include the position of the material.
The size of the bounding box is reduced to a specified ratio (e.g., 10%), and the reduced size is taken as the size of the material.
The fourth pose information may include the size of the material.
The orientation of the target object is configured as the orientation of the material, so that the material faces the same direction as the target object.
The fourth pose information may include the orientation of the material.
The above fourth pose information is only an example; when implementing the embodiments of the present application, other fourth pose information may be set according to the actual situation. For example, the size of the bounding box may be enlarged to a specified ratio (e.g., 1.5 times) and the enlarged size taken as the size of the material, or the orientation of the target object may be rotated by a specified angle (e.g., 90°) and the rotated orientation taken as the orientation of the material, and so on. In addition to the above fourth pose information, those skilled in the art may also adopt other fourth pose information according to actual needs.
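A small illustrative sketch of step 506 under these example values (10 cm offset, 10% scale, same orientation); the choice of the box z-axis as the "front" of the object, as well as the function and argument names, are assumptions for illustration:

```python
import numpy as np

def material_pose(R_obj, t_obj, box_size, offset_m=0.10, scale_ratio=0.10):
    forward = R_obj[:, 2]                      # assumption: box z-axis points out of the front face
    position = t_obj + offset_m * forward      # offset the material 10 cm from the object's front
    size = np.asarray(box_size) * scale_ratio  # shrink the bounding-box size to 10% for the material
    orientation = R_obj                        # material faces the same direction as the object
    return position, size, orientation
```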
For video data, after the user finishes adding the special effect to the adapted data, the adapted data may be published, for example as a short video, livestream data, and so on.
For the method embodiments, for simplicity of description, the embodiments are expressed as a series of action combinations; however, those skilled in the art should know that multiple orders of the actions exist, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously.
Another Embodiment
FIG. 6 is a structural block diagram of an object pose detection apparatus provided by another embodiment of the present application. The object pose detection apparatus may include the following modules:
an image data acquisition module 601 configured to acquire image data, the image data containing a target object;
a first pose information detection module 602 configured to input the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object;
a second pose information mapping module 603 configured to map the first pose information to three-dimensional second pose information;
a third pose information detection module 604 configured to detect third pose information of the target object according to the second pose information.
In an embodiment of the present application, the two-dimensional detection model includes an encoder, a decoder and prediction networks;
the first pose information detection module 602 includes:
an encoding module configured to encode the image data in the encoder to obtain a first image feature;
a decoding module configured to decode the first image feature in the decoder to obtain a second image feature;
a mapping module configured to map the second image feature to the two-dimensional first pose information of the bounding box in the prediction networks.
In an embodiment of the present application, the encoder includes a convolutional layer, a first residual network, a second residual network, a third residual network, a fourth residual network and a fifth residual network, each of the first residual network, the second residual network, the third residual network, the fourth residual network and the fifth residual network including at least one bottleneck residual block;
the encoding module is further configured to:
convolve the image data in the convolutional layer to obtain a first-level feature;
process the first-level feature in the first residual network to obtain a second-level feature;
process the second-level feature in the second residual network to obtain a third-level feature;
process the third-level feature in the third residual network to obtain a fourth-level feature;
process the fourth-level feature in the fourth residual network to obtain a fifth-level feature;
process the fifth-level feature in the fifth residual network to obtain a sixth-level feature.
In an embodiment of the present application, the number of bottleneck residual blocks in the first residual network is smaller than that in the second residual network, the number in the second residual network is smaller than that in the third residual network, the number in the third residual network is smaller than that in the fourth residual network, and the number in the fourth residual network equals that in the fifth residual network;
the dimension of the second-level feature is higher than that of the third-level feature, the dimension of the third-level feature is higher than that of the fourth-level feature, the dimension of the fourth-level feature is higher than that of the fifth-level feature, and the dimension of the fifth-level feature is higher than that of the sixth-level feature.
In an embodiment of the present application, the decoder includes a transposed convolution layer and a sixth residual network, the sixth residual network including multiple bottleneck residual blocks;
the decoding module is further configured to:
convolve the sixth-level feature in the transposed convolution layer to obtain a seventh-level feature;
concatenate the fifth-level feature and the seventh-level feature into an eighth-level feature;
process the eighth-level feature in the sixth residual network to obtain the second image feature.
In an embodiment of the present application, the dimension of the second image feature is higher than that of the sixth-level feature.
In an embodiment of the present application, the prediction networks include a first prediction network, a second prediction network, a third prediction network and a fourth prediction network, each of the first prediction network, the second prediction network, the third prediction network and the fourth prediction network including multiple bottleneck residual blocks;
the mapping module is further configured to:
process the second image feature in the first prediction network to obtain the center point of the bounding box;
process the second image feature in the second prediction network to obtain the depth of the bounding box;
process the second image feature in the third prediction network to obtain the size of the bounding box;
process the second image feature in the fourth prediction network to obtain the offset distances of the vertices of the bounding box relative to the center point.
In another embodiment of the present application, the two-dimensional detection model includes an object detection model and an encoding model, the object detection model and the encoding model being in cascade;
the first pose information detection module 602 includes:
an object detection module configured to detect, in the object detection model, part of the two-dimensional first pose information of the bounding box in the image data and the region where the target object is located in the image data;
a region data extraction module configured to extract the data in the region from the image data as region data;
a region data encoding module configured to encode the region data in the encoding model to obtain part of the two-dimensional first pose information of the bounding box.
In an embodiment of the present application, the object detection module is further configured to:
detect, in the object detection model, the depth and size of the bounding box in the image data;
the region data encoding module is further configured to:
encode the region data in the encoding model to obtain the center point and the vertices of the bounding box.
In an embodiment of the present application, the first pose information includes the center point, the vertices and the depth;
the second pose information mapping module 603 includes:
a control point query module configured to query control points in the world coordinate system and the camera coordinate system respectively;
a point representation module configured to express the center point and the vertices as weighted sums of the control points in the world coordinate system and the camera coordinate system respectively;
a constraint relationship construction module configured to construct the constraint relationship of the depth, the center point and the vertices between the world coordinate system and the camera coordinate system;
a linear equation generation module configured to concatenate the constraint relationships to obtain a linear equation system;
a linear equation solving module configured to solve the linear equation system to map the vertices into three-dimensional space.
In an embodiment of the present application, the third pose information detection module 604 includes:
a center point calculation module configured to compute a new center point based on the vertices in the world coordinate system and the camera coordinate system respectively;
a center point removal module configured to remove the new center point from the vertices in the world coordinate system and the camera coordinate system respectively;
a self-conjugate matrix calculation module configured to compute a self-conjugate matrix, the self-conjugate matrix being the product of the vertices in the camera coordinate system and the transpose of the vertices in the world coordinate system;
a singular value decomposition module configured to perform singular value decomposition on the self-conjugate matrix to obtain the product of a first orthogonal matrix, a diagonal matrix, and the transpose of a second orthogonal matrix;
an orientation calculation module configured to compute the product of the second orthogonal matrix and the transpose of the first orthogonal matrix as the orientation of the target object in the world coordinate system;
a projection point calculation module configured to rotate the new center point in the world coordinate system according to the orientation to obtain a projection point;
a position calculation module configured to subtract the projection point from the new center point in the camera coordinate system to obtain the position of the target object in the world coordinate system.
In an embodiment of the present application, the apparatus further includes:
a material determination module configured to determine a three-dimensional material adapted to the target object;
a fourth pose information configuration module configured to configure fourth pose information for the material, the fourth pose information being pose information adapted to the first pose information and the third pose information;
a material display module configured to display the material in the image data according to the fourth pose information.
In an embodiment of the present application, the first pose information includes the size of the bounding box, and the third pose information includes the orientation and position of the target object;
the fourth pose information configuration module includes:
a position offset module configured to offset the position of the target object by a specified distance and take the offset position as the position of the material;
a size reduction module configured to reduce the size of the bounding box to a specified ratio and take the reduced size as the size of the material;
an orientation configuration module configured to configure the orientation of the target object as the orientation of the material.
The object pose detection apparatus provided by the embodiments of the present application can execute the object pose detection method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method.
Another Embodiment
FIG. 7 is a structural schematic diagram of a computer device provided by another embodiment of the present application. FIG. 7 shows a block diagram of an exemplary computer device 12 suitable for implementing the embodiments of the present application. The computer device 12 shown in FIG. 7 is only an example.
As shown in FIG. 7, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components, the different system components including the system memory 28 and the processing unit 16.
The system memory 28 may also be referred to as memory.
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache 32. The computer device 12 may include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be configured to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 7). Although not shown in FIG. 7, a magnetic disk drive for reading and writing a removable non-volatile magnetic disk (e.g., a floppy disk) and an optical disk drive for reading and writing a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 via one or more data media interfaces. The system memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of multiple embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the system memory 28; such program modules 42 include an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the present application.
The computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet, through a network adapter 20. As shown in FIG. 7, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. Other hardware and/or software modules may also be used in conjunction with the computer device 12, including: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes a variety of functional applications and data processing by running the programs stored in the system memory 28, for example implementing the object pose detection method provided by the embodiments of the present application.
Another Embodiment
Another embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the multiple processes of the above object pose detection method are implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.
The computer-readable storage medium may include, for example, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. The computer-readable storage medium is, for example, an electrical connection having one or more conductors, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus or device.

Claims (16)

  1. An object pose detection method, comprising:
    acquiring image data, the image data containing a target object;
    inputting the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object;
    mapping the first pose information to three-dimensional second pose information;
    detecting third pose information of the target object according to the second pose information.
  2. The method according to claim 1, wherein the two-dimensional detection model comprises an encoder, a decoder and prediction networks;
    the inputting the image data into the two-dimensional detection model to detect the two-dimensional first pose information of the bounding box comprises:
    encoding the image data in the encoder to obtain a first image feature;
    decoding the first image feature in the decoder to obtain a second image feature;
    mapping the second image feature to the two-dimensional first pose information of the bounding box in the prediction networks.
  3. The method according to claim 2, wherein the encoder comprises a convolutional layer, a first residual network, a second residual network, a third residual network, a fourth residual network and a fifth residual network, each of the first residual network, the second residual network, the third residual network, the fourth residual network and the fifth residual network comprising at least one bottleneck residual block;
    the encoding the image data in the encoder to obtain the first image feature comprises:
    convolving the image data in the convolutional layer to obtain a first-level feature;
    processing the first-level feature in the first residual network to obtain a second-level feature;
    processing the second-level feature in the second residual network to obtain a third-level feature;
    processing the third-level feature in the third residual network to obtain a fourth-level feature;
    processing the fourth-level feature in the fourth residual network to obtain a fifth-level feature;
    processing the fifth-level feature in the fifth residual network to obtain a sixth-level feature.
  4. The method according to claim 3, wherein the number of bottleneck residual blocks in the first residual network is smaller than the number of bottleneck residual blocks in the second residual network, the number of bottleneck residual blocks in the second residual network is smaller than the number of bottleneck residual blocks in the third residual network, the number of bottleneck residual blocks in the third residual network is smaller than the number of bottleneck residual blocks in the fourth residual network, and the number of bottleneck residual blocks in the fourth residual network is equal to the number of bottleneck residual blocks in the fifth residual network;
    the dimension of the second-level feature is higher than the dimension of the third-level feature, the dimension of the third-level feature is higher than the dimension of the fourth-level feature, the dimension of the fourth-level feature is higher than the dimension of the fifth-level feature, and the dimension of the fifth-level feature is higher than the dimension of the sixth-level feature.
  5. The method according to claim 3, wherein the decoder comprises a transposed convolution layer and a sixth residual network, the sixth residual network comprising a plurality of bottleneck residual blocks;
    the decoding the first image feature in the decoder to obtain the second image feature comprises:
    convolving the sixth-level feature in the transposed convolution layer to obtain a seventh-level feature;
    concatenating the fifth-level feature and the seventh-level feature into an eighth-level feature;
    processing the eighth-level feature in the sixth residual network to obtain the second image feature.
  6. The method according to claim 5, wherein the dimension of the second image feature is higher than the dimension of the sixth-level feature.
  7. The method according to claim 5, wherein the prediction networks comprise a first prediction network, a second prediction network, a third prediction network and a fourth prediction network, each of the first prediction network, the second prediction network, the third prediction network and the fourth prediction network comprising a plurality of bottleneck residual blocks;
    the mapping the second image feature to the two-dimensional first pose information of the bounding box in the prediction networks comprises:
    processing the second image feature in the first prediction network to obtain a center point of the bounding box;
    processing the second image feature in the second prediction network to obtain a depth of the bounding box;
    processing the second image feature in the third prediction network to obtain a size of the bounding box;
    processing the second image feature in the fourth prediction network to obtain offset distances of vertices of the bounding box relative to the center point.
  8. The method according to claim 1, wherein the two-dimensional detection model comprises an object detection model and an encoding model, the object detection model and the encoding model being in cascade;
    the inputting the image data into the two-dimensional detection model to detect the two-dimensional first pose information of the bounding box comprises:
    detecting, in the object detection model, part of the two-dimensional first pose information of the bounding box in the image data and a region where the target object is located in the image data;
    extracting data in the region from the image data as region data;
    encoding the region data in the encoding model to obtain part of the two-dimensional first pose information of the bounding box.
  9. The method according to claim 8, wherein
    the detecting, in the object detection model, the part of the two-dimensional first pose information of the bounding box in the image data and the region where the target object is located in the image data comprises:
    detecting, in the object detection model, a depth and a size of the bounding box in the image data;
    the encoding the region data in the encoding model to obtain the part of the two-dimensional first pose information of the bounding box comprises:
    encoding the region data in the encoding model to obtain a center point and vertices of the bounding box.
  10. The method according to claim 1, wherein the first pose information comprises a center point, vertices and a depth;
    the mapping the first pose information to the three-dimensional second pose information comprises:
    querying control points in a world coordinate system and a camera coordinate system respectively;
    expressing the center point and the vertices as weighted sums of the control points in the world coordinate system and the camera coordinate system respectively;
    constructing a constraint relationship of the depth, the center point and the vertices between the world coordinate system and the camera coordinate system;
    concatenating the constraint relationships to obtain a linear equation system;
    solving the linear equation system to map the vertices into three-dimensional space.
  11. The method according to claim 1, wherein the detecting the third pose information of the target object according to the second pose information comprises:
    computing a new center point based on vertices in a world coordinate system and a camera coordinate system respectively;
    removing the new center point from the vertices in the world coordinate system and the camera coordinate system respectively;
    computing a self-conjugate matrix, the self-conjugate matrix being the product of the vertices in the camera coordinate system and the transpose of the vertices in the world coordinate system;
    performing singular value decomposition on the self-conjugate matrix to obtain the product of a first orthogonal matrix, a diagonal matrix, and the transpose of a second orthogonal matrix;
    computing the product of the second orthogonal matrix and the transpose of the first orthogonal matrix as the orientation of the target object in the world coordinate system;
    rotating the new center point in the world coordinate system according to the orientation to obtain a projection point;
    subtracting the projection point from the new center point in the camera coordinate system to obtain the position of the target object in the world coordinate system.
  12. The method according to any one of claims 1-11, further comprising:
    determining a three-dimensional material adapted to the target object;
    configuring fourth pose information for the material, the fourth pose information being pose information adapted to the first pose information and the third pose information;
    displaying the material in the image data according to the fourth pose information.
  13. The method according to claim 12, wherein the first pose information comprises the size of the bounding box, and the third pose information comprises the orientation and position of the target object;
    the configuring the fourth pose information for the material comprises:
    offsetting the position of the target object by a specified distance, and taking the offset position as the position of the material;
    reducing the size of the bounding box to a specified ratio, and taking the reduced size as the size of the material;
    configuring the orientation of the target object as the orientation of the material.
  14. An object pose detection apparatus, comprising:
    an image data acquisition module configured to acquire image data, the image data containing a target object;
    a first pose information detection module configured to input the image data into a two-dimensional detection model to detect two-dimensional first pose information of a three-dimensional bounding box when the bounding box is projected onto the image data, the bounding box being used to detect the target object;
    a second pose information mapping module configured to map the first pose information to three-dimensional second pose information;
    a third pose information detection module configured to detect third pose information of the target object according to the second pose information.
  15. A computer device, comprising:
    at least one processor; and
    a memory configured to store at least one program,
    wherein the at least one processor is configured to execute the at least one program to implement the object pose detection method according to any one of claims 1-13.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the object pose detection method according to any one of claims 1-13.
PCT/CN2021/111502 2021-08-09 2021-08-09 物体姿态的检测方法、装置、计算机设备和存储介质 WO2023015409A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202180002185.8A CN113795867A (zh) 2021-08-09 2021-08-09 物体姿态的检测方法、装置、计算机设备和存储介质
PCT/CN2021/111502 WO2023015409A1 (zh) 2021-08-09 2021-08-09 物体姿态的检测方法、装置、计算机设备和存储介质
EP21953048.2A EP4365841A1 (en) 2021-08-09 2021-08-09 Object pose detection method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/111502 WO2023015409A1 (zh) 2021-08-09 2021-08-09 物体姿态的检测方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2023015409A1 true WO2023015409A1 (zh) 2023-02-16

Family

ID=78877394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/111502 WO2023015409A1 (zh) 2021-08-09 2021-08-09 物体姿态的检测方法、装置、计算机设备和存储介质

Country Status (3)

Country Link
EP (1) EP4365841A1 (zh)
CN (1) CN113795867A (zh)
WO (1) WO2023015409A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965628A (zh) * 2023-03-16 2023-04-14 湖南大学 一种工件涂装质量在线动态检测方法及检测系统
CN116773534A (zh) * 2023-08-15 2023-09-19 宁德思客琦智能装备有限公司 一种检测方法及装置、电子设备、计算机可读介质
CN117476509A (zh) * 2023-12-27 2024-01-30 联合富士半导体有限公司 一种用于半导体芯片产品的激光雕刻装置及控制方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115723152B (zh) * 2022-11-17 2023-06-06 中国人民解放军总医院第五医学中心 一种智能护理机器人

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137644A1 (en) * 2016-11-11 2018-05-17 Qualcomm Incorporated Methods and systems of performing object pose estimation
CN111968235A (zh) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 一种物体姿态估计方法、装置、系统和计算机设备
CN112116653A (zh) * 2020-11-23 2020-12-22 华南理工大学 一种多张rgb图片的物体姿态估计方法
CN112767489A (zh) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 一种三维位姿确定方法、装置、电子设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115052A (en) * 1998-02-12 2000-09-05 Mitsubishi Electric Information Technology Center America, Inc. (Ita) System for reconstructing the 3-dimensional motions of a human figure from a monocularly-viewed image sequence
CN110163197B (zh) * 2018-08-24 2023-03-10 腾讯科技(深圳)有限公司 目标检测方法、装置、计算机可读存储介质及计算机设备
CN112163541A (zh) * 2020-10-09 2021-01-01 上海云绅智能科技有限公司 一种3d目标检测方法、装置、电子设备和存储介质
CN112270249B (zh) * 2020-10-26 2024-01-23 湖南大学 一种融合rgb-d视觉特征的目标位姿估计方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137644A1 (en) * 2016-11-11 2018-05-17 Qualcomm Incorporated Methods and systems of performing object pose estimation
CN111968235A (zh) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 一种物体姿态估计方法、装置、系统和计算机设备
CN112116653A (zh) * 2020-11-23 2020-12-22 华南理工大学 一种多张rgb图片的物体姿态估计方法
CN112767489A (zh) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 一种三维位姿确定方法、装置、电子设备及存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115965628A (zh) * 2023-03-16 2023-04-14 湖南大学 一种工件涂装质量在线动态检测方法及检测系统
CN115965628B (zh) * 2023-03-16 2023-06-02 湖南大学 一种工件涂装质量在线动态检测方法及检测系统
CN116773534A (zh) * 2023-08-15 2023-09-19 宁德思客琦智能装备有限公司 一种检测方法及装置、电子设备、计算机可读介质
CN116773534B (zh) * 2023-08-15 2024-03-05 宁德思客琦智能装备有限公司 一种检测方法及装置、电子设备、计算机可读介质
CN117476509A (zh) * 2023-12-27 2024-01-30 联合富士半导体有限公司 一种用于半导体芯片产品的激光雕刻装置及控制方法
CN117476509B (zh) * 2023-12-27 2024-03-19 联合富士半导体有限公司 一种用于半导体芯片产品的激光雕刻装置及控制方法

Also Published As

Publication number Publication date
EP4365841A1 (en) 2024-05-08
CN113795867A (zh) 2021-12-14

Similar Documents

Publication Publication Date Title
Sahu et al. Artificial intelligence (AI) in augmented reality (AR)-assisted manufacturing applications: a review
Brachmann et al. Visual camera re-localization from RGB and RGB-D images using DSAC
WO2023015409A1 (zh) 物体姿态的检测方法、装置、计算机设备和存储介质
CN108898630B (zh) 一种三维重建方法、装置、设备和存储介质
US9916679B2 (en) Deepstereo: learning to predict new views from real world imagery
CN111243093B (zh) 三维人脸网格的生成方法、装置、设备及存储介质
US11127189B2 (en) 3D skeleton reconstruction from images using volumic probability data
Wang et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
GB2573170A (en) 3D Skeleton reconstruction from images using matching 2D skeletons
US11704853B2 (en) Techniques for feature-based neural rendering
CN113807361B (zh) 神经网络、目标检测方法、神经网络训练方法及相关产品
JP2023532285A (ja) アモーダル中心予測のためのオブジェクト認識ニューラルネットワーク
Baudron et al. E3d: event-based 3d shape reconstruction
CN117218246A (zh) 图像生成模型的训练方法、装置、电子设备及存储介质
CA3172140A1 (en) Full skeletal 3d pose recovery from monocular camera
EP4292059A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
Han et al. RO-MAP: Real-Time Multi-Object Mapping with Neural Radiance Fields
US20230298243A1 (en) 3d digital avatar generation from a single or few portrait images
Schick et al. Real-time GPU-based voxel carving with systematic occlusion handling
US11461956B2 (en) 3D representation reconstruction from images using volumic probability data
Afzal et al. Incremental reconstruction of moving object trajectory
CN116385643B (zh) 虚拟形象生成、模型的训练方法、装置及电子设备
GB2573172A (en) 3D skeleton reconstruction with 2D processing reducing 3D processing
CN116012666B (zh) 图像生成、模型的训练、信息重建方法、装置及电子设备
Yasin et al. Motion tracking, retrieval and 3d reconstruction from video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21953048

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2021953048

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021953048

Country of ref document: EP

Effective date: 20240202

WWE Wipo information: entry into national phase

Ref document number: 2024103165

Country of ref document: RU

NENP Non-entry into the national phase

Ref country code: DE