TWI777538B - Image processing method, electronic device and computer-readable storage media - Google Patents


Info

Publication number
TWI777538B
TWI777538B
Authority
TW
Taiwan
Prior art keywords
target object
target
depth
image
reference node
Prior art date
Application number
TW110115664A
Other languages
Chinese (zh)
Other versions
TW202143100A (en)
Inventor
王燦
李杰鋒
劉文韜
錢晨
Original Assignee
大陸商北京市商湯科技開發有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大陸商北京市商湯科技開發有限公司
Publication of TW202143100A
Application granted
Publication of TWI777538B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The present disclosure provides an image processing method, an electronic device, and a computer-readable storage medium. The method includes: identifying a target region where a target object in a first image is located; based on the target region where the target object is located, determining first two-dimensional position information of multiple key points representing the pose of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system; and, based on the first two-dimensional position information and relative depths of the multiple key points and the absolute depth of the reference node, determining the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

Description

Image processing method, electronic device and computer-readable storage medium

The present invention relates to the technical field of image processing, and in particular to an image processing method, an electronic device, and a computer-readable storage medium.

Three-dimensional human pose detection is widely used in security, gaming, entertainment, and other fields. Current 3D human pose detection methods typically identify the first two-dimensional position information of human-body key points in an image, and then convert that first two-dimensional position information into three-dimensional position information according to predetermined positional relationships between the key points.

The human poses obtained by current 3D human pose detection methods suffer from large errors.

Embodiments of the present invention provide at least an image processing method, an electronic device, and a computer-readable storage medium.

In a first aspect, an embodiment of the present invention provides an image processing method, including: identifying a target region where a target object in a first image is located; based on the target region where the target object is located, determining first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and, based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points of the target object and the absolute depth corresponding to the reference node, determining the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

In this way, the embodiments of the present invention can obtain more accurate three-dimensional position information of the multiple key points of the target object in the camera coordinate system. This three-dimensional position information characterizes the three-dimensional pose of the target object; the higher its accuracy, the higher the accuracy of the resulting three-dimensional pose.

In a possible implementation, the method further includes: obtaining the pose of the target object based on the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

In this way, because the three-dimensional position information obtained by the embodiments of the present invention has higher accuracy, the pose of the target object determined from that information is correspondingly more accurate.

In a possible implementation, identifying the target region where the target object in the first image is located includes: performing feature extraction on the first image to obtain a feature map of the first image; determining, based on the feature map, multiple target bounding boxes from multiple pre-generated candidate bounding boxes; and determining, based on the multiple target bounding boxes, the target region where the target object is located.

In this way, determining the target region in two steps makes it possible to accurately detect the position of each target object in the first image, improving the completeness of the human-body information and the detection accuracy in the subsequent key-point detection process.

In a possible implementation, determining the target region where the target object is located based on the multiple target bounding boxes includes: determining a feature sub-map of each target bounding box based on the multiple target bounding boxes and the feature map; and performing bounding-box regression processing on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target region where the target object is located.

In this way, performing bounding-box regression on the feature sub-maps corresponding to the multiple target bounding boxes allows the position of each target object in the first image to be detected accurately.

In a possible implementation, determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target region where the target object is located includes: determining a target feature map of the target object based on the target region where the target object is located and the first image; performing depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object; and obtaining, based on the normalized absolute depth and the parameter matrix of the camera, the absolute depth of the reference node of the target object in the camera coordinate system.

In this way, it is possible to avoid, as far as possible, the situation in which predicting the absolute depth of the reference node directly from the target feature map yields different absolute depths for different first images captured by different cameras from the same viewing angle and the same position, owing to differences in the cameras' intrinsic parameters.

In a possible implementation, performing depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object includes: determining an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that first pixel; determining, based on the target feature map corresponding to the target object, second two-dimensional position information of the reference node corresponding to the target object in the first image; determining, based on the second two-dimensional position information and the initial depth image, an initial depth value of the reference node corresponding to the target object; and determining, based on the initial depth value of the reference node and the target feature map, the normalized absolute depth of the reference node of the target object.

In this way, the normalized absolute depth of the reference node obtained through this process is more accurate.

In a possible implementation, determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map includes: performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenating the feature vector with the initial depth value to obtain a concatenated vector, and performing at least one stage of second convolution processing on the concatenated vector to obtain a correction value for the initial depth value; and obtaining the normalized absolute depth based on the correction value and the initial depth value.

In a possible implementation, the parameter matrix includes the focal length of the camera, and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera includes: obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth, the focal length, the area of the target region, and the area of the target bounding box.

In a possible implementation, the image processing method is applied in a pre-trained neural network, the neural network including three branch networks: a target detection network, a key-point detection network, and a depth prediction network. The target detection network is used to obtain the target region where the target object is located; the key-point detection network is used to obtain the first two-dimensional position information of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

In this way, the target detection network, the key-point detection network, and the depth prediction network together form an end-to-end target-object pose detection framework. Processing the first image with this framework to obtain the three-dimensional position information of the multiple key points of each target object in the camera coordinate system gives faster processing and higher recognition accuracy.
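For illustration, such a three-branch layout might be organized as in the following minimal PyTorch sketch. All module names, channel sizes, and head structures here are illustrative assumptions, not the architecture claimed by the patent; the sketch only shows one backbone feeding a detection head, a key-point head (one 2D heatmap plus one relative-depth map per key point), and a reference-node depth head.

```python
import torch
import torch.nn as nn

class ThreeBranchPoseNet(nn.Module):
    """Sketch of one backbone with three task heads: detection, key points, depth."""
    def __init__(self, num_keypoints: int = 17, feat_channels: int = 256):
        super().__init__()
        # Shared backbone producing a feature map of the first image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Branch 1: target detection head (here: 4 box coordinates per location).
        self.det_head = nn.Conv2d(feat_channels, 4, 1)
        # Branch 2: key-point head, one 2D heatmap and one relative-depth map
        # per key point (hence 2 channels per key point).
        self.kpt_head = nn.Conv2d(feat_channels, num_keypoints * 2, 1)
        # Branch 3: depth head, a normalized absolute depth for the reference node.
        self.depth_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_channels, 1)
        )

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)
        return self.det_head(feats), self.kpt_head(feats), self.depth_head(feats)

boxes, keypoints, root_depth = ThreeBranchPoseNet()(torch.randn(1, 3, 256, 256))
```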

In a second aspect, an embodiment of the present invention further provides an image processing apparatus, including: a recognition module, configured to identify a target region where a target object in a first image is located; a first detection module, configured to determine, based on the target region where the target object is located, first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and a second detection module, configured to determine, based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points of the target object and the absolute depth corresponding to the reference node, the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

In a possible implementation, the second detection module is further configured to obtain the pose of the target object based on the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

In a possible implementation, when identifying the target region where the target object in the first image is located, the recognition module is configured to: perform feature extraction on the first image to obtain a feature map of the first image; determine, based on the feature map, multiple target bounding boxes from multiple pre-generated candidate bounding boxes; and determine, based on the multiple target bounding boxes, the target region where the target object is located.

In a possible implementation, when determining the target region where the target object is located based on the multiple target bounding boxes, the recognition module is configured to: determine a feature sub-map of each target bounding box based on the multiple target bounding boxes and the feature map; and perform bounding-box regression processing on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target region where the target object is located.

In a possible implementation, when determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target region where the target object is located, the first detection module is configured to: determine a target feature map of the target object based on the target region where the target object is located and the first image; perform depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object; and obtain, based on the normalized absolute depth and the parameter matrix of the camera, the absolute depth of the reference node of the target object in the camera coordinate system.

In a possible implementation, when performing depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object, the first detection module is configured to: determine an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that first pixel; determine, based on the target feature map corresponding to the target object, second two-dimensional position information of the reference node corresponding to the target object in the first image; determine, based on the second two-dimensional position information and the initial depth image, an initial depth value of the reference node corresponding to the target object; and determine, based on the initial depth value of the reference node and the target feature map, the normalized absolute depth of the reference node of the target object.

In a possible implementation, when determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map, the first detection module is configured to: perform at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenate the feature vector with the initial depth value to obtain a concatenated vector, and perform at least one stage of second convolution processing on the concatenated vector to obtain a correction value for the initial depth value; and obtain the normalized absolute depth based on the correction value and the initial depth value.

In a possible implementation, the parameter matrix includes the focal length of the camera, and when obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera, the first detection module is configured to: obtain the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth, the focal length, the area of the target region, and the area of the target bounding box.

In a possible implementation, the image processing apparatus implements image processing by means of a pre-trained neural network, the neural network including three branch networks: a target detection network, a key-point detection network, and a depth prediction network. The target detection network is used to obtain the target region where the target object is located; the key-point detection network is used to obtain the first two-dimensional position information of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

In a third aspect, an embodiment of the present invention further provides a computer device, including a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor. When the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method in the first aspect or in any possible implementation of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When run by a processor, the computer program executes the steps of the image processing method in the first aspect or in any possible implementation of the first aspect.

In order to make the above objects, features, and advantages of the present invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

In order to make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Three-dimensional human pose detection methods usually use a neural network to identify the first two-dimensional position information of human-body key points in an image to be recognized, and then convert the first two-dimensional position information of each key point into three-dimensional position information according to the mutual positional relationships between the key points (such as the connection relationships between different key points and the distance ranges between adjacent key points). However, human body shapes are complex and variable, and the positional relationships between the key points of different human bodies differ, so the three-dimensional human poses obtained by this approach contain large errors.

In addition, the accuracy of current 3D human pose detection methods rests on accurate estimation of the human-body key points. Because of occlusion by clothing, limbs, and so on, in many cases the key points cannot be accurately identified from the image, which further enlarges the error of the three-dimensional human pose obtained by the above methods.

The defects of the above solutions are results obtained by the inventors through practice and careful study. Therefore, the process of discovering the above problems, as well as the solutions proposed below for those problems, should all be regarded as the inventors' contributions to the present invention made in the course of the invention.

It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

Based on the above research, the present invention provides an image processing method and apparatus. By identifying the target region where a target object is located in a first image and, based on that target region, determining the first two-dimensional position information in the first image of multiple key points that characterize the pose of the target object, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node in the camera coordinate system, the three-dimensional position information of the multiple key points of the target object in the camera coordinate system can be obtained more accurately from the first two-dimensional position information, the relative depths, and the absolute depth.

To facilitate understanding of this embodiment, an image processing method disclosed in an embodiment of the present invention is first introduced in detail. The execution subject of the image processing method provided by the embodiments of the present invention is generally a computer device with certain computing capability, including, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the image processing method may be implemented by a processor invoking computer-readable instructions stored in a memory.

The image processing method provided by the embodiments of the present invention is described below.

Referring to FIG. 1, which is a flowchart of the image processing method provided by an embodiment of the present invention, the method includes steps S101 to S103:

S101: identify a target region where a target object in a first image is located.

S102: based on the target region where the target object is located, determine first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system.

S103: based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points of the target object, and the absolute depth corresponding to the reference node, determine the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.
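Before the detailed discussion of each step, the overall flow of S101 to S103 can be pictured with the following schematic Python sketch. The two stub functions stand in for the trained networks described later and return fixed dummy values purely so the control flow runs; they are assumptions, not the patent's interfaces.

```python
import numpy as np

# Stand-in stubs for the trained networks of S101/S102.
def detect_regions(image):
    return [np.array([40.0, 16.0, 80.0, 120.0])]      # one (x, y, w, h) target region

def predict_keypoints_and_depths(image, region):
    kpts_2d = np.array([[60.0, 40.0], [70.0, 90.0]])  # first 2D positions, shape (J, 2)
    rel_depth = np.array([0.0, 0.1])                  # relative depth per key point
    root_abs_depth = 3.0                              # reference-node absolute depth
    return kpts_2d, rel_depth, root_abs_depth

def process_image(image, K):
    """S101-S103 as a schematic pipeline; K is the 3x3 camera intrinsic matrix."""
    poses_3d = []
    for region in detect_regions(image):                                        # S101
        kpts_2d, rel_depth, root_z = predict_keypoints_and_depths(image, region)  # S102
        z = root_z + rel_depth                      # S103: per-key-point absolute depth
        uv1 = np.concatenate([kpts_2d, np.ones((len(kpts_2d), 1))], axis=1)
        poses_3d.append((np.linalg.inv(K) @ uv1.T).T * z[:, None])             # (J, 3)
    return poses_3d

K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
poses = process_image(np.zeros((480, 640, 3)), K)
```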

S101 to S103 are described in detail below.

I: In S101 above, the first image includes at least one target object. A target object is an object whose pose is to be determined, such as a person, an animal, a robot, or a vehicle.

In a possible implementation, when the first image includes more than one target object, the categories of different target objects may be the same or different. For example, all of the target objects may be people, or all of them may be vehicles. As another example, the target objects in the first image may include both people and animals, or both people and vehicles; the target object category is determined according to the needs of the actual application scenario.

The target region where the target object is located is the region of the first image that contains the target object.

Exemplarily, referring to FIG. 2, an embodiment of the present invention provides a specific method for identifying the target region where the target object in the first image is located, including:

S201: perform feature extraction on the first image to obtain a feature map of the first image. Here, for example, a neural network may be used to perform the feature extraction.

S202: based on the feature map, determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and determine, based on the target bounding boxes, the target region corresponding to the target object.

In a specific implementation, a bounding-box prediction algorithm, such as RoIAlign or ROI-Pooling, may be used to obtain the multiple target bounding boxes. Taking RoIAlign as an example, RoIAlign can traverse the multiple pre-generated candidate bounding boxes and determine, for each candidate bounding box, a region-of-interest (ROI) value for whether the sub-image corresponding to that box belongs to any target object in the first image. The higher the ROI value, the greater the probability that the corresponding sub-image belongs to some target object. After the ROI value of each candidate bounding box is determined, multiple target bounding boxes are selected from the candidate bounding boxes in descending order of their ROI values.
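The selection by descending ROI value amounts to scoring each candidate box and keeping the highest-scoring ones, as in this illustrative numpy sketch (the scores are taken as given; computing them is the job of RoIAlign or ROI-Pooling and is not shown):

```python
import numpy as np

def select_target_boxes(candidate_boxes: np.ndarray,
                        roi_scores: np.ndarray,
                        top_k: int = 100) -> np.ndarray:
    """Keep the top_k candidate boxes with the highest ROI values.

    candidate_boxes: (N, 4) array of (x, y, w, h) boxes.
    roi_scores:      (N,) ROI value per box; higher means the box is more
                     likely to contain a target object.
    """
    order = np.argsort(roi_scores)[::-1]          # descending by ROI value
    return candidate_boxes[order[:top_k]]

boxes = np.random.rand(500, 4)
scores = np.random.rand(500)
targets = select_target_boxes(boxes, scores, top_k=10)  # (10, 4)
```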

A target bounding box is, for example, a rectangle. The information of a target bounding box includes, for example, the coordinates of one of its vertices in the first image together with the height and width values of the box; alternatively, it includes the coordinates of one of its vertices in the feature map of the first image together with the height and width values of the box.

After the multiple target bounding boxes are obtained, the target regions corresponding to all the target objects in the first image are determined based on the multiple target bounding boxes.

Referring to FIG. 3, an embodiment of the present invention provides a specific example of determining the target region corresponding to the target object based on the target bounding boxes, including the following steps.

S301: determine a feature sub-map of each target bounding box based on the multiple target bounding boxes and the feature map.

In a specific implementation, when the information of a target bounding box includes the coordinates of one of its vertices in the first image together with the box's height and width values, the feature points in the feature map have a certain positional mapping relationship with the pixels in the first image. According to the information of the target bounding box and the mapping relationship between the feature map and the first image, the feature sub-map corresponding to each target bounding box is determined from the feature map of the first image.

When the information of a target bounding box includes the coordinates of one of its vertices in the feature map of the first image together with the box's height and width values, the feature sub-map corresponding to each target bounding box can be determined from the feature map of the first image directly based on the target bounding box.
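The positional mapping between the first image and its feature map is, under the common assumption of a fixed downsampling stride, just a coordinate scaling; a minimal sketch of cutting out a feature sub-map under that assumption:

```python
import numpy as np

def crop_feature_submap(feature_map: np.ndarray, box_xywh, stride: int = 8):
    """Cut the feature sub-map for one target bounding box.

    feature_map: (C, H, W) features of the first image.
    box_xywh:    (x, y, w, h) of the box in first-image pixel coordinates.
    stride:      assumed total downsampling factor between image and features.
    """
    x, y, w, h = box_xywh
    # Map image coordinates onto feature-map coordinates.
    x0, y0 = int(x / stride), int(y / stride)
    x1, y1 = int(np.ceil((x + w) / stride)), int(np.ceil((y + h) / stride))
    return feature_map[:, y0:y1, x0:x1]

sub = crop_feature_submap(np.zeros((256, 64, 64)), (40, 16, 80, 120))
```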

S302: perform bounding-box regression processing on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target region where the target object is located.

Here, for example, a bounding-box regression algorithm may be used to perform bounding-box regression processing on the feature sub-map corresponding to each target bounding box, so as to obtain multiple bounding boxes that each contain a complete target object.

Using the bounding-box regression algorithm, the target object can be accurately determined within the corresponding target region, distinguishing the target object from the image background and thereby reducing the influence of the background on subsequent image processing.
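Bounding-box regression is commonly parameterized as center and size offsets (dx, dy, dw, dh) applied to each box; the sketch below follows that widely used R-CNN-style convention, which is an illustrative assumption rather than the patent's exact formulation:

```python
import numpy as np

def apply_box_deltas(boxes: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Refine (x, y, w, h) boxes with regression deltas (dx, dy, dw, dh)."""
    x, y, w, h = boxes.T
    dx, dy, dw, dh = deltas.T
    cx, cy = x + 0.5 * w, y + 0.5 * h        # box centers
    cx, cy = cx + dx * w, cy + dy * h        # shift centers
    w, h = w * np.exp(dw), h * np.exp(dh)    # rescale sizes
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, w, h], axis=1)

refined = apply_box_deltas(np.array([[10.0, 10.0, 50.0, 80.0]]),
                           np.array([[0.1, -0.05, 0.2, 0.0]]))
```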

Each of the multiple bounding boxes corresponds to one target object, and the region determined by the bounding box corresponding to a target object is the target region where that target object is located.

At this point, the number of target regions obtained is consistent with the number of target objects in the first image, and each target object corresponds to one target region. If different target objects occlude each other, the target regions corresponding to the mutually occluding target objects overlap to a certain degree.

In another embodiment of the present invention, other target detection algorithms may also be used to identify the target region where the target object in the first image is located. For example, a semantic segmentation algorithm may be used to determine the semantic segmentation result of each pixel in the first image; the positions in the first image of the pixels belonging to different target objects are then determined according to the segmentation result; finally, the minimum enclosing box of the pixels belonging to the same target object is computed, and the region corresponding to the minimum enclosing box is taken as the target region where that target object is located.
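For the segmentation-based alternative, the minimum enclosing box of one object's pixels is straightforward to compute; a small sketch assuming a per-pixel label map:

```python
import numpy as np

def min_enclosing_box(seg_labels: np.ndarray, object_id: int):
    """Return the (x, y, w, h) minimum box around all pixels of one object.

    seg_labels: (H, W) semantic-segmentation result, one object id per pixel.
    """
    ys, xs = np.nonzero(seg_labels == object_id)
    x0, y0 = xs.min(), ys.min()
    return (x0, y0, xs.max() - x0 + 1, ys.max() - y0 + 1)

labels = np.zeros((100, 100), dtype=int)
labels[20:60, 30:50] = 1
print(min_enclosing_box(labels, 1))  # (30, 20, 20, 40)
```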

II: In S102 above, the image coordinate system is the two-dimensional coordinate system defined by the length and width directions of the first image; the camera coordinate system is the three-dimensional coordinate system defined by the direction of the camera's optical axis together with two directions in the plane that contains the camera's optical center and is perpendicular to the optical axis.

The key points of the target object are, for example, pixels that are located on the target object, that bear mutual relationships to one another, and that, when connected according to those relationships, can characterize the pose of the target object. For example, when the target object is a human body, the key points include the key points of the body's joints. In the image coordinate system a key point is expressed as a two-dimensional coordinate value; in the camera coordinate system it is expressed as a three-dimensional coordinate value.

In a specific implementation, for example, a key-point detection network may be used to perform key-point detection based on the target feature map of the target object, obtaining the two-dimensional position information of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object. For the way the target feature map is obtained, see the description of S401 below, which is not repeated here.

The reference node is, for example, any pixel at a predetermined part of the target object. Exemplarily, the reference node can be predetermined according to actual needs. For example, when the target object is a human body, a pixel on the pelvis may be determined as the reference node, or any pixel on the body may be determined as the reference node, or a pixel at the center of the chest and abdomen may be determined as the reference node; it can be set as required.

The relative depth of each key point with respect to the reference node of the target object is, for example, the difference between the coordinate value of the key point in the depth direction of the camera coordinate system and the coordinate value of the reference node in that direction. The absolute depth of a key point is, for example, its coordinate value in the depth direction of the camera coordinate system.

Referring to FIG. 4, an embodiment of the present invention provides a specific method for determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target region corresponding to the target object, including the following steps.

S401: determine a target feature map of the target object based on the target region where the target object is located and the first image.

Here, for example, the target feature map of the target object may be determined from the feature map of the first image, obtained by performing feature extraction on the first image, together with the target region.

Here, the feature points in the feature map extracted from the first image have a certain positional mapping relationship with the pixels in the first image. After the target region of each target object is obtained, the position of each target object in the feature map of the first image can be determined according to this mapping relationship, and the target feature map of each target object is then cropped out of the feature map of the first image.

S402: perform depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object.

Here, because different cameras have different intrinsic parameters, the imaging of the target object differs between cameras. Directly determining the absolute depth of the target object's reference node would therefore introduce errors caused by the camera intrinsics. In the embodiments of the present invention, to reduce the influence on the absolute depth of image differences caused by different camera intrinsics, the normalized absolute depth of the reference node of the target object may first be obtained based on the target feature map, and the absolute depth of the reference node is then obtained from the normalized absolute depth and the camera intrinsics. The normalized absolute depth is the absolute depth obtained after normalizing the reference node using the camera's parameter matrix; once the normalized absolute depth has been obtained, the camera's parameter matrix can be used to recover the absolute depth of the reference node.

In a possible implementation, for example, a pre-trained depth prediction network may be used to perform depth detection processing on the target feature map to obtain the normalized absolute depth of the reference node of the target object.

In another embodiment of the present invention, referring to FIG. 5, another specific method for obtaining the normalized absolute depth of the reference node is provided, including the following steps.

S501: determine an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that first pixel.

In a specific implementation, the first pixels in the initial depth image correspond one-to-one to the second pixels in the first image; that is, the coordinate value of a first pixel in the initial depth image is the same as the coordinate value of the corresponding second pixel in the first image.

Exemplarily, a depth prediction network may be used to determine the initial depth value of each pixel (second pixel) in the first image. The initial depth values of the pixels constitute the initial depth image of the first image; the pixel value of any pixel (first pixel) in the initial depth image is the initial depth value of the pixel (second pixel) at the corresponding position in the first image.

S502: based on the target feature map corresponding to the target object, determine second two-dimensional position information of the reference node corresponding to the target object in the first image, and determine, based on the second two-dimensional position information and the initial depth image, an initial depth value of the reference node corresponding to the target object.

Here, the target feature map corresponding to the target object may, for example, be the target feature map determined for each target object from the feature map of the first image based on the target region corresponding to that target object.

After the target feature map corresponding to each target object has been obtained, for example, a pre-trained reference-node detection network may be used to determine, based on the target feature map, the second two-dimensional position information of the target object's reference node in the first image. The second two-dimensional position information is then used to locate the pixel corresponding to the reference node in the initial depth image, and the pixel value of that pixel is taken as the initial depth value of the reference node.
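Reading the reference node's initial depth out of the initial depth image then amounts to indexing the depth map at the node's second two-dimensional position; a nearest-pixel sketch (bilinear interpolation would also be a reasonable choice):

```python
import numpy as np

def initial_root_depth(initial_depth: np.ndarray, root_xy) -> float:
    """Initial depth of a reference node from the initial depth image.

    initial_depth: (H, W) map whose pixel values are the per-pixel initial
                   depth values of the first image in the camera frame.
    root_xy:       the node's 2D position in first-image coordinates.
    """
    x, y = root_xy
    h, w = initial_depth.shape
    # Round to the nearest pixel and clamp to the image bounds.
    xi = int(np.clip(round(x), 0, w - 1))
    yi = int(np.clip(round(y), 0, h - 1))
    return float(initial_depth[yi, xi])

d0 = initial_root_depth(np.full((480, 640), 2.5), (321.4, 203.9))  # -> 2.5
```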

S503: determine the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map.

Exemplarily, at least one stage of first convolution processing may be performed on the target feature map corresponding to the target object to obtain a feature vector of the target object; the feature vector is concatenated with the initial depth value to obtain a concatenated vector, and at least one stage of second convolution processing is performed on the concatenated vector to obtain a correction value for the initial depth value; the normalized absolute depth is then obtained based on the correction value and the initial depth value.

Here, for example, a neural network for adjusting the initial depth value may be employed. The neural network includes multiple convolutional layers, some of which perform the at least one stage of first convolution processing on the target feature map, while the others perform the at least one stage of second convolution processing on the concatenated vector to obtain the correction value. The initial depth value is then adjusted according to the correction value to obtain the normalized absolute depth of the reference node of the target object.
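Schematically, the refinement described here reduces the target feature map to a feature vector, appends the initial depth value, and regresses a correction that is added back. In the PyTorch sketch below the layer sizes are assumptions, and the second stage is shown with fully connected layers for brevity where the text speaks of convolution:

```python
import torch
import torch.nn as nn

class DepthRefiner(nn.Module):
    def __init__(self, in_channels: int = 256):
        super().__init__()
        # First stage: target feature map -> feature vector.
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Second stage: concatenated vector -> scalar correction value.
        self.stage2 = nn.Sequential(nn.Linear(64 + 1, 64), nn.ReLU(),
                                    nn.Linear(64, 1))

    def forward(self, target_feats: torch.Tensor, init_depth: torch.Tensor):
        vec = self.stage1(target_feats)                   # (B, 64)
        joint = torch.cat([vec, init_depth[:, None]], 1)  # append initial depth
        correction = self.stage2(joint).squeeze(1)
        return init_depth + correction                    # normalized absolute depth

z_norm = DepthRefiner()(torch.randn(2, 256, 32, 32), torch.tensor([2.1, 3.4]))
```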

Following S402 above, the specific method provided by the embodiment of the present invention for determining the absolute depth of the reference node of the target object in the camera coordinate system further includes:

S403: obtain, based on the normalized absolute depth and the parameter matrix of the camera, the absolute depth of the reference node of the target object in the camera coordinate system.

In a specific implementation, when different first images are processed, they may have been captured by different cameras, and different cameras may have different intrinsic parameters. Here, the camera intrinsics include, for example, the camera's focal length on the x-axis, its focal length on the y-axis, and the x-axis and y-axis coordinates of the camera's optical center in the camera coordinate system.

With different camera intrinsics, even first images captured from the same viewing angle and the same position differ. If the absolute depth of the reference node were predicted directly from the target feature map, different absolute depths would be obtained for different first images captured by different cameras from the same viewing angle and the same position.

To avoid this, the embodiments of the present invention directly predict the normalized depth of the reference node, the normalized absolute depth being obtained without considering the camera intrinsics; the absolute depth of the reference node is then recovered from the camera intrinsics and the normalized absolute depth.

When recovering the absolute depth of the reference node from the normalized absolute depth, for example, the absolute depth of the reference node of the target object in the camera coordinate system may be obtained based on the normalized absolute depth, the focal length, the area of the target region, and the area of the target bounding box.

Exemplarily, the normalized absolute depth and the absolute depth of the reference node of any target object satisfy the following formula (1):

$$\tilde{Z} = \frac{Z}{f}\sqrt{\frac{S_{R}}{S_{B}}} \tag{1}$$

where $\tilde{Z}$ denotes the normalized absolute depth of the reference node; $Z$ denotes the absolute depth of the reference node; $S_{R}$ denotes the area of the target region; $S_{B}$ denotes the area of the target bounding box; and $f$ denotes the camera focal length. Exemplarily, the camera coordinate system is a three-dimensional coordinate system with three coordinate axes x, y, and z; the origin of the camera coordinate system is the optical center of the camera; the optical axis of the camera is the z-axis of the camera coordinate system; the plane that passes through the optical center and is perpendicular to the z-axis is the plane containing the x-axis and the y-axis. $f_x$ is the focal length of the camera on the x-axis, $f_y$ is the focal length of the camera on the y-axis, and $f$ may be taken as $\sqrt{f_x f_y}$.

It should be noted here that, as described in S202 above, there are multiple target bounding boxes determined by RoIAlign, and the areas of the multiple target bounding boxes are all equal.

Since the camera focal length is already determined when the camera captures the first image, and the target region and the target bounding box are already determined when the target region is identified, the absolute depth of the target object's reference node can be obtained from formula (1) once its normalized absolute depth is available.
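By way of illustration, a minimal sketch of this recovery step is given below. It assumes formula (1) in the form $\tilde{Z} = (Z/f)\sqrt{S_{\text{region}}/S_{\text{box}}}$ with $f = \sqrt{f_x f_y}$; the function and argument names are chosen for exposition only and are not prescribed by the embodiment.

import math

def recover_absolute_depth(normalized_depth: float,
                           fx: float, fy: float,
                           region_area: float, box_area: float) -> float:
    """Recover the reference node's absolute depth Z from its normalized
    absolute depth by inverting formula (1):
        Z = Z_norm * f * sqrt(S_box / S_region), with f = sqrt(fx * fy).
    """
    f = math.sqrt(fx * fy)                     # combined focal length (assumed form)
    scale = math.sqrt(box_area / region_area)  # undo the apparent-size normalization
    return normalized_depth * f * scale

Because the pooled target bounding boxes all share the same area, only the target region's area varies from object to object in this computation.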

III: In S103 above, suppose that each target object includes J key points and that there are N target objects in the first image. The three-dimensional poses of the N target objects are denoted as $\{P_m\}_{m=1}^{N}$, where the three-dimensional pose of the m-th target object, $P_m$, can be written as $P_m = \{(x_j^m, y_j^m, z_j^m)\}_{j=1}^{J}$. Here, $x_j^m$ denotes the coordinate value of the j-th key point of the m-th target object along the x-axis of the camera coordinate system; $y_j^m$ denotes its coordinate value along the y-axis of the camera coordinate system; and $z_j^m$ denotes its coordinate value along the z-axis of the camera coordinate system.

The target regions in which the N target objects are located are denoted as $\{b_m\}_{m=1}^{N}$, where the target region of the m-th target object, $b_m$, is written as $b_m = (x_m, y_m, w_m, h_m)$. Here, $x_m$ and $y_m$ denote the coordinate values of the vertex at the top-left corner of the target region, and $w_m$ and $h_m$ denote the width and the height of the target region, respectively.

The three-dimensional poses of the N target objects relative to their reference nodes are denoted as $\{\hat{P}_m\}_{m=1}^{N}$, where the three-dimensional pose of the m-th target object relative to its reference node, $\hat{P}_m$, is written as $\hat{P}_m = \{(u_j^m, v_j^m, \Delta z_j^m)\}_{j=1}^{J}$. Here, $u_j^m$ denotes the x-axis coordinate value in the image coordinate system of the j-th key point of the m-th target object, and $v_j^m$ denotes its y-axis coordinate value in the image coordinate system; that is, $(u_j^m, v_j^m)$ is the two-dimensional coordinate of the j-th key point of the m-th target object in the image coordinate system. $\Delta z_j^m$ denotes the depth of the j-th node of the m-th target object relative to the reference node of the m-th target object.

Using the camera intrinsic matrix K, the three-dimensional pose of the m-th target object is obtained by back-projection, where the three-dimensional coordinate information of the j-th node of the m-th target object satisfies the following formula (2):

$$\begin{bmatrix} x_j^m \\ y_j^m \\ z_j^m \end{bmatrix} = \left( Z_m + \Delta z_j^m \right) \cdot K^{-1} \begin{bmatrix} u_j^m \\ v_j^m \\ 1 \end{bmatrix} \tag{2}$$

where $Z_m$ denotes the absolute depth value, in the camera coordinate system, of the reference node of the m-th target object, so that $Z_m + \Delta z_j^m$ is the absolute depth of the j-th node. It should be noted here that $Z_m$ is obtained according to the embodiment corresponding to formula (1) above.

The intrinsic matrix K is, for example:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where $f_x$ is the focal length of the camera along the x-axis of the camera coordinate system; $f_y$ is the focal length of the camera along the y-axis of the camera coordinate system; $c_x$ is the x-axis coordinate value of the camera's optical center in the camera coordinate system; and $c_y$ is the y-axis coordinate value of the camera's optical center in the camera coordinate system.

Through the above process, the three-dimensional position information in the camera coordinate system of each of the target object's multiple key points can be obtained. For the m-th target object, the three-dimensional position information corresponding to its J key points characterizes the three-dimensional pose of the m-th target object.

By identifying the target region where the target object is located in the first image and, based on the target region, determining the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of the target object's reference node in the camera coordinate system, embodiments of the present invention obtain, from the target object's first two-dimensional position information, relative depths and absolute depth, more accurate three-dimensional position information of the target object's multiple key points in the camera coordinate system.

In another embodiment of the present invention, a further image processing method is provided, wherein the image processing method is applied in a pre-trained neural network.

The neural network includes three branch networks: a target detection network, a key point detection network and a depth prediction network. The target detection network is used to obtain the target region where the target object is located; the key point detection network is used to obtain the first two-dimensional position information in the first image of the target object's multiple key points and the relative depth of each key point with respect to the target object's reference node; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

For the specific working process of the above three branch networks, reference may be made to the foregoing embodiments, and details are not repeated here.

In embodiments of the present invention, the three branch networks (target detection, key point detection and depth prediction) form an end-to-end target object pose detection framework. Processing the first image with this framework yields the three-dimensional position information in the camera coordinate system of the multiple key points of each target object in the first image, with faster processing and higher recognition accuracy.

Referring to FIG. 6, an embodiment of the present invention further provides a specific example of a target object pose detection framework, which includes three network branches: a target detection network, a key point detection network, and a depth prediction network. The target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, based on the feature map, RoIAlign is used to determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and bounding box regression is performed on the multiple target bounding boxes to obtain the target region corresponding to each target object. The target feature map corresponding to the target region is passed to the key point detection network and the depth prediction network, as sketched below.
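For illustration only, a highly simplified sketch of this detection stage is shown here using torchvision's roi_align. The backbone, the source of the candidate boxes, the pooled output size and the feature stride are assumptions, not the prescribed implementation.

import torch
from torchvision.ops import roi_align

def detect_target_regions(image: torch.Tensor,
                          backbone: torch.nn.Module,
                          candidate_boxes: torch.Tensor,
                          bbox_head: torch.nn.Module):
    """Sketch of the target detection branch: feature extraction, RoIAlign
    over candidate boxes, then bounding box regression.

    image           : (1, 3, H, W) first image
    candidate_boxes : (M, 5) rois given as (batch_index, x1, y1, x2, y2)
    Returns fixed-size target feature maps and the regressed box offsets.
    """
    feature_map = backbone(image)                          # (1, C, H', W')
    # Fixed pooled size, so every target bounding box has the same area
    target_feats = roi_align(feature_map, candidate_boxes,
                             output_size=(7, 7), spatial_scale=1.0 / 16)
    box_deltas = bbox_head(target_feats)                   # bounding box regression
    return target_feats, box_deltas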

The key point detection network, based on the target feature map, determines the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, and the relative depth of each key point with respect to the target object's reference node. For each target feature map, the first two-dimensional position information and the relative depths of its key points constitute the three-dimensional pose of the target object in that target feature map. The three-dimensional pose at this stage takes the object itself as the reference.

The depth prediction network, based on the target feature map, determines the absolute depth of the target object's reference node in the camera coordinate system.

Finally, the three-dimensional position information in the camera coordinate system of the target object's multiple key points is determined from the target object's first two-dimensional position information, the relative depths, and the absolute depth of the reference node. For each target object, the three-dimensional position information in the camera coordinate system of its multiple key points constitutes the target object's three-dimensional pose in the camera coordinate system. The three-dimensional pose at this stage takes the camera as the reference.

Referring to FIG. 7, an embodiment of the present invention further provides another specific example of a target object pose detection framework, which includes a target detection network, a key point detection network, and a depth prediction network. The target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, based on the feature map, RoIAlign is used to determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and bounding box regression is performed on the multiple target bounding boxes to obtain the target region corresponding to each target object. The target feature map corresponding to the target region is passed to the key point detection network and the depth prediction network.

The key point detection network, based on the target feature map, determines the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, and the relative depth of each key point with respect to the target object's reference node. For each target feature map, the first two-dimensional position information and the relative depths of its key points constitute the three-dimensional pose of the target object in that target feature map. The three-dimensional pose at this stage takes the object itself as the reference.

The depth prediction network obtains an initial depth image based on the first image. Based on the target feature map corresponding to the target object, it determines the second two-dimensional position information in the first image of the reference node corresponding to the target object, and determines the initial depth value of the target object's reference node from the second two-dimensional position information and the initial depth image. It also performs at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; the feature vector and the reference node's initial depth value are concatenated to form a concatenated vector, and at least one stage of second convolution processing is performed on the concatenated vector to obtain a correction value for the initial depth value. The correction value is added to the reference node's initial depth value to obtain the reference node's normalized absolute depth value.
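A minimal sketch of this refinement step follows. The layer configuration, dimensions and module names are assumptions for illustration; the embodiment only requires at least one stage of each convolution.

import torch
import torch.nn as nn

class RootDepthRefiner(nn.Module):
    """Sketch of the depth prediction branch: convolve the target feature
    map into a feature vector, concatenate the reference node's initial
    depth value, and regress a correction to it."""

    def __init__(self, in_channels: int = 256, feat_dim: int = 64):
        super().__init__()
        # "at least one stage of first convolution processing"
        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # "at least one stage of second convolution processing" applied to
        # the concatenated vector, modeled here as a 1x1 convolution
        self.second_conv = nn.Conv2d(feat_dim + 1, 1, kernel_size=1)

    def forward(self, target_feat: torch.Tensor,
                initial_depth: torch.Tensor) -> torch.Tensor:
        # target_feat  : (N, C, 7, 7) per-object target feature maps
        # initial_depth: (N,) values sampled from the initial depth image at
        #                each reference node's second 2D position
        vec = self.first_conv(target_feat)             # (N, feat_dim, 1, 1)
        d = initial_depth.view(-1, 1, 1, 1)            # (N, 1, 1, 1)
        fused = torch.cat([vec, d], dim=1)             # concatenated vector
        correction = self.second_conv(fused).view(-1)  # (N,)
        return initial_depth + correction              # normalized absolute depth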

Then, the absolute depth value of the reference node is recovered via formula (1) above, and the three-dimensional position information in the camera coordinate system of the target object's multiple key points is determined from the target object's first two-dimensional position information, the relative depths, and the absolute depth of the reference node. For each target object, the three-dimensional position information in the camera coordinate system of its multiple key points constitutes the target object's three-dimensional pose in the camera coordinate system. The three-dimensional pose at this stage takes the camera as the reference.

With either of the above two target object pose detection frameworks, the three-dimensional position information in the camera coordinate system of the multiple key points of each target object in the first image can be obtained, with faster processing and higher recognition accuracy.

Those skilled in the art can understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.

Based on the same inventive concept, an embodiment of the present invention further provides an image processing apparatus corresponding to the image processing method. Since the principle by which the apparatus in the embodiments of the present invention solves the problem is similar to that of the above image processing method, the implementation of the apparatus may refer to the implementation of the method, and repeated descriptions are omitted.

Referring to FIG. 8, which is a schematic diagram of an image processing apparatus provided by an embodiment of the present invention, the apparatus includes an identification module 81, a first detection module 82, and a second detection module 83. The identification module 81 is configured to identify the target region where the target object in the first image is located. The first detection module 82 is configured to determine, based on the target region corresponding to the target object, the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of the target object's reference node in the camera coordinate system. The second detection module 83 is configured to determine, based on the target object's first two-dimensional position information, the relative depths, and the absolute depth, the three-dimensional position information of the target object's multiple key points in the camera coordinate system.

In a possible implementation, when identifying the target region where the target object in the first image is located, the identification module 81 is configured to: perform feature extraction on the first image to obtain a feature map of the first image; determine, based on the feature map, multiple target bounding boxes from multiple pre-generated candidate bounding boxes; and determine, based on the target bounding boxes, the target region corresponding to the target object.

In a possible implementation, when determining the target region corresponding to the target object based on the target bounding boxes, the identification module 81 is configured to: determine, based on the multiple target bounding boxes and the feature map, a feature sub-map corresponding to each target bounding box; and perform bounding box regression on the feature sub-maps respectively corresponding to the multiple target bounding boxes to obtain the target region corresponding to the target object.

In a possible implementation, when determining the absolute depth of the target object's reference node in the camera coordinate system based on the target region corresponding to the target object, the first detection module 82 is configured to: determine, based on the target region corresponding to the target object and the first image, the target feature map corresponding to the target object; perform depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the target object's reference node; and obtain, based on the normalized absolute depth and the camera's parameter matrix, the absolute depth of the target object's reference node in the camera coordinate system.

In a possible implementation, when performing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the target object's reference node, the first detection module 82 is configured to: obtain an initial depth image based on the first image, wherein the pixel value of any first pixel point in the initial depth image characterizes the initial depth value, in the camera coordinate system, of the second pixel point in the first image whose position corresponds to the first pixel point; determine, based on the target feature map corresponding to the target object, the second two-dimensional position information in the first image of the reference node corresponding to the target object, and determine, based on the second two-dimensional position information and the initial depth image, the initial depth value of the reference node corresponding to the target object; and determine, based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object, the normalized absolute depth of the target object's reference node.

In a possible implementation, when determining the normalized absolute depth of the target object's reference node based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object, the first detection module 82 is configured to: perform at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenate the feature vector and the initial depth value to form a concatenated vector, and perform at least one stage of second convolution processing on the concatenated vector to obtain a correction value for the initial depth value; and obtain the normalized absolute depth based on the correction value and the initial depth value.

In a possible implementation, a pre-trained neural network is deployed in the image processing apparatus, the neural network including three branch networks: a target detection network, a key point detection network and a depth prediction network. The target detection network is used to obtain the target region where the target object is located; the key point detection network is used to obtain the first two-dimensional position information in the first image of the target object's multiple key points and the relative depth of each key point with respect to the target object's reference node; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

By identifying the target region where the target object is located in the first image and, based on the target region, determining the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of the target object's reference node in the camera coordinate system, embodiments of the present invention obtain, from the target object's first two-dimensional position information, relative depths and absolute depth, more accurate three-dimensional position information of the target object's multiple key points in the camera coordinate system.

In addition, in embodiments of the present invention, the three branch networks (target detection, key point detection and depth prediction) form an end-to-end target object pose detection framework. Processing the first image with this framework yields the three-dimensional position information in the camera coordinate system of the multiple key points of each target object in the first image, with faster processing and higher recognition accuracy.

For descriptions of the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, which will not be detailed here.

An embodiment of the present invention further provides a computer device 10. As shown in FIG. 9, a schematic structural diagram of the computer device 10, the device includes a processor 11 and a memory 12. The memory 12 stores machine-readable instructions executable by the processor 11; when the computer device runs, the machine-readable instructions are executed by the processor to implement the following steps: identifying the target region where the target object in the first image is located; determining, based on the target region corresponding to the target object, the first two-dimensional position information in the first image of the multiple key points characterizing the target object's pose, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of the target object's reference node in the camera coordinate system; and determining, based on the target object's first two-dimensional position information, the relative depths, and the absolute depth, the three-dimensional position information of the target object's multiple key points in the camera coordinate system.

For the specific execution process of the above instructions, reference may be made to the steps of the image processing method described in the embodiments of the present invention, which will not be repeated here.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the image processing method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

An embodiment of the present invention further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the image processing method described in the above method embodiments. For details, reference may be made to the above method embodiments, which will not be repeated here.

The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of units is only a division by logical function, and there may be other ways of division in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solutions of the present invention, in essence, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the embodiments described above are merely specific implementations of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still, within the technical scope disclosed by the present invention, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

10: computer device; 11: processor; 12: memory; 81: identification module; 82: first detection module; 83: second detection module; S101~S103, S201~S202, S301~S302, S401~S403, S501~S503: steps

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present invention and, together with the description, serve to explain the technical solutions of the present invention. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 shows a flowchart of an image processing method provided by an embodiment of the present invention;
FIG. 2 shows a flowchart of a specific method, provided by an embodiment of the present invention, for identifying the target region where the target object is located in the first image;
FIG. 3 shows a specific example, provided by an embodiment of the present invention, of determining the target region corresponding to the target object based on target bounding boxes;
FIG. 4 shows a flowchart of a specific method, provided by an embodiment of the present invention, for determining the absolute depth of the target object's reference node in the camera coordinate system;
FIG. 5 shows a flowchart of another specific method, provided by an embodiment of the present invention, for obtaining the normalized absolute depth of the reference node;
FIG. 6 shows a specific example of a target object pose detection framework provided by an embodiment of the present invention;
FIG. 7 shows a specific example of another target object pose detection framework provided by an embodiment of the present invention;
FIG. 8 shows a schematic diagram of an image processing apparatus provided by an embodiment of the present invention;
FIG. 9 shows a schematic diagram of a computer device provided by an embodiment of the present invention.

S101~S103: steps

Claims (10)

1. An image processing method, applied to a computer device, the method comprising: identifying a target region in which a target object in a first image is located; determining a target feature map of the target object based on the target region in which the target object is located and the first image; performing depth recognition processing on the target feature map corresponding to the target object to obtain a normalized absolute depth of a reference node of the target object; obtaining, based on the normalized absolute depth and a parameter matrix of a camera, an absolute depth of the reference node of the target object in a camera coordinate system; determining, based on the target region in which the target object is located, first two-dimensional position information in the first image of each of multiple key points of the target object, and a relative depth of each of the key points with respect to the reference node of the target object; and determining, based on the first two-dimensional position information and the relative depths respectively corresponding to the multiple key points of the target object, and on the absolute depth corresponding to the reference node, three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

2. The image processing method according to claim 1, further comprising: obtaining a pose of the target object based on the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

3. The image processing method according to claim 1 or 2, wherein identifying the target region in which the target object in the first image is located comprises: performing feature extraction on the first image to obtain a feature map of the first image; determining, based on the feature map, multiple target bounding boxes from multiple pre-generated candidate bounding boxes; and determining, based on the multiple target bounding boxes, the target region in which the target object is located.

4. The image processing method according to claim 3, wherein determining the target region in which the target object is located based on the multiple target bounding boxes comprises: determining, based on the multiple target bounding boxes and the feature map, a feature sub-map of each of the target bounding boxes; performing bounding box regression processing on the feature sub-maps respectively corresponding to the multiple target bounding boxes to obtain the target region in which the target object is located; and obtaining, based on the normalized absolute depth and the parameter matrix of the camera, the absolute depth of the reference node of the target object in the camera coordinate system.
5. The image processing method according to claim 1, wherein performing depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object comprises: determining an initial depth image based on the first image, wherein a pixel value of any first pixel point in the initial depth image is an initial depth value, in the camera coordinate system, of a second pixel point in the first image whose position corresponds to the first pixel point; determining, based on the target feature map corresponding to the target object, second two-dimensional position information in the first image of the reference node corresponding to the target object; determining, based on the second two-dimensional position information and the initial depth image, an initial depth value of the reference node corresponding to the target object; and determining, based on the initial depth value of the reference node and the target feature map, the normalized absolute depth of the reference node of the target object.

6. The image processing method according to claim 5, wherein determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map comprises: performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenating the feature vector and the initial depth value to obtain a concatenated vector, and performing at least one stage of second convolution processing on the concatenated vector to obtain a correction value of the initial depth value; and obtaining the normalized absolute depth based on the correction value of the initial depth value and the initial depth value.

7. The image processing method according to claim 1, wherein the parameter matrix comprises a focal length of the camera; and obtaining, based on the normalized absolute depth and the parameter matrix of the camera, the absolute depth of the reference node of the target object in the camera coordinate system comprises: obtaining, based on the normalized absolute depth, the focal length, an area of the target region, and an area of the target bounding box, the absolute depth of the reference node of the target object in the camera coordinate system.
8. The image processing method according to claim 1 or 2, wherein the image processing method is applied in a pre-trained neural network, the neural network comprising three branch networks: a target detection network, a key point detection network and a depth prediction network; the target detection network is used to obtain the target region in which the target object is located; the key point detection network is used to obtain the first two-dimensional position information in the first image of the multiple key points of the target object and the relative depth of each of the key points with respect to the reference node of the target object; and the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

9. A computer device, comprising: a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor; when the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method according to any one of claims 1 to 8.

10. A computer-readable storage medium on which a computer program is stored, the computer program, when run by a processor, executing the steps of the image processing method according to any one of claims 1 to 8.
TW110115664A 2020-05-13 2021-04-29 Image processing method, electronic device and computer-readable storage media TWI777538B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010403620.5 2020-05-13
CN202010403620.5A CN111582207B (en) 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
TW202143100A TW202143100A (en) 2021-11-16
TWI777538B true TWI777538B (en) 2022-09-11

Family

ID=72110786

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110115664A TWI777538B (en) 2020-05-13 2021-04-29 Image processing method, electronic device and computer-readable storage media

Country Status (3)

Country Link
CN (1) CN111582207B (en)
TW (1) TWI777538B (en)
WO (1) WO2021227694A1 (en)


Also Published As

Publication number Publication date
WO2021227694A1 (en) 2021-11-18
CN111582207A (en) 2020-08-25
CN111582207B (en) 2023-08-15
TW202143100A (en) 2021-11-16


Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent