TWI807904B - Method for training depth identification model, method for identifying depth of images and related devices - Google Patents


Info

Publication number: TWI807904B
Application number: TW111124989A
Authority: TW (Taiwan)
Prior art keywords: image, dynamic, target, depth, pose
Other languages: Chinese (zh)
Other versions: TW202403662A (en)
Inventors: 李潔, 郭錦斌
Original Assignee: 鴻海精密工業股份有限公司 (Hon Hai Precision Industry Co., Ltd.)
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 鴻海精密工業股份有限公司; priority to TW111124989A
Application granted; publication of TWI807904B and TW202403662A

Abstract

The present application relates to image analysis technology and provides a method for training a depth identification model, a method for identifying the depth of images, and related devices. The method includes: performing instance segmentation on a first image and a second image to obtain a plurality of static objects and a plurality of dynamic objects, together with the dynamic position of each dynamic object; selecting a plurality of target dynamic objects and a plurality of characteristic dynamic objects from the dynamic objects according to their pixel counts and a preset position; generating a target image and a target projection image according to the dynamic pose matrices of the target dynamic objects and their corresponding characteristic dynamic objects, the static pose matrix of the static objects, the dynamic positions, a preset threshold matrix, and a preset initial projection image; and generating a depth identification model according to an initial depth image, the target image, the target projection image, and a preset depth recognition network. An image to be identified is then input into the depth identification model to obtain its identification result.

Description

Depth recognition model training method, image depth recognition method and related equipment

The invention relates to the field of image processing, and in particular to a depth recognition model training method, an image depth recognition method, and related equipment.

In current schemes for depth recognition of vehicle-mounted images, training images are used to train a depth network. However, because the training images contain dynamic objects, the trained depth recognition model cannot accurately identify the depth information of vehicle-mounted images. This makes it difficult to determine the true distance between the vehicle and the objects or obstacles in the surrounding environment, which in turn affects driving safety.

In view of the above, it is necessary to provide a depth recognition model training method, an image depth recognition method, and related equipment that solve the technical problem of inaccurate recognition of the depth information of vehicle-mounted images.

The present application provides a depth recognition model training method. The method includes: acquiring a first image and a second image; performing instance segmentation on the first image based on an instance segmentation network to obtain a first static object corresponding to the first image, a plurality of first dynamic objects, and a first dynamic position of each first dynamic object, and performing instance segmentation on the second image based on the instance segmentation network to obtain a second static object and a plurality of second dynamic objects corresponding to the second image; selecting a plurality of target dynamic objects from the plurality of first dynamic objects based on the pixel count of each first dynamic object and a preset position, and selecting a plurality of characteristic dynamic objects from the plurality of second dynamic objects based on the pixel count of each second dynamic object and the preset position; identifying whether each target dynamic object has a corresponding characteristic dynamic object, and determining target dynamic objects and characteristic dynamic objects that correspond to each other as recognition objects; identifying the object state of the target dynamic object in each recognition object according to the dynamic pose matrix corresponding to the recognition object, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix; generating a target image according to the object state, the first dynamic position, and the first image, and generating a target projection image according to the object state, the first dynamic position, and an initial projection image corresponding to the first image; and adjusting an acquired depth recognition network based on the gradient error between an initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image, to obtain a depth recognition model.

According to an optional embodiment of the present application, the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer, and performing instance segmentation on the first image based on the instance segmentation network to obtain the first static object, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object includes: normalizing the first image to obtain a normalized image; performing feature extraction on the normalized image based on the feature extraction layer to obtain an initial feature map; segmenting the normalized image based on the multiple relationship between the size of the initial feature map and the size of the normalized image and the convolution stride of the feature extraction layer, to obtain a rectangular region corresponding to each pixel of the initial feature map; classifying the initial feature map based on the classification layer to obtain, for each pixel of the initial feature map, the predicted probability that the pixel belongs to a first preset category; determining the pixels of the initial feature map whose predicted probabilities exceed a preset threshold as target pixels, and determining the rectangular regions corresponding to those target pixels as feature regions; mapping each feature region into the initial feature map based on the mapping layer to obtain the mapping region corresponding to each feature region; dividing the mapping regions based on a preset number to obtain multiple divided regions for each mapping region; determining the center pixel of each divided region and computing the pixel value at that center point; pooling the pixel values of the center points to obtain the mapping probability value of each mapping region; restoring the mapping regions and stitching the restored mapping regions together to obtain a target feature map; and generating, from the target feature map, the mapping probability values, the restored mapping regions, and a second preset category, the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object.

According to an optional embodiment of the present application, generating the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object from the target feature map, the mapping probability values, the restored mapping regions, and the second preset category includes: classifying each pixel of the target feature map according to the mapping probability values and the second preset category to obtain the pixel category of each pixel in the restored mapping regions; determining each region formed by pixels of the same pixel category in the restored mapping regions as a first object; acquiring the pixel coordinates of all pixels of the first object and determining those pixel coordinates as the first position corresponding to the first object; and dividing the first objects into the plurality of first dynamic objects and the first static object according to a preset rule, determining the first position corresponding to each first dynamic object as its first dynamic position.

According to an optional embodiment of the present application, selecting the plurality of target dynamic objects from the plurality of first dynamic objects based on the pixel count of each first dynamic object and the preset position includes: counting the number of pixels contained in each first dynamic object; sorting the plurality of first dynamic objects by that pixel count; and selecting the first dynamic objects whose sorted positions fall at the preset position as the plurality of target dynamic objects.

According to an optional embodiment of the present application, identifying whether each target dynamic object has a corresponding characteristic dynamic object includes: acquiring multiple pieces of target element information for each target dynamic object, and acquiring the characteristic element information corresponding to each piece of target element information from the characteristic dynamic objects of the same category; matching each piece of target element information against the corresponding characteristic element information to obtain a matching value between the target dynamic object and the characteristic dynamic object of the same category; and, if the matching value falls within a preset interval, determining that the target dynamic object has a corresponding characteristic dynamic object.

According to an optional embodiment of the present application, identifying the object state of the target dynamic object in the recognition object according to the dynamic pose matrix corresponding to the recognition object, the static pose matrix corresponding to the first static object and the second static object, and the preset threshold matrix includes: subtracting each matrix element of the dynamic pose matrix corresponding to the recognition object from the corresponding matrix element of the static pose matrix to obtain pose difference values; taking the absolute value of each pose difference value to obtain the pose absolute values; arranging the pose absolute values according to the element positions in the static pose matrix to obtain a pose absolute value matrix; and comparing each pose absolute value in the pose absolute value matrix with the corresponding pose threshold in the preset threshold matrix. If at least one pose absolute value in the pose absolute value matrix is greater than its corresponding pose threshold, the object state of the target dynamic object in the recognition object is determined to be moving; or, if all pose absolute values are less than or equal to their corresponding thresholds, the object state of the target dynamic object in the recognition object is determined to be static.

According to an optional embodiment of the present application, generating the target image according to the object state, the first dynamic position, and the first image includes: if the object state of any target dynamic object among the recognition objects is moving, masking that target dynamic object in the first image based on its first dynamic position to obtain the target image; or, if the object states of all target dynamic objects among the recognition objects are static, determining the first image as the target image.

According to an optional embodiment of the present application, adjusting the depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image to obtain the depth recognition model includes: calculating a depth loss value of the depth recognition network based on the gradient error and the photometric error, and adjusting the depth recognition network based on the depth loss value until the depth loss value drops to its minimum, to obtain the depth recognition model.

The present application provides an image depth recognition method. The image depth recognition method includes: acquiring an image to be recognized, and inputting the image to be recognized into a depth recognition model to obtain a target depth image of the image to be recognized and the depth information of the image to be recognized, the depth recognition model being obtained by executing the depth recognition model training method described above.

The present application provides a computer device. The computer device includes: a storage storing at least one instruction; and a processor executing the at least one instruction to implement the depth recognition model training method or the image depth recognition method described above.

The present application provides a computer-readable storage medium. The computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor of a computer device to implement the depth recognition model training method or the image depth recognition method described above.

It can be seen from the above technical solution that the present application performs instance segmentation on the first image to obtain the first static object, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object, and selects a plurality of target dynamic objects from the first dynamic objects based on the pixel count of each first dynamic object and a preset position. Because this reduces the number of first dynamic objects to process, the training speed of the depth recognition network is improved. Identifying whether each target dynamic object has a corresponding characteristic dynamic object selects, from the second image, the characteristic dynamic object that is the same object as each target dynamic object. By computing the dynamic pose matrix of each target dynamic object and its matching characteristic dynamic object and comparing it against the preset threshold matrix, it can be determined whether the state of each target dynamic object in the first image is moving. A target image is then generated according to the states of the target dynamic objects in the recognition objects, the first dynamic positions, and the first image: the moving target dynamic objects are filtered out of the first image based on their first dynamic positions. Because the position change of a moving target dynamic object changes the depth values of its corresponding pixels in the initial depth image, filtering out the moving target dynamic objects in the target image means those depth values are not used when computing the loss value, which avoids their influence on the loss. At the same time, the target image retains the static target dynamic objects, so more image information of the first image is preserved. Consequently, the depth recognition model trained with the target image avoids the influence of moving target dynamic objects on training accuracy, and the recognition accuracy of the depth recognition model is thereby improved.

1: Computer device

2: Photographing device

12: Storage

13: Processor

101~107: Steps

108~109: Steps

FIG. 1 is a diagram of the application environment provided by an embodiment of the present application.

FIG. 2 is a flowchart of the depth recognition model training method provided by an embodiment of the present application.

FIG. 3 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided by an embodiment of the present application.

FIG. 4 is a flowchart of the image depth recognition method provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of the computer device provided by an embodiment of the present application.

In order to make the purpose, technical solution, and advantages of the present application clearer, the present application is described in detail below in conjunction with the accompanying drawings and specific embodiments.

FIG. 1 is a diagram of the application environment provided by an embodiment of the present application. The depth recognition model training method and the image depth recognition method can be applied to one or more computer devices 1. The computer device 1 communicates with a photographing device 2, which may be a monocular camera or any other device capable of capturing images.

The computer device 1 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The computer device 1 may be any electronic product capable of human-machine interaction with a user, for example a personal computer, a tablet computer, a smartphone, a Personal Digital Assistant (PDA), a game console, an Internet Protocol Television (IPTV), or a wearable smart device.

The computer device 1 may also include network equipment and/or user equipment. The network equipment includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.

The network where the computer device 1 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.

FIG. 2 is a flowchart of the depth recognition model training method provided by an embodiment of the present application. According to different requirements, the order of the steps in the flowchart can be adjusted according to actual detection requirements, and some steps can be omitted. The method is executed by a computer device, for example the computer device 1 shown in FIG. 1.

Step 101: Acquire a first image and a second image.

In at least one embodiment of the present application, the first image and the second image are RGB (Red Green Blue) images of adjacent frames, and the second image is generated later than the first image. The first image and the second image may contain initial objects such as vehicles, the ground, pedestrians, the sky, and trees, and the two images contain the same initial objects.

In at least one embodiment of the present application, the computer device acquires the two images as follows: the computer device controls the photographing device to photograph a target scene to obtain the first image, and photographs the target scene again after a preset time interval to obtain the second image. The photographing device may be a monocular camera, and the target scene may include target objects such as vehicles, the ground, and pedestrians. It should be understood that the preset time interval is very short; for example, it may be 10 ms.

Step 102: Perform instance segmentation on the first image based on an instance segmentation network to obtain the first static object corresponding to the first image, a plurality of first dynamic objects, and the first dynamic position of each first dynamic object; and perform instance segmentation on the second image based on the instance segmentation network to obtain the second static object and a plurality of second dynamic objects corresponding to the second image.

In at least one embodiment of the present application, the first dynamic objects and the second dynamic objects are objects that can move, for example pedestrians and vehicles; the first static object and the second static object are objects that cannot move, for example trees or the ground.

In at least one embodiment of the present application, the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer, and the computer device performs instance segmentation on the first image as follows. The computer device normalizes the first image to obtain a normalized image. It then performs feature extraction on the normalized image based on the feature extraction layer to obtain an initial feature map, and segments the normalized image based on the multiple relationship between the size of the initial feature map and the size of the normalized image and the convolution stride of the feature extraction layer, obtaining a rectangular region corresponding to each pixel of the initial feature map. The computer device classifies the initial feature map based on the classification layer, obtaining for each pixel of the initial feature map the predicted probability that the pixel belongs to a first preset category; it determines the pixels whose predicted probabilities exceed a preset threshold as target pixels, and determines the rectangular regions corresponding to those target pixels as feature regions. Based on the mapping layer, it maps each feature region into the initial feature map, obtaining the mapping region corresponding to each feature region; it divides the mapping regions based on a preset number, obtaining multiple divided regions for each mapping region; it determines the center pixel of each divided region and computes the pixel value at that center point; and it pools the pixel values of those center points to obtain the mapping probability value of each mapping region. The computer device then restores the mapping regions and stitches the restored mapping regions together to obtain a target feature map, and finally generates, from the target feature map, the mapping probability values, the restored mapping regions, and a second preset category, the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object.

The normalization includes cropping, and the normalized image is usually square. The feature extraction layer includes convolution layers, batch normalization layers, pooling layers, and the like; for example, it may be a VGG network with its fully connected layers removed. The pixel value at each center point is computed by bilinear interpolation, which is prior art and is not repeated here. The mapping layer may be an ROI Align layer. The first preset category can be customized; for example, it may be foreground or background. The classification layer may be a fully connected layer followed by a softmax layer. The preset threshold and the preset number can be set freely; this application does not limit them. The second preset category can be set according to the target objects appearing in the target scene, which this application does not limit; for example, the second preset category may include, but is not limited to, cars, buses, roads, pedestrians, street lamps, sky, and buildings.
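Since the value at each center point is obtained by bilinear interpolation, a minimal sketch of that step follows; it assumes a single-channel feature map, and the function name is illustrative:

```python
import numpy as np

def bilinear_sample(feature_map: np.ndarray, x: float, y: float) -> float:
    """Sample a single-channel feature map at a fractional (x, y) position.

    This is the bilinear-interpolation step used to compute the pixel value
    at the center point of a divided region, which generally falls between
    integer pixel positions.
    """
    h, w = feature_map.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return float((1 - dy) * top + dy * bottom)
```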

In this embodiment, the instance segmentation network also includes a fully convolutional neural network, and the mapping regions are restored based on that fully convolutional neural network.

Specifically, segmenting the normalized image based on the multiple relationship between the size of the initial feature map and the size of the normalized image and the convolution stride of the feature extraction layer includes: the computer device uses the product of the multiple relationship and the convolution stride as the width and height with which the normalized image is divided, obtaining a rectangular region corresponding to each pixel of the initial feature map.

For example, if the size of the normalized image is 800*800, the size of the initial feature map is 32*32, and the convolution stride is 4, then the multiple relationship between the size 32*32 of the initial feature map and the size 800*800 of the normalized image is 25, and the product of that multiple relationship and the convolution stride is 100. The computer device therefore divides the normalized image into an 8*8 grid of 64 rectangular regions, each of size 100*100.
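A few lines reproduce the arithmetic of this example (the sizes are the ones quoted above):

```python
image_side = 800     # side length of the standardized image
feature_side = 32    # side length of the initial feature map
stride = 4           # convolution stride of the feature extraction layer

multiple = image_side // feature_side    # 25
region_side = multiple * stride          # 100: width and height of each rectangle
per_side = image_side // region_side     # 8 rectangles per side
print(region_side, per_side, per_side ** 2)  # 100 8 64
```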

Specifically, the preset number includes a first preset number and a second preset number, and dividing the mapping regions based on the preset number includes: the computer device divides each mapping region based on the first preset number to obtain multiple intermediate regions for each mapping region, and then divides each intermediate region based on the second preset number to obtain the multiple divided regions of each mapping region. The first preset number and the second preset number can be set freely; this application does not limit them. For example, the first preset number may be 7*7 and the second preset number may be 2*2. For example, when the size of a mapping region is 14*14, the mapping region is evenly divided into 7*7 intermediate regions, each of size 2*2, and each intermediate region is then evenly divided into 2*2 divided regions, each of size 1*1.
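A minimal sketch of this divide-and-pool step under the example sizes above; uniformly placed sample centers and average pooling are assumptions, since the text only says the center-point values are pooled:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def pool_mapping_region(feature_map, y0, x0, y1, x1, bins=7, samples=2):
    """Average-pool one mapping region into a bins x bins grid (ROI Align style).

    Each bin is split into samples x samples divided regions; the value at the
    center of each divided region is obtained by bilinear interpolation
    (order=1), and the sampled values of each bin are averaged.
    """
    n = bins * samples
    ys = y0 + (np.arange(n) + 0.5) * (y1 - y0) / n   # centers of divided regions
    xs = x0 + (np.arange(n) + 0.5) * (x1 - x0) / n
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    vals = map_coordinates(feature_map, [gy.ravel(), gx.ravel()], order=1)
    vals = vals.reshape(bins, samples, bins, samples)  # regroup samples per bin
    return vals.mean(axis=(1, 3))                      # bins x bins pooled values
```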

In this embodiment, the instance segmentation network also outputs the position of the first static object, the position of the second static object, the category of each target dynamic object, the category of the first static object, the category of the second static object, and the category of each characteristic dynamic object. Through the above embodiment, segmenting the first image and the second image based on the instance segmentation network makes it possible to distinguish each initial object in the two images by its position, so that each initial object can be processed based on that position.

Specifically, generating the first static object, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object from the target feature map, the mapping probability values, the restored mapping regions, and the second preset category includes: the computer device classifies each pixel of the target feature map according to the mapping probability values and the second preset category, obtaining the pixel category of each pixel in the restored mapping regions; it determines each region formed by pixels of the same pixel category in the restored mapping regions as a first object; it acquires the pixel coordinates of all pixels of the first object and determines those pixel coordinates as the first position corresponding to the first object; and it divides the first objects into the plurality of first dynamic objects and the first static object according to a preset rule, determining the first position of each first dynamic object as its first dynamic position.

The preset rule determines initial objects that can move, such as means of transportation, people, or animals, as the plurality of first dynamic objects, and determines initial objects that cannot move, such as plants and fixed objects, as the first static object. For example, movable pedestrians, cats, dogs, bicycles, and cars are determined as the first dynamic objects, while immovable initial objects such as trees, street lamps, and buildings are determined as the first static object. A sketch of this rule follows.
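The sketch below assumes each segmented object carries a category name and a pixel list; the category sets are illustrative, not the patent's actual label vocabulary:

```python
# Illustrative label set of movable categories.
MOVABLE = {"pedestrian", "cat", "dog", "bicycle", "car"}

def split_objects(objects):
    """Split first objects into dynamic and static ones by their pixel category.

    objects: list of dicts like {"category": str, "pixels": [(u, v), ...]}.
    Returns the dynamic objects, the static objects, and the first dynamic
    position (the pixel coordinates) of each dynamic object.
    """
    dynamic = [o for o in objects if o["category"] in MOVABLE]
    static = [o for o in objects if o["category"] not in MOVABLE]
    positions = [o["pixels"] for o in dynamic]
    return dynamic, static, positions
```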

In this embodiment, the second dynamic objects are divided in essentially the same way as the first dynamic objects, and the second static object is divided in essentially the same way as the first static object, so this application does not repeat the details here.

Step 103: Select a plurality of target dynamic objects from the plurality of first dynamic objects based on the pixel count of each first dynamic object and a preset position, and select a plurality of characteristic dynamic objects from the plurality of second dynamic objects based on the pixel count of each second dynamic object and the preset position.

In at least one embodiment of the present application, selecting the target dynamic objects includes: the computer device counts the number of pixels contained in each first dynamic object and sorts the first dynamic objects by that pixel count; it then selects the first dynamic objects whose sorted positions fall at the preset position as the target dynamic objects. The preset position can be set freely; for example, it may be the first five.
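A minimal sketch under the "first five" example, assuming the preset position denotes the top-k ranks after sorting by pixel count:

```python
import numpy as np

def select_target_objects(masks, k=5):
    """Return the indices of the k dynamic objects with the most pixels.

    masks: list of boolean arrays, one instance mask per first dynamic object.
    """
    counts = [int(m.sum()) for m in masks]       # pixel count per object
    order = np.argsort(counts)[::-1]             # largest pixel count first
    return [int(i) for i in order[:k]]           # indices of the target objects
```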

In this embodiment, the characteristic dynamic objects are selected in essentially the same way as the target dynamic objects, so this application does not repeat the details here. In at least one embodiment of the present application, the second static image is generated in essentially the same way as the first static image, and the second dynamic image is generated in essentially the same way as the first dynamic image, so this application does not repeat the details here.

Through the above embodiment, selecting the target dynamic objects and the characteristic dynamic objects based on pixel counts and the preset position reduces the number of first dynamic objects to process, and the training speed of the depth recognition network is therefore improved.

Step 104: Identify whether each target dynamic object has a corresponding characteristic dynamic object, and determine target dynamic objects and characteristic dynamic objects that correspond to each other as recognition objects.

In at least one embodiment of the present application, identifying whether each target dynamic object has a corresponding characteristic dynamic object includes: the computer device acquires multiple pieces of target element information for each target dynamic object and acquires the characteristic element information corresponding to each piece of target element information from the characteristic dynamic objects of the same category; it matches each piece of target element information against the corresponding characteristic element information, obtaining a matching value between the target dynamic object and the characteristic dynamic object of the same category; and, if the matching value falls within a preset interval, the computer device determines that the target dynamic object has a corresponding characteristic dynamic object.

The multiple pieces of target element information and the characteristic element information corresponding to each piece can be acquired based on a target tracking algorithm, which is prior art and is not repeated here. The preset interval can be set freely; this application does not limit it.

In this embodiment, the target element information may be parameters of the features of the target dynamic object, and the characteristic element information may be parameters of the features of the characteristic dynamic object of the same category. For example, when the target dynamic object is a car, the target element information may be the car's size, texture, position, outline, and so on. Because the parameters of each piece of target element information and its corresponding characteristic element information differ, the matching processing differs too; it may involve subtraction, addition, weighting, and other operations. For example, suppose the target dynamic object in the first image and the characteristic dynamic object in the second image are both cars: the car in the first image is 4.8 meters long and 1.65 meters wide, and the car in the second image is 4.7 meters long and 1.6 meters wide. Subtracting the length of 4.7 meters from the length of 4.8 meters gives a first matching value of 0.1 meters, and the widths correspondingly give a second matching value of 0.05 meters. If the first preset interval corresponding to the first matching value is [0, 0.12] and the second preset interval corresponding to the second matching value is [0, 0.07], then, since the first matching value lies within the first preset interval and the second matching value lies within the second preset interval, the car in the second image and the car in the first image are the same car.
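A sketch of the per-element matching against preset intervals, using the worked car numbers above; plain subtraction is assumed for every element, although the text also allows addition and weighting:

```python
def match_objects(target_info, feature_info, intervals):
    """Return True if every element's matching value lies in its preset interval.

    target_info / feature_info: dicts of element values for the two objects.
    intervals: dict mapping each element name to its (low, high) interval.
    """
    for key, (low, high) in intervals.items():
        match_value = abs(target_info[key] - feature_info[key])
        if not (low <= match_value <= high):
            return False      # one element falls outside its preset interval
    return True               # same object in both frames

# Worked numbers from the text: |4.8 - 4.7| = 0.1 in [0, 0.12] and
# |1.65 - 1.6| = 0.05 in [0, 0.07], so the two cars match.
print(match_objects({"length": 4.8, "width": 1.65},
                    {"length": 4.7, "width": 1.6},
                    {"length": (0, 0.12), "width": (0, 0.07)}))   # True
```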

Through the above embodiment, acquiring multiple pieces of target element information for each target dynamic object and the corresponding characteristic element information from the characteristic dynamic objects of the same category makes it faster to identify that a characteristic dynamic object and a target dynamic object are the same object. Selecting multiple pieces of target element information and matching each piece against the corresponding characteristic element information extracts the features of the target dynamic object and the same-category characteristic dynamic object more comprehensively, which eliminates reasonable errors and improves matching accuracy.

Step 105: Identify the object state of the target dynamic object in each recognition object according to the dynamic pose matrix corresponding to the recognition object, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix.

In at least one embodiment of the present application, the dynamic pose matrix is the transformation from the camera coordinates of the pixels corresponding to the recognition object to world coordinates, where the camera coordinate of a pixel is its coordinate in the camera coordinate system; the static pose matrix is the transformation from the camera coordinates corresponding to the first static object and the second static object to world coordinates.

FIG. 3 is a schematic diagram of the pixel coordinate system and the camera coordinate system provided by an embodiment of the present application. The computer device constructs the pixel coordinate system with the pixel O uv in the first row and first column of the first image as the origin, the horizontal line through the first row of pixels as the u axis, and the vertical line through the first column of pixels as the v axis. In addition, the computer device constructs the camera coordinate system with the optical center O XY of the monocular camera as the origin, the optical axis of the monocular camera as the Z axis, the line parallel to the u axis of the pixel coordinate system as the X axis, and the line parallel to the v axis of the pixel coordinate system as the Y axis.

In at least one embodiment of the present application, identifying the object state of the target dynamic object in the recognition object includes: the computer device subtracts each matrix element of the dynamic pose matrix corresponding to the recognition object from the corresponding matrix element of the static pose matrix, obtaining pose difference values; it takes the absolute value of each pose difference value, obtaining the pose absolute values; it arranges the pose absolute values according to the element positions in the static pose matrix, obtaining a pose absolute value matrix; and it compares each pose absolute value in that matrix with the corresponding pose threshold in the preset threshold matrix. If at least one pose absolute value is greater than its corresponding pose threshold, the computer device determines that the object state of the target dynamic object in the recognition object is moving; or, if all pose absolute values are less than or equal to their corresponding thresholds, the computer device determines that the object state of the target dynamic object in the recognition object is static.
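The moving/static decision reduces to an element-wise comparison, as in this sketch (all three matrices assumed 4x4):

```python
import numpy as np

def object_state(static_pose: np.ndarray, dynamic_pose: np.ndarray,
                 threshold: np.ndarray) -> str:
    """Decide whether the target dynamic object of a recognition object moves.

    The element-wise absolute difference between the static and dynamic pose
    matrices (the pose absolute value matrix) is compared against the preset
    threshold matrix; one exceeding element is enough to call it moving.
    """
    abs_diff = np.abs(static_pose - dynamic_pose)    # pose absolute value matrix
    return "moving" if np.any(abs_diff > threshold) else "static"
```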

Specifically, the dynamic pose matrix is generated as follows: the computer device determines the pixels corresponding to the target dynamic object of the recognition object in the first image as first pixels, and determines the pixels corresponding to the characteristic dynamic object of the recognition object in the second image as second pixels. The computer device acquires the first homogeneous coordinate matrix of the first pixel, the second homogeneous coordinate matrix of the second pixel, and the inverse of the intrinsic matrix of the photographing device that captured the first image and the second image. It computes the first camera coordinate of the first pixel from the first homogeneous coordinate matrix and the inverse of the intrinsic matrix, and the second camera coordinate of the second pixel from the second homogeneous coordinate matrix and the inverse of the intrinsic matrix. It then computes a rotation matrix and a translation matrix from the first camera coordinate and the second camera coordinate based on a preset epipolar constraint, and splices the rotation matrix and the translation matrix together to obtain the dynamic pose matrix.

The first homogeneous coordinate matrix of the first pixel is a matrix with one more dimension than the pixel coordinate matrix, the extra element having the value 1. The pixel coordinate matrix is generated from the first pixel coordinate of the first pixel, i.e. its coordinate in the pixel coordinate system. For example, if the first pixel coordinate of the first pixel in the pixel coordinate system is $(u,v)$, the pixel coordinate matrix of the first pixel is $\begin{bmatrix}u\\v\end{bmatrix}$, and its homogeneous coordinate matrix is $\begin{bmatrix}u\\v\\1\end{bmatrix}$. Multiplying the first homogeneous coordinate matrix by the inverse of the intrinsic matrix yields the first camera coordinate of the first pixel, and multiplying the second homogeneous coordinate matrix by the inverse of the intrinsic matrix yields the second camera coordinate of the second pixel.
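A one-function sketch of this pixel-to-camera-coordinate conversion; the function name is illustrative:

```python
import numpy as np

def to_camera_coords(u: float, v: float, K: np.ndarray) -> np.ndarray:
    """Multiply the homogeneous coordinate matrix [u, v, 1]^T by the inverse
    of the 3x3 intrinsic matrix K to obtain the camera coordinate."""
    p = np.array([u, v, 1.0])        # homogeneous coordinate matrix
    return np.linalg.inv(K) @ p      # K^-1 * p
```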

The second homogeneous coordinate matrix is generated in essentially the same way as the first homogeneous coordinate matrix, so this application does not repeat the details here.

The dynamic pose matrix can be written in block form as:

$pose = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$

where pose is the dynamic pose matrix, a 4x4 matrix; R is the rotation matrix, a 3x3 matrix; and t is the translation matrix, a 3x1 matrix.

The translation matrix and the rotation matrix are computed from the epipolar constraint: $(K^{-1}p_{1})^{T}(t^{\wedge}R)(K^{-1}p_{2})=0$, where $K^{-1}p_{1}$ is the first camera coordinate, $K^{-1}p_{2}$ is the second camera coordinate, $p_{1}$ is the first homogeneous coordinate matrix, $p_{2}$ is the second homogeneous coordinate matrix, $K^{-1}$ is the inverse of the intrinsic matrix, and $t^{\wedge}$ denotes the skew-symmetric matrix of the translation vector t.
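One practical way to solve this constraint for R and t is OpenCV's essential-matrix routines, as sketched below; the patent does not prescribe a particular solver, and a monocular translation is recovered only up to scale:

```python
import cv2
import numpy as np

def estimate_pose(pts1: np.ndarray, pts2: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Recover R and t from matched pixels of the two frames and splice them
    into a 4x4 pose matrix.

    pts1, pts2: (N, 2) float arrays of matched pixel coordinates, N >= 5.
    """
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    pose = np.eye(4)
    pose[:3, :3] = R          # 3x3 rotation matrix
    pose[:3, 3] = t.ravel()   # 3x1 translation, up to scale for a monocular camera
    return pose
```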

In this embodiment, the static pose matrix is generated in essentially the same way as the dynamic pose matrix, so this application does not repeat the details here. Through the above embodiment, when there are multiple recognition objects there are correspondingly multiple dynamic pose matrices. Since each dynamic pose matrix corresponds to one target dynamic object in the first image, the object state of the corresponding target dynamic object can be determined from each dynamic pose matrix, so the object states of multiple target dynamic objects can be distinguished from one another.

Step 106: Generate a target image according to the object state, the first dynamic position, and the first image, and generate a target projection image according to the object state, the first dynamic position, and the initial projection image corresponding to the first image.

In at least one embodiment of the present application, the target image is the image generated after the target dynamic objects in the first image are processed based on the first dynamic positions and the object states. In at least one embodiment of the present application, the initial projection image represents the image of a transformation process, namely the transformation between the pixel coordinates of the pixels in the first image and the corresponding pixel coordinates in the second image. In at least one embodiment of the present application, generating the target image according to the object state, the first dynamic position, and the first image includes: if the object state of any target dynamic object among the recognition objects is moving, the computer device masks that target dynamic object in the first image based on its first dynamic position to obtain the target image; or, if the object states of all target dynamic objects among the recognition objects are static, the computer device determines the first image as the target image.

Specifically, the initial projection image is generated as follows: the computer device acquires the initial depth image of the first image, acquires the target homogeneous coordinate matrix of each pixel of the first image, and obtains the depth value of each pixel of the first image from the initial depth image. The computer device then computes the projection coordinate of each pixel of the first image based on the target pose matrix, the target homogeneous coordinate matrix of each pixel, and the depth value of each pixel, and arranges the pixels according to their projection coordinates to obtain the initial projection image. The computer device obtains the initial depth image by inputting the first image into the depth recognition network, and the depth value is the pixel value of each pixel of the initial depth image.

Specifically, the projection coordinate of each pixel of the initial projection image is computed as: $P = K \cdot pose \cdot Z \cdot K^{-1} \cdot H$, where P denotes the projection coordinate of each pixel, K denotes the intrinsic matrix of the photographing device, pose denotes the target pose matrix, $K^{-1}$ denotes the inverse of K, H denotes the target homogeneous coordinate matrix of each pixel of the first image, and Z denotes the depth value of the corresponding pixel in the initial depth image.
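A vectorized sketch of this projection for a whole image; chaining the 3x3 intrinsics with the 4x4 pose matrix requires some homogeneous bookkeeping, and the lifting used here is one plausible reading of the formula:

```python
import numpy as np

def project_pixels(depth: np.ndarray, K: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Compute P = K * pose * Z * K^-1 * H for every pixel of the first image.

    depth: (h, w) initial depth image; K: 3x3 intrinsics; pose: 4x4 pose matrix.
    Returns an (h, w, 2) array of projected (u', v') coordinates.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    H = np.stack([us.ravel(), vs.ravel(), np.ones(h * w)])   # 3 x N homogeneous coords
    cam = depth.ravel() * (np.linalg.inv(K) @ H)             # Z * K^-1 * H
    cam_h = np.vstack([cam, np.ones(h * w)])                 # lift to 4 x N
    proj = K @ (pose @ cam_h)[:3]                            # K * pose * ...
    proj = proj[:2] / proj[2:3]                              # perspective division
    return proj.T.reshape(h, w, 2)
```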

In this embodiment, the target projection image includes a plurality of projection objects corresponding to the plurality of target dynamic objects, and the target projection image is generated from the plurality of projection objects in substantially the same way as the target image, so this application does not repeat it here. Through the above embodiment, when the object state of a target dynamic object among the identified objects is moving, that target dynamic object can be accurately masked in the first image according to its first dynamic position, which avoids the influence of moving dynamic objects on the calculated loss value; when the object state of a target dynamic object among the identified objects is static, that target dynamic object is retained in the first image, which preserves more image information of the first image.

Step 107: Adjust the obtained depth recognition network based on the gradient error between the initial depth image and the target image and the photometric error between the target projection image and the target image, to obtain the depth recognition model.

In at least one embodiment of the present application, the depth recognition model refers to the model generated after adjusting the depth recognition network. In at least one embodiment of the present application, the computer device adjusting the depth recognition network based on the gradient error between the initial depth image and the target image and the photometric error between the target projection image and the target image to obtain the depth recognition model includes: the computer device calculates a depth loss value of the depth recognition network based on the gradient error and the photometric error; further, the computer device adjusts the depth recognition network based on the depth loss value until the depth loss value drops to its minimum, to obtain the depth recognition model.

The depth recognition network may be a deep neural network, and the depth recognition network may be obtained from a database on the Internet. Specifically, the depth loss value is calculated as: Lc = Lt + Ls, where Lc denotes the depth loss value, Lt denotes the photometric error, and Ls denotes the gradient error.

The photometric error is calculated as:

Lt = α · (1 − SSIM(x, y)) / 2 + (1 − α) · ‖xᵢ − yᵢ‖

where Lt denotes the photometric error; α is a preset balance parameter, generally taken as 0.85; SSIM(x, y) denotes the structural similarity index between the target projection image and the target image; ‖xᵢ − yᵢ‖ denotes the grayscale difference between the target projection image and the target image; xᵢ denotes the pixel value of the i-th pixel of the target projection image; and yᵢ denotes the pixel value of the pixel in the target image corresponding to the i-th pixel. The calculation of the structural similarity index is prior art and is not repeated in this application.
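A minimal sketch of this photometric error with the stated α = 0.85, assuming a mean-over-pixels reduction and a simplified global SSIM (this application defers the SSIM computation to prior art, so the implementation below is only illustrative):

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified global structural similarity index over two grayscale images.
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def photometric_error(projection, target, alpha=0.85):
    """Lt = alpha * (1 - SSIM(x, y)) / 2 + (1 - alpha) * |x_i - y_i|."""
    ssim_term = alpha * (1.0 - ssim(projection, target)) / 2.0
    gray_term = (1.0 - alpha) * np.abs(projection - target).mean()
    return ssim_term + gray_term
```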

The gradient error is calculated as:

Ls = Σ₍ᵤ,ᵥ₎ ( |∂ᵤD(u, v)| · e^(−|∂ᵤI(u, v)|) + |∂ᵥD(u, v)| · e^(−|∂ᵥI(u, v)|) )

where Ls denotes the gradient error, x denotes the initial depth image, y denotes the target image, D(u, v) denotes the pixel value of the initial depth image x at the pixel coordinates (u, v), and I(u, v) denotes the pixel value of the target image y at the pixel coordinates (u, v).
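A minimal sketch of this gradient error, using forward differences along each image axis (the edge-aware form reconstructed above and the sum reduction are assumptions based on the symbol definitions):

```python
import numpy as np

def gradient_error(depth, target):
    """Ls: depth gradients down-weighted where the target image has strong edges.

    depth:  H x W initial depth image D.
    target: H x W grayscale target image I.
    """
    # Forward differences along u (columns) and v (rows).
    d_du, d_dv = np.abs(np.diff(depth, axis=1)), np.abs(np.diff(depth, axis=0))
    i_du, i_dv = np.abs(np.diff(target, axis=1)), np.abs(np.diff(target, axis=0))

    # Weight each depth gradient by exp(-|image gradient|) and sum over pixels.
    return (d_du * np.exp(-i_du)).sum() + (d_dv * np.exp(-i_dv)).sum()
```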

Through the above implementation, since the influence of moving dynamic objects on the calculation of the loss value of the depth recognition network is avoided, the accuracy of the depth recognition model can be improved.
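Putting the pieces together per Lc = Lt + Ls, a minimal sketch of one depth-loss evaluation, reusing the photometric_error and gradient_error helpers sketched above (the optimizer loop that drives the loss to its minimum is omitted):

```python
def depth_loss(initial_depth, target_image, target_projection):
    """Lc = Lt + Ls: the value minimized when adjusting the depth recognition network."""
    lt = photometric_error(target_projection, target_image)  # photometric error Lt
    ls = gradient_error(initial_depth, target_image)         # gradient error Ls
    return lt + ls
```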

FIG. 4 is a flowchart of the image depth recognition method provided by an embodiment of the present application.

Step 108: Acquire an image to be recognized.

In at least one embodiment of the present application, the image to be recognized refers to an image whose depth information needs to be recognized. In at least one embodiment of the present application, the computer device obtains the image to be recognized from a preset database, which may be the KITTI database, the Cityscapes database, the vKITTI database, or the like.

Step 109: Input the image to be recognized into the depth recognition model to obtain a target depth image of the image to be recognized and the depth information of the image to be recognized, the depth recognition model being obtained by executing the depth recognition model training method described above.

In at least one embodiment of the present application, the target depth image refers to an image containing the depth information of each pixel in the image to be recognized; the depth information of each pixel in the image to be recognized refers to the distance between the object to be recognized corresponding to that pixel and the shooting device that captured the image to be recognized. In at least one embodiment of the present application, the target depth image is generated in substantially the same way as the initial depth image, so this application does not repeat it here. In at least one embodiment of the present application, the computer device takes the pixel value of each pixel in the target depth image as the depth information of the corresponding pixel in the image to be recognized.
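A minimal sketch of this inference step, assuming the trained depth recognition model is a callable that maps an image to its target depth image (the model interface is an assumption):

```python
import numpy as np

def identify_image_depth(image_to_recognize, depth_recognition_model):
    """Return the target depth image and the per-pixel depth information."""
    target_depth_image = depth_recognition_model(image_to_recognize)
    # The pixel value at each position of the target depth image is taken as
    # the depth information of the corresponding pixel in the input image.
    depth_information = np.asarray(target_depth_image, dtype=float)
    return target_depth_image, depth_information
```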

Through the above implementation, since the accuracy of the depth recognition model is improved, the accuracy of the depth recognition of the image to be recognized can also be improved.

It can be seen from the above technical solution that this application performs instance segmentation on the first image to obtain the first static object corresponding to the first image, a plurality of first dynamic objects, and the first dynamic position of each first dynamic object, and selects a plurality of target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and a preset position. Since this reduces the number of first dynamic objects, it improves the training speed of the depth recognition network. By identifying whether each target dynamic object has a corresponding characteristic dynamic object, the characteristic dynamic object in the second image that is identical to each target dynamic object can be selected. By calculating the dynamic pose matrix of each target dynamic object and its identical characteristic dynamic object, and comparing the dynamic pose matrix with the preset threshold matrix, it can be determined whether the state of each target dynamic object in the first image is moving. A target image is generated according to the state of the target dynamic objects among the identified objects, the first dynamic position, and the first image, so the target dynamic objects that move in the first image can be filtered out based on the first dynamic position. Since the position change of a moving target dynamic object causes the depth values of its corresponding pixels in the initial depth image to change, filtering moving target dynamic objects out of the target image means those depth values are not used when calculating the loss value, which avoids the influence of moving target dynamic objects on the calculated loss value. The target image retains the target dynamic objects whose state is static, preserving more image information of the first image. Therefore, the depth recognition model trained with the target image avoids the influence of moving target dynamic objects on training accuracy, thereby improving the recognition accuracy of the depth recognition model.

FIG. 5 is a schematic structural diagram of a computer device provided by an embodiment of the present application.

In one embodiment of the present application, the computer device 1 includes, but is not limited to, a storage 12, a processor 13, and a computer program stored in the storage 12 and executable on the processor 13, such as an image depth recognition program and a depth recognition model training program.

Those skilled in the art can understand that the schematic diagram is only an example of the computer device 1 and does not constitute a limitation on the computer device 1; the computer device 1 may include more or fewer components than shown, may combine certain components, or may use different components. For example, the computer device 1 may also include input/output devices, network access devices, buses, and the like. The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor 13 is the computing core and control center of the computer device 1, connecting the various parts of the entire computer device 1 through various interfaces and lines, and running the operating system of the computer device 1 and the various installed applications, program codes, and the like. For example, the processor 13 may obtain, through an interface, the first image captured by the shooting device 2.

The processor 13 runs the operating system of the computer device 1 and the various installed applications. The processor 13 executes the applications to implement the steps in each of the above embodiments of the depth recognition model training method and the image depth recognition method, such as the steps shown in FIG. 2 and FIG. 4. Exemplarily, the computer program may be divided into one or more modules/units, which are stored in the storage 12 and retrieved and executed by the processor 13 to complete this application. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, the instruction segments being used to describe the execution process of the computer program in the computer device 1.

The storage 12 may be used to store the computer programs and/or modules; the processor 13 implements the various functions of the computer device 1 by running or executing the computer programs and/or modules stored in the storage 12 and calling the data stored in the storage 12. The storage 12 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data created according to the use of the computer device. In addition, the storage 12 may include non-volatile storage, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.

The storage 12 may be an external storage and/or an internal storage of the computer device 1. Further, the storage 12 may be a storage in physical form, such as a memory stick, a TF card (Trans-flash Card), and the like. If the integrated modules/units of the computer device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the methods of the above embodiments, which may also be accomplished by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented.

The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer storage, and a read-only memory (ROM).

With reference to FIG. 2, the storage 12 in the computer device 1 stores a plurality of instructions to implement a depth recognition model training method, and the processor 13 can execute the plurality of instructions to implement: acquiring a first image and a second image; performing instance segmentation on the first image based on an instance segmentation network to obtain a first static object corresponding to the first image, a plurality of first dynamic objects, and a first dynamic position of each first dynamic object, and performing instance segmentation on the second image based on the instance segmentation network to obtain a second static object and a plurality of second dynamic objects corresponding to the second image; selecting a plurality of target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and a preset position, and selecting a plurality of characteristic dynamic objects from the plurality of second dynamic objects based on the number of pixels of each second dynamic object and the preset position; identifying whether each target dynamic object has a corresponding characteristic dynamic object, and determining target dynamic objects and characteristic dynamic objects that have a correspondence as identified objects; identifying the object state of the target dynamic objects among the identified objects according to the dynamic pose matrix corresponding to the identified objects, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix; generating a target image according to the object state, the first dynamic position, and the first image, and generating a target projection image according to the object state, the first dynamic position, and an initial projection image corresponding to the first image; and adjusting the obtained depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image, to obtain the depth recognition model.

With reference to FIG. 4, the storage 12 in the computer device 1 stores a plurality of instructions to implement an image depth recognition method, and the processor 13 can execute the plurality of instructions to implement: acquiring an image to be recognized, and inputting the image to be recognized into the depth recognition model to obtain the target depth image of the image to be recognized and the depth information of the image to be recognized. Specifically, for the specific implementation of the above instructions by the processor 13, reference may be made to the description of the relevant steps in the embodiments corresponding to FIG. 2 and FIG. 4, which is not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the modules is only a division by logical function, and there may be other division methods in actual implementation. The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in each embodiment of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of this application is defined by the appended claims rather than the above description, and all changes within the meaning and scope of the equivalents of the claims are therefore intended to be included in this application. Any reference sign in a claim shall not be construed as limiting the claim concerned.

Furthermore, it is obvious that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in this application may also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not imply any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

101~107: Steps

Claims (10)

1. A depth recognition model training method, applied to a computer device, the computer device communicating with a shooting device, wherein the depth recognition model training method comprises: acquiring a first image and a second image, the first image and the second image being images of adjacent frames, the second image having a generation time different from that of the first image; performing instance segmentation on the first image based on an instance segmentation network to obtain a first static object corresponding to the first image, a plurality of first dynamic objects, and a first dynamic position of each first dynamic object, and performing instance segmentation on the second image based on the instance segmentation network to obtain a second static object and a plurality of second dynamic objects corresponding to the second image; selecting a plurality of target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and a preset position, and selecting a plurality of characteristic dynamic objects from the plurality of second dynamic objects based on the number of pixels of each second dynamic object and the preset position; identifying whether each target dynamic object has a corresponding characteristic dynamic object, and determining target dynamic objects and characteristic dynamic objects that have a correspondence as identified objects; identifying the object state of the target dynamic objects among the identified objects according to the dynamic pose matrix corresponding to the identified objects, the static pose matrix corresponding to the first static object and the second static object, and a preset threshold matrix; generating a target image according to the object state, the first dynamic position, and the first image, and generating a target projection image according to the object state, the first dynamic position, and an initial projection image corresponding to the first image; and adjusting the obtained depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image, to obtain a depth recognition model, including: calculating a depth loss value of the depth recognition network based on the gradient error and the photometric error, and adjusting the depth recognition network based on the depth loss value until the depth loss value drops to a minimum, to obtain the depth recognition model.
2. The depth recognition model training method of claim 1, wherein the instance segmentation network includes a feature extraction layer, a classification layer, and a mapping layer, and performing instance segmentation on the first image based on the instance segmentation network to obtain the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object includes: performing standardization processing on the first image to obtain a standardized image; performing feature extraction on the standardized image based on the feature extraction layer to obtain an initial feature map; segmenting the standardized image based on the multiple relationship between the size of the initial feature map and the size of the standardized image and the convolution stride in the feature extraction layer, to obtain a rectangular region corresponding to each pixel in the initial feature map; classifying the initial feature map based on the classification layer to obtain a predicted probability that each pixel in the initial feature map belongs to a first preset category; determining the pixels in the initial feature map whose predicted probability is greater than a preset threshold as target pixels, and determining the plurality of rectangular regions corresponding to the plurality of target pixels as a plurality of feature regions; mapping each feature region into the initial feature map based on the mapping layer to obtain a mapping region corresponding to each feature region in the initial feature map; dividing the plurality of mapping regions based on a preset number to obtain a plurality of divided regions corresponding to each mapping region; determining a center pixel in each divided region and calculating the pixel value of the center pixel; pooling the plurality of pixel values corresponding to the plurality of center pixels to obtain a mapping probability value corresponding to each mapping region; restoring the plurality of mapping regions, and splicing the restored mapping regions to obtain a target feature map; and generating the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object according to the target feature map, the mapping probability value, the restored mapping regions, and a second preset category.
3. The depth recognition model training method of claim 2, wherein generating the first static object corresponding to the first image, the plurality of first dynamic objects, and the first dynamic position of each first dynamic object according to the target feature map, the mapping probability value, the restored mapping regions, and the second preset category includes: classifying each pixel of the target feature map according to the mapping probability value and the second preset category to obtain the pixel category of each pixel in the restored mapping regions; determining a region formed by a plurality of pixels corresponding to the same pixel category in the restored mapping regions as a first object; acquiring the pixel coordinates of all pixels in the first object, and determining the pixel coordinates as the first position corresponding to the first object; and dividing the plurality of first objects into the plurality of first dynamic objects and the first static object according to a preset rule, and determining the first position corresponding to each first dynamic object as the first dynamic position.

4. The depth recognition model training method of claim 1, wherein selecting the plurality of target dynamic objects from the plurality of first dynamic objects based on the number of pixels of each first dynamic object and the preset position includes: counting the number of pixels contained in each first dynamic object; sorting the plurality of first dynamic objects according to the number of pixels; and selecting the first dynamic objects whose sorted pixel count is at the preset position as the plurality of target dynamic objects.
5. The depth recognition model training method of claim 1, wherein identifying whether each target dynamic object has a corresponding characteristic dynamic object includes: acquiring a plurality of pieces of target element information of each target dynamic object, and acquiring the characteristic element information corresponding to each piece of target element information in the characteristic dynamic objects of the same category; matching each piece of target element information with the corresponding characteristic element information to obtain a matching value between the target dynamic object and the characteristic dynamic object of the same category; and if the matching value is within a preset interval, determining that the target dynamic object has a corresponding characteristic dynamic object.

6. The depth recognition model training method of claim 1, wherein identifying the object state of the target dynamic objects among the identified objects according to the dynamic pose matrix corresponding to the identified objects, the static pose matrix corresponding to the first static object and the second static object, and the preset threshold matrix includes: subtracting each matrix element in the static pose matrix from the corresponding matrix element in the dynamic pose matrix corresponding to the identified object to obtain pose differences; taking the absolute values of the pose differences to obtain the pose absolute values of the static pose matrix; arranging the pose absolute values according to the element position of each pose absolute value in the static pose matrix to obtain a pose absolute value matrix; and comparing each pose absolute value in the pose absolute value matrix with the corresponding pose threshold in the preset threshold matrix, and if at least one pose absolute value in the pose absolute value matrix is greater than the corresponding pose threshold, determining that the object state of the target dynamic object in the identified object is moving; or, if all pose absolute values in the pose absolute value matrix are less than or equal to the corresponding thresholds, determining that the object state of the target dynamic object in the identified object is static.

7. The depth recognition model training method of claim 1, wherein generating the target image according to the object state, the first dynamic position, and the first image includes: if the object state of any target dynamic object among the identified objects is moving, performing mask processing on that target dynamic object in the first image based on the first dynamic position of that target dynamic object, to obtain the target image; or, if the object states of all target dynamic objects among the identified objects are static, determining the first image as the target image.
8. The depth recognition model training method of claim 1, wherein adjusting the depth recognition network based on the gradient error between the initial depth image corresponding to the first image and the target image and the photometric error between the target projection image and the target image to obtain the depth recognition model includes: calculating a depth loss value of the depth recognition network based on the gradient error and the photometric error, including determining the sum of the gradient error and the photometric error as the depth loss value; and adjusting the depth recognition network based on the depth loss value until the depth loss value drops to a minimum, to obtain the depth recognition model.

9. An image depth recognition method, applied to a computer device, wherein the image depth recognition method comprises: acquiring an image to be recognized; and inputting the image to be recognized into a depth recognition model to obtain a target depth image of the image to be recognized and depth information of the image to be recognized, the depth recognition model being obtained by executing the depth recognition model training method of any one of claims 1 to 8.

10. A computer device, wherein the computer device comprises: a storage storing at least one instruction; and a processor executing the at least one instruction to implement the depth recognition model training method of any one of claims 1 to 8, or the image depth recognition method of claim 9.
TW111124989A 2022-07-04 2022-07-04 Method for training depth identification model, method for identifying depth of images and related devices TWI807904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111124989A TWI807904B (en) 2022-07-04 2022-07-04 Method for training depth identification model, method for identifying depth of images and related devices


Publications (2)

Publication Number Publication Date
TWI807904B true TWI807904B (en) 2023-07-01
TW202403662A TW202403662A (en) 2024-01-16

Family

ID=88149350

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111124989A TWI807904B (en) 2022-07-04 2022-07-04 Method for training depth identification model, method for identifying depth of images and related devices

Country Status (1)

Country Link
TW (1) TWI807904B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111383257A (en) * 2018-12-29 2020-07-07 顺丰科技有限公司 Method and device for determining loading and unloading rate of carriage
CN112614128A (en) * 2020-12-31 2021-04-06 山东大学齐鲁医院 System and method for assisting biopsy under endoscope based on machine learning
CN114255476A (en) * 2021-12-07 2022-03-29 中原动力智能机器人有限公司 Pedestrian identification method and device, intelligent robot and storage medium
CN114612738A (en) * 2022-02-16 2022-06-10 中国科学院生物物理研究所 Training method of cell electron microscope image segmentation model and organelle interaction analysis method


Also Published As

Publication number Publication date
TW202403662A (en) 2024-01-16
