TWI789267B - Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet - Google Patents


Info

Publication number
TWI789267B
Authority
TW
Taiwan
Prior art keywords
dimensional
point cloud
training
objects
image
Prior art date
Application number
TW111108860A
Other languages
Chinese (zh)
Other versions
TW202336702A (en)
Inventor
林春宏
余佳靜
林晏瑜
Original Assignee
國立臺中科技大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 國立臺中科技大學 filed Critical 國立臺中科技大學
Priority to TW111108860A priority Critical patent/TWI789267B/en
Application granted granted Critical
Publication of TWI789267B publication Critical patent/TWI789267B/en
Publication of TW202336702A publication Critical patent/TW202336702A/en


Abstract

A method of using a two-dimensional image to automatically create the ground truth data required for training a three-dimensional PointNet includes the following steps: using an automated segmentation technique to separate the plurality of objects shown in the two-dimensional image; generating a large number of texture maps for the objects; identifying the texture maps with YOLOv3 (You Only Look Once v3) to produce a plurality of corresponding object labels; and combining the texture maps with a plurality of 3D triangular meshes to obtain the object label corresponding to each vertex of each 3D triangular mesh, thereby creating the ground truth data required for training the three-dimensional PointNet. Once the three-dimensional PointNet has been trained, the object labels of another 3D triangular mesh can be obtained by inference with the trained network.

Description

Method of using two-dimensional images to automatically generate the ground truth data required for training a three-dimensional point cloud learning network

The present invention relates to the training of a three-dimensional point cloud learning network (PointNet); in particular, it relates to a method of using two-dimensional images to automatically generate the ground truth data required for training such a network.

In recent technological developments in fields such as autonomous driving, virtual reality (VR), augmented reality (AR) and mixed reality (MR), three-dimensional simulation systems have become a major research trend in computer vision. To realize these applications, labeling the material of each object is an important step, but labeling object materials in a database of three-dimensional models is often limited by incomplete data, and it may even be necessary to rebuild the database. A technique that can automatically detect and classify the various objects inside a three-dimensional model would therefore be of considerable help in expanding the amount of data applicable to different scenarios.

The current practice is mainly to classify unknown objects with a three-dimensional point cloud learning network that has already been trained. However, training such a network requires a large amount of ground truth data, and collecting and labeling this data is both time-consuming and labor-intensive, which in practice limits the training and application of three-dimensional point cloud learning networks.

In view of this, the present invention provides a method of using two-dimensional images to automatically generate the ground truth data required for training a three-dimensional point cloud learning network; from a single two-dimensional image it can effectively and automatically generate a large amount of ground truth data for use in training the network.

The present invention provides a method of using a two-dimensional image to automatically generate the ground truth data required for training a three-dimensional point cloud learning network, comprising the following steps: A. using an automated segmentation technique to segment the plurality of objects shown in a two-dimensional image; B. generating a large number of texture maps for these objects; C. identifying the texture maps with the third version of the YOLO (You Only Look Once v3) neural network model to produce a plurality of corresponding object labels; and D. combining the texture maps with a plurality of three-dimensional triangular meshes to obtain the object label corresponding to each vertex of each mesh, thereby generating a plurality of ground truth records.

In one embodiment, the two-dimensional image is in color, and the automated segmentation technique of step A first converts the color two-dimensional image into a grayscale image, then performs binarization, and finally labels the objects, whereby the objects are detected and cut out.

In one embodiment, the two-dimensional image is in color, and the automated segmentation technique of step A first converts the color two-dimensional image into a grayscale image, adjusts the grayscale contrast by histogram equalization, then performs binarization, and removes noise from the objects by erosion and blurring, whereby the objects are detected and cut out.

In one embodiment, if the length or width of any segmented object exceeds 4000 pixels, the object is converted into a grayscale image again, binarized, and processed with an opening operation to remove noise, whereby the plurality of objects contained in that object are detected and cut out.

In one embodiment, the blurring uses a Gaussian blur, whose filter can be written as

G(u, v) = 1/(2πσ²) · exp(−(u² + v²)/(2σ²))

where σ is the standard deviation of the normal distribution and u, v are the blur radii in the two dimensions.

In one embodiment, the erosion performs a convolution of the black-and-white (binarized) image with a filter of fixed size.

In one embodiment, step B applies operations such as permuting the RGB channels, changing colors, and rotating by different angles to each of the objects segmented in step A to produce various variations, and also varies the background of the two-dimensional image; a large number of texture maps are then generated by randomly combining the objects with the two-dimensional image backgrounds.

In one embodiment, the RGB channel permutation changes the order of the RGB channels without changing the RGB values themselves.

In one embodiment, the three-dimensional point cloud model used by the point cloud learning network is stored in the file format developed by Wavefront Technologies for the animation tool Advanced Visualizer.

In one embodiment, before the three-dimensional point cloud learning network extracts point cloud features, the input is first multiplied by the transformation matrix of the neural network T-Net, where the transformation matrix is obtained, after processing by a convolutional layer and a fully connected layer, as a 3×3 or 64×64 matrix.

The method proposed by the present invention, which uses two-dimensional images to automatically generate the ground truth data required for training a three-dimensional point cloud learning network, can generate ground truth data in large quantities, which facilitates the training of the network and greatly reduces the associated cost.

S1, S2, S3, S4: steps

Fig. 1 is a flowchart of an embodiment of the present invention, illustrating the steps of a method of using two-dimensional images to automatically generate the ground truth data required for training a three-dimensional point cloud learning network; Figs. 2(a) to 2(h) show actual car texture maps of various styles; Fig. 3 is a flowchart of one of the automated segmentation techniques used by the present invention (the simple car part segmentation method); Fig. 4 is a flowchart of the other automated segmentation technique used by the present invention (the detailed car part segmentation method); Fig. 5 is a flowchart of the procedure used by the present invention to generate a large number of texture maps; Figs. 6(a) to 6(c) are schematic views of one type of texture map generated by the present invention; Figs. 7(a) to 7(c) are schematic views of another type of texture map generated by the present invention; Fig. 8 shows detection results for parts obtained by the present invention using YOLOv3; Fig. 9 is a schematic view of a point cloud model with category labels; Fig. 10 illustrates the data conversion procedure between the three-dimensional model and the two-dimensional texture map used by the present invention; Fig. 11 is an architecture diagram of the point cloud learning network of the present invention; and Fig. 12 shows the classification results of the present invention for car parts.

To explain the present invention more clearly, preferred embodiments are described in detail below together with the drawings. In the following description a three-dimensional car model is used as an example, but this is not a limitation of the present invention. The proposed method can in principle be applied to three-dimensional simulation systems of other objects and used to identify the categories of objects in those systems.

Before describing the technical content of the present invention, the basic concepts of three-dimensional models are briefly introduced here to aid understanding. Generally, the shape of a three-dimensional model approximates the surface or solid of an object with a triangular mesh; the smaller the facets of the mesh, the higher the resolution of the resulting surface and the more detail of the object's shape it can express. Each facet of a triangular mesh consists of three vertex coordinates and a unit normal vector, and to present color or material, a texture image, i.e. a two-dimensional planar image, is usually applied to the mesh. More specifically, the three vertex coordinates of each triangle correspond to UV coordinates on the texture, and combining the model with the texture gives the object model a more realistic appearance.

In short, the data stored by a three-dimensional model include all of its triangular meshes, the three-dimensional coordinates of each mesh vertex, the facet normal vectors, text describing the facets, the mapping to the two-dimensional texture (usually expressed as UV coordinates), and the connectivity between triangles. Many data formats are in common use for three-dimensional models; the present invention uses the point cloud format, which is easy to convert, requires little storage, and takes little processing time.

Please refer to Fig. 1, which is a flowchart of an embodiment of the present invention showing the steps of a method of using two-dimensional images to generate the ground truth data required for training a three-dimensional point cloud learning network. The first step S1 uses an automated segmentation technique to segment the plurality of objects shown in a two-dimensional image. In this example the two-dimensional image is a two-dimensional car texture map, and the objects are the individual car components or parts contained in the image (for brevity, referred to below as "parts"). In practice, the parts contained in a car texture map are packed tightly and irregularly, the way a map is laid out varies from creator to creator, the styles of the maps differ noticeably, and there is no common standard across maps. Some maps have parts that contrast strongly with the background and others where they are similar; the arrangement of the parts varies with their number, with tight packing or overlapping edges, and some maps have parts with indistinct outlines. Figs. 2(a) to 2(h) show real car texture maps of different styles.

Based on the different characteristics of the texture maps, the present invention proposes two different automated segmentation techniques. The first technique targets maps in which the parts differ strongly in color from the background and is referred to in this specification as the simple car part segmentation method; the second technique targets maps in which the parts differ little in color from the background, requires more careful processing, and is referred to as the detailed car part segmentation method. The two techniques are described separately below.

First, the simple car part segmentation method converts the color car texture map into a grayscale image, then binarizes it, and finally labels the connected components, whereby the parts are detected and cut out; the processing flow is shown in Fig. 3.
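By way of illustration only, a minimal sketch of this flow using OpenCV is shown below; the Otsu threshold and the minimum-area filter are assumptions chosen for the example, not values specified in this disclosure.

```python
import cv2

def simple_part_segmentation(texture_path, min_area=100):
    """Grayscale -> binarize -> label connected components -> crop each part."""
    image = cv2.imread(texture_path)                      # color texture map
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)        # convert to grayscale
    # Otsu's method picks the binarization threshold automatically (assumption).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Connected-component labeling detects the individual parts.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    parts = []
    for i in range(1, n_labels):                          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                              # drop tiny noise blobs
            parts.append(image[y:y + h, x:x + w].copy())  # cut the part out
    return parts
```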

As for the detailed car part segmentation method, referring to Fig. 4, the background color of the color texture map is first replaced and the map is converted into a grayscale image, and the grayscale contrast is then adjusted by histogram equalization. Histogram equalization spreads the gray levels of the image more evenly and increases contrast, which improves the separability of the parts from the background. Binarization is then performed, noise on the parts is removed by erosion and blurring, and finally the parts are labeled, detected, and cut out.

For the blurring mentioned here, the present invention uses a Gaussian blur, whose filter can be written as

G(u, v) = 1/(2πσ²) · exp(−(u² + v²)/(2σ²))

where σ is the standard deviation of the normal distribution and u, v are the blur radii in the two dimensions. The erosion mentioned here performs a convolution of the black-and-white (binarized) image with a filter of fixed size.
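A minimal sketch of the detailed flow is given below, again with OpenCV; the kernel sizes, the value of σ, and the thresholds are assumptions chosen only to illustrate the sequence of operations.

```python
import cv2
import numpy as np

def detailed_part_segmentation(texture_path):
    """Histogram equalization -> binarize -> erosion + Gaussian blur -> label parts."""
    image = cv2.imread(texture_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    equalized = cv2.equalizeHist(gray)                     # spread gray levels evenly
    _, binary = cv2.threshold(equalized, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)                     # fixed-size erosion filter
    eroded = cv2.erode(binary, kernel, iterations=1)       # erosion removes thin noise
    blurred = cv2.GaussianBlur(eroded, (5, 5), sigmaX=2)   # Gaussian blur (sigma assumed)
    _, clean = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(clean)
    boxes = [tuple(stats[i][:4]) for i in range(1, n_labels)]  # (x, y, w, h) per part
    return [image[y:y + h, x:x + w].copy() for x, y, w, h in boxes]
```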

It is worth mentioning that, to handle overlapping parts that would otherwise be cut out as one large object containing two or more parts, the detailed car part segmentation method also collects size statistics over all parts. If the length or width of a segmented part exceeds 4000 pixels, that part image is converted into a grayscale image again, binarized, processed with an opening operation to remove the noise between parts, and then segmented again, so that the individual parts can be cut out.
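The oversized-part handling could look like the sketch below; this is a hedged illustration and the opening kernel size is assumed.

```python
import cv2
import numpy as np

def resegment_if_oversized(part_image, max_side=4000):
    """Re-segment a cropped region that likely contains several overlapping parts."""
    h, w = part_image.shape[:2]
    if max(h, w) <= max_side:
        return [part_image]                                # already a single part
    gray = cv2.cvtColor(part_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((7, 7), np.uint8)                     # assumed opening kernel
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)  # erode then dilate
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(opened)
    return [part_image[y:y + hh, x:x + ww].copy()
            for x, y, ww, hh, _ in (stats[i] for i in range(1, n_labels))]
```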

Referring again to Fig. 1, after the objects (i.e. the car parts) have been cut out of the two-dimensional image, step S2 generates a large number of texture maps for them. In short, the present invention applies operations such as permuting the RGB channels, changing colors, and rotating by different angles to the segmented objects to produce various variations. In addition, the map background is also varied, and the parts and backgrounds are then combined at random to generate a large number of texture maps.

More specifically, parts of many colors are obtained by changing the order of the RGB channels. The channels are permuted without changing the RGB values themselves, giving six possible orders (RGB, RBG, GRB, GBR, BGR, BRG) and therefore six differently colored texture map images. Objects are rotated in steps of 90 degrees, with each object's rotation ranging from 0 to 270 degrees. In this example, white, gray and black are used as the background colors of the new texture maps.
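The following sketch illustrates these augmentations with NumPy; it is an interpretation of the description rather than code from this disclosure.

```python
import itertools
import numpy as np

CHANNEL_ORDERS = list(itertools.permutations(range(3)))   # 6 orders: RGB, RBG, GRB, GBR, BGR, BRG
ROTATIONS = [0, 1, 2, 3]                                   # multiples of 90 degrees
BACKGROUNDS = {"white": 255, "gray": 128, "black": 0}

def augment_part(part):
    """Yield every channel-permuted and rotated variant of one cropped part."""
    for order in CHANNEL_ORDERS:
        recolored = part[:, :, list(order)]                # permute channels; values unchanged
        for k in ROTATIONS:
            yield np.rot90(recolored, k)                   # rotate by k * 90 degrees
```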

In addition, the present invention may also adjust the shapes of the objects where appropriate, making changes while keeping the appearance similar and consistent with the characteristics of the object type, for example adding an extra small object to one object, or extracting part of the features of another object and splitting it into separate objects, thereby producing a larger number of objects.

Because the parts in a texture map are arranged irregularly, their angles and orientations are not fixed, the number of part categories varies, and the background colors of the maps are quite diverse, the present invention generates a large number of car texture maps following the design and layout of the original texture maps; the procedure is shown in Fig. 5. Specifically, parts are first drawn at random from all categories, 8 to 10 parts each time. The rotated parts are then arranged with non-uniform spacing, and finally the parts of each category are placed at random to generate a large number of car texture maps.
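A minimal composition sketch follows; the canvas size is an assumption used only to show the random selection and placement steps, and the sketch assumes every part fits within the canvas.

```python
import random
import numpy as np

def compose_texture_map(parts, canvas_size=(2048, 2048), background=255):
    """Paste 8-10 randomly chosen (already augmented) parts onto a plain background."""
    canvas = np.full((*canvas_size, 3), background, dtype=np.uint8)
    for part in random.sample(parts, k=random.randint(8, 10)):
        h, w = part.shape[:2]
        y = random.randint(0, canvas_size[0] - h)          # random, non-uniform placement
        x = random.randint(0, canvas_size[1] - w)
        canvas[y:y + h, x:x + w] = part
    return canvas
```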

The present invention takes the number of part categories, the part type, and the RGB channel order as the parameters for generating texture maps, and generates map training sets in four different ways, referred to as the "Type-1", "Type-2", "Type-3" and "Type-4" training sets; their combinations are shown in Table (1). The number of part categories in the table is the number of part categories appearing in the generated map, either "all categories" or "single category only". The part type is either "original parts" or "newly generated parts". The RGB channel column distinguishes "grayscale object images" from "all permutations of the color channels". Type-1 is a new texture map set composed of all part categories, original parts and the six color-permuted images. Type-2 is composed of all part categories, newly generated parts and grayscale images. Type-3 is composed of all part categories, newly generated parts and the six color-permuted images. Type-4 is composed of single part categories, newly generated parts and the six color-permuted images.

Table (1) Generation settings of the four texture map training sets

  Training set   Part categories    Part type                RGB channels
  Type-1         all categories     original parts           6 channel permutations
  Type-2         all categories     newly generated parts    grayscale
  Type-3         all categories     newly generated parts    6 channel permutations
  Type-4         single category    newly generated parts    6 channel permutations

Referring again to Fig. 1, step S3 then identifies the texture maps with the third version of the YOLO (You Only Look Once v3) neural network model and produces the corresponding plurality of object labels. More specifically, YOLOv3 is an object detection algorithm based on a convolutional neural network; besides using Darknet-53 to extract features, it also introduces a feature pyramid network (FPN) to improve the detection of small objects. In the present invention, the Darknet-53 convolutional architecture consists of 53 convolutional layers, 5 residual stages, a global average pooling layer, and one fully connected layer, and the stride of its convolutional layers is 2; the feature pyramid network embodies the concept of multi-scale detection, and its upsampling layers use nearest-neighbor interpolation. The YOLOv3 architecture contains 75 convolutional layers, 4 route layers, 2 upsampling layers and 23 shortcut layers. The detection results are shown in Fig. 8: the position and category of each detected part are marked as a mask over the region of the object on the texture map, and a category label is written in that region, serving as the part's category information.
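For illustration, a trained YOLOv3 model could be run over a generated texture map with OpenCV's DNN module as sketched below; the file names, input size and confidence threshold are assumptions, not values taken from this disclosure.

```python
import cv2
import numpy as np

def detect_parts(texture, cfg="yolov3-parts.cfg", weights="yolov3-parts.weights",
                 conf_thresh=0.5):
    """Run a Darknet YOLOv3 model and return (class_id, confidence, box) per detection."""
    net = cv2.dnn.readNetFromDarknet(cfg, weights)
    blob = cv2.dnn.blobFromImage(texture, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    h, w = texture.shape[:2]
    detections = []
    for output in outputs:
        for row in output:                                 # row = [cx, cy, bw, bh, obj, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_thresh:
                cx, cy, bw, bh = row[:4] * np.array([w, h, w, h])
                box = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
                detections.append((class_id, confidence, box))
    return detections
```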

Next, step S4 in Fig. 1 combines the texture maps with a plurality of three-dimensional triangular meshes, thereby obtaining the object label corresponding to each vertex of each mesh and generating the ground truth data. As mentioned above, the three-dimensional data format used by the present invention is the point cloud, in which the surface of a three-dimensional object is composed of a large number of point coordinates; it is usually obtained by scanning with cameras, lasers and similar instruments to acquire a surface mesh structure and then keeping only the vertex data of the mesh. The resulting set of points is called a point cloud. A point cloud can be converted into other data formats to reduce the total amount of stored data and the computer processing time. Therefore, the present invention converts the three-dimensional model into point cloud data and uses the point cloud learning network (PointNet) architecture as the architecture for classifying the parts on the point cloud of the three-dimensional model.

After classifying and labeling the parts on the two-dimensional car texture map, the present invention uses the (U, V) coordinate values of the vertices of the three-dimensional model to look up the part label at the corresponding coordinates on the two-dimensional texture map, so that the points of the three-dimensional point cloud model carry part category labels, as shown in Fig. 9, in which points of different colors represent different part categories. Therefore, to achieve automated part classification of three-dimensional models, the present invention feeds the three-dimensional models that already carry part category labels into the three-dimensional neural network architecture for training, so that the network learns the features of the different part categories and can later predict the part classification of new three-dimensional point cloud models. The three-dimensional point cloud model is composed of many triangular mesh facets, each with three vertices. In the present invention the part labels on the point cloud are obtained from the point cloud learning network, and the part labels used to train this model are obtained from the two-dimensional texture maps.
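A hedged sketch of the UV-based label transfer is shown below; it assumes the YOLOv3 masks have already been rasterized into a per-pixel label image with the same resolution as the texture map.

```python
import numpy as np

def labels_from_uv(uv_coords, label_image):
    """Transfer per-pixel part labels from a 2D texture to 3D vertices via UV coordinates.

    uv_coords:   (N, 2) array of per-vertex (u, v) values in [0, 1]
    label_image: (H, W) array of integer part labels rasterized from the YOLOv3 masks
    """
    h, w = label_image.shape
    cols = np.clip((uv_coords[:, 0] * (w - 1)).astype(int), 0, w - 1)          # u -> column
    rows = np.clip(((1.0 - uv_coords[:, 1]) * (h - 1)).astype(int), 0, h - 1)  # v is bottom-up in .obj
    return label_image[rows, cols]                          # one label per vertex
```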

The three-dimensional point cloud model used by the present invention is stored in the file format developed by Wavefront Technologies for the animation tool Advanced Visualizer. It is a geometry file format used by many three-dimensional graphics applications and has the extension "obj". This format stores simple data describing three-dimensional geometry; each file contains records such as "points" (vertices), "texture map" coordinates, "normal vectors", and "faces". The present invention only needs the "point" and "texture map" records: the "point" records store the vertex coordinates, and the "texture map" records store the corresponding texture coordinates.

First, the vertex coordinates are read from the three-dimensional point cloud model (the file with extension "obj") and stored in a plain text file (with extension "pts"). The corresponding texture map (the file with extension "mtl") is then opened, the part label value at each position is looked up, and the per-vertex "part label values" are stored in a plain text file (with extension "seg"); the data conversion procedure is shown in Fig. 10. Finally, the "pts" and "seg" data are converted into the three-dimensional form used as training data for the point cloud learning network (PointNet) model, i.e. the required ground truth data.
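A minimal sketch of this conversion is given below; it parses only the "v" and "vt" records of an .obj file, reuses labels_from_uv from the earlier sketch, and makes the simplifying (assumed) pairing of the i-th texture coordinate with the i-th vertex.

```python
import numpy as np

def obj_to_pts_and_seg(obj_path, label_image, pts_path="model.pts", seg_path="model.seg"):
    """Write vertex coordinates to a .pts file and per-vertex part labels to a .seg file."""
    vertices, uvs = [], []
    with open(obj_path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0] == "v":                     # vertex record: v x y z
                vertices.append([float(t) for t in tokens[1:4]])
            elif tokens[0] == "vt":                  # texture record: vt u v
                uvs.append([float(t) for t in tokens[1:3]])
    vertices = np.array(vertices)
    # Simplifying assumption: the i-th 'vt' record belongs to the i-th 'v' record.
    labels = labels_from_uv(np.array(uvs[:len(vertices)]), label_image)
    np.savetxt(pts_path, vertices, fmt="%.6f")       # one "x y z" line per point
    np.savetxt(seg_path, labels, fmt="%d")           # one label per point
    return pts_path, seg_path
```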

After the ground truth data have been generated, they can be used to train a three-dimensional point cloud learning network. Because point clouds and meshes are not regular data formats, most researchers first convert such data into regular 3D voxel grids or collections of images before feeding them into a deep learning architecture. However, this change of representation hides the natural invariance properties of the data. The three-dimensional point cloud is one of the important types of geometric data structure, and its data have three characteristics: rotation invariance, inter-point relationships, and unordered structure. Existing point cloud models are task-specific and are designed and built manually. The points of a point cloud usually encode statistical properties, and this encoding is transformation invariant. Therefore, the present invention uses three-dimensional point clouds to represent the different input forms of three-dimensional geometry and uses the point cloud network as the deep learning architecture of the three-dimensional object classification model.

Before the point cloud learning network architecture extracts point cloud features, the point cloud data must first be multiplied by the transformation matrix produced by a small neural network, T-Net, which aligns the point cloud data and preserves the rotation-invariance of the point cloud. The T-Net transformation matrix is obtained, after processing by a convolutional layer and a fully connected layer, as a 3×3 or 64×64 matrix.
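A compact T-Net sketch in PyTorch is given below, following the usual PointNet formulation; the layer widths are the commonly used ones and are assumptions, not values taken from this disclosure.

```python
import torch
import torch.nn as nn

class TNet(nn.Module):
    """Predicts a k x k alignment matrix from a (B, k, N) point feature tensor."""
    def __init__(self, k=3):
        super().__init__()
        self.k = k
        self.conv = nn.Sequential(                      # per-point (1x1) convolutions
            nn.Conv1d(k, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.fc = nn.Sequential(                        # fully connected regression head
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, k * k))

    def forward(self, x):                               # x: (B, k, N)
        feat = torch.max(self.conv(x), dim=2).values    # symmetric max pooling over points
        mat = self.fc(feat).view(-1, self.k, self.k)
        identity = torch.eye(self.k, device=x.device).unsqueeze(0)
        return mat + identity                           # bias toward identity for stability

# Aligning a batch of clouds: points (B, N, 3) -> aligned (B, N, 3)
# aligned = torch.bmm(points, TNet(3)(points.transpose(1, 2)))
```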

A three-dimensional model is composed of points, and all points in space are related to one another. In the point cloud learning network architecture, semantic segmentation combines the local and global feature information of the classification branch, so that the spatial information between points is used effectively. "Unordered structure" means that feeding differently ordered point cloud data into the model architecture does not affect the result. Although the input point cloud is unordered, convolutional layers are order-dependent, so after convolution and feature extraction there is still an ordering relationship between the points of the cloud. To respect the unordered nature of point clouds, the point cloud learning network model extracts its features with a max pooling layer, so that the point cloud remains order-independent.

The architecture of the point cloud learning network is shown in Fig. 11. It is divided into a classification architecture and a segmentation architecture. The model first runs the classification architecture, taking n×3 as input, where n is the number of sampled points of the point cloud and 3 is the three-dimensional coordinate (x, y, z). First, the small neural network T-Net produces a 3×3 transformation matrix that transforms the input coordinates so that the points are aligned. Point cloud features are then extracted by a shared-weight multilayer perceptron, a 64×64 transformation matrix produced by T-Net aligns the point features, and another multilayer perceptron extracts further point features, so that each point obtains a 1024-dimensional feature vector; max pooling then yields the global feature, and finally a multilayer perceptron produces the category score of each point. After the classification architecture, the segmentation architecture concatenates the 64-dimensional local feature vectors obtained in the classification architecture with the global feature vector obtained by max pooling, and a shared-weight multilayer perceptron then produces the part category score of each point. The car part classification results are shown in Fig. 12.
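A hedged PyTorch sketch of this segmentation path follows, reusing the TNet module from the earlier sketch; it mirrors the dimensions named in the text (64-dimensional local features, 1024-dimensional global feature) but the remaining layer widths are assumptions.

```python
import torch
import torch.nn as nn

class PointNetSegmentation(nn.Module):
    """Per-point part classification: 64-dim local features concatenated with a 1024-dim global feature."""
    def __init__(self, num_classes):
        super().__init__()
        self.input_tnet = TNet(k=3)                  # TNet defined in the earlier sketch
        self.feature_tnet = TNet(k=64)
        self.mlp1 = nn.Sequential(nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
                                  nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        self.head = nn.Sequential(nn.Conv1d(1088, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
                                  nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
                                  nn.Conv1d(256, num_classes, 1))

    def forward(self, points):                       # points: (B, N, 3)
        x = points.transpose(1, 2)                   # -> (B, 3, N)
        x = torch.bmm(self.input_tnet(x), x)         # input alignment with 3x3 matrix
        local = self.mlp1(x)                         # (B, 64, N) local features
        local = torch.bmm(self.feature_tnet(local), local)   # feature alignment with 64x64 matrix
        feat = self.mlp2(local)                      # (B, 1024, N)
        global_feat = torch.max(feat, dim=2, keepdim=True).values     # (B, 1024, 1) global feature
        global_feat = global_feat.expand(-1, -1, local.shape[2])      # repeat per point
        combined = torch.cat([local, global_feat], dim=1)             # (B, 1088, N)
        return self.head(combined)                   # (B, num_classes, N) per-point scores
```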

In summary, because the present invention performs object segmentation and classification on two-dimensional images with traditional image processing techniques and a convolutional neural network, the cost of labeling object categories in a scene can be reduced. In addition, by varying features such as the shape, color, rotation, and background color of the objects and combining them at random to generate a large number of texture maps, the present invention can replace manual object category labeling of three-dimensional models. Furthermore, the present invention uses two-dimensional and three-dimensional deep learning networks to classify the objects of three-dimensional models, using the texture map classification results as an aid, so there is no need to inspect files and process classifications manually one by one, which reduces labor and time costs.

The above is only a preferred feasible embodiment of the present invention; all equivalent variations of the method made within the scope of the specification and claims of the present invention should be included within the patent scope of the present invention.


Claims (10)

1. A method of using two-dimensional images to automatically generate the ground truth data required for training a three-dimensional point cloud learning network (PointNet), executed by a computer, the method comprising the following steps: A. receiving a plurality of two-dimensional images, each of which contains a plurality of objects, and segmenting the objects shown in each two-dimensional image with an automated segmentation technique; B. applying operations such as permuting the RGB channels, changing colors, and rotating by different angles to the objects segmented in step A to produce various variations, varying the background of the two-dimensional image, and generating a large number of texture maps for the objects by randomly combining the objects with the two-dimensional images, the number of texture maps being greater than the number of two-dimensional images; C. identifying the texture maps with the third version of the YOLO (You Only Look Once v3) neural network model to produce a plurality of corresponding object labels; and D. combining the texture maps with a plurality of three-dimensional triangular meshes to obtain the object label corresponding to each vertex of each mesh, thereby generating a plurality of ground truth records for training the three-dimensional point cloud learning network.

2. The method of claim 1, wherein the two-dimensional image is in color, and the automated segmentation technique of step A first converts the color two-dimensional image into a grayscale image, then performs binarization, and finally labels the objects, whereby the objects are detected and cut out.

3. The method of claim 1, wherein the two-dimensional image is in color, and the automated segmentation technique of step A first converts the color two-dimensional image into a grayscale image, adjusts the grayscale contrast by histogram equalization, then performs binarization, and removes noise from the objects by erosion and blurring, whereby the objects are detected and cut out.

4. The method of claim 3, wherein, if the length or width of any segmented object exceeds 4000 pixels, the object is converted into a grayscale image again, binarized, and processed with an opening operation to remove noise, whereby the plurality of objects contained in that object are detected and cut out.

5. The method of claim 3, wherein the blurring uses a Gaussian blur whose filter can be written as

G(u, v) = 1/(2πσ²) · exp(−(u² + v²)/(2σ²))

where σ is the standard deviation of the normal distribution and u, v are the blur radii in the two dimensions.

6. The method of claim 3, wherein the erosion performs a convolution of the black-and-white image with a filter of fixed size.

7. The method of claim 1, wherein the RGB channel permutation changes the order of the RGB channels without changing the RGB values.

8. The method of claim 7, wherein the orders of the RGB channels include RGB, RBG, GRB, GBR, BGR and BRG.

9. The method of claim 1, wherein the three-dimensional point cloud model used by the three-dimensional point cloud learning network is stored in the file format developed by Wavefront Technologies for the animation tool Advanced Visualizer.

10. The method of claim 9, wherein, before the three-dimensional point cloud learning network extracts point cloud features, the input is first multiplied by the transformation matrix of the neural network T-Net, where the transformation matrix is obtained, after processing by a convolutional layer and a fully connected layer, as a 3×3 or 64×64 matrix.
TW111108860A 2022-03-10 2022-03-10 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet TWI789267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111108860A TWI789267B (en) 2022-03-10 2022-03-10 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW111108860A TWI789267B (en) 2022-03-10 2022-03-10 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet

Publications (2)

Publication Number Publication Date
TWI789267B true TWI789267B (en) 2023-01-01
TW202336702A TW202336702A (en) 2023-09-16

Family

ID=86670063

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111108860A TWI789267B (en) 2022-03-10 2022-03-10 Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet

Country Status (1)

Country Link
TW (1) TWI789267B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
CN110555412A (en) * 2019-09-05 2019-12-10 深圳龙岗智能视听研究院 End-to-end human body posture identification method based on combination of RGB and point cloud
CN111881790A (en) * 2020-07-14 2020-11-03 武汉中海庭数据技术有限公司 Automatic extraction method and device for road crosswalk in high-precision map making

Also Published As

Publication number Publication date
TW202336702A (en) 2023-09-16

Similar Documents

Publication Publication Date Title
CN113128405B (en) Plant identification and model construction method combining semantic segmentation and point cloud processing
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108389251B (en) Projection full convolution network three-dimensional model segmentation method based on fusion of multi-view features
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN106096542B (en) Image video scene recognition method based on distance prediction information
CN111126127B (en) High-resolution remote sensing image classification method guided by multi-level spatial context characteristics
CN112016546A (en) Text region positioning method and device
CN114092700B (en) Ancient character recognition method based on target detection and knowledge graph
CN109740539A (en) 3D object identification method based on transfinite learning machine and fusion convolutional network
CN110223310A (en) A kind of line-structured light center line and cabinet edge detection method based on deep learning
CN111652273A (en) Deep learning-based RGB-D image classification method
CN111161213B (en) Industrial product defect image classification method based on knowledge graph
CN113269224A (en) Scene image classification method, system and storage medium
CN111027538A (en) Container detection method based on instance segmentation model
Yao et al. Manga vectorization and manipulation with procedural simple screentone
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
TWI789267B (en) Method of using two-dimensional image to automatically create ground truth data required for training three-dimensional pointnet
CN115205626A (en) Data enhancement method applied to field of coating defect detection
CN115358981A (en) Glue defect determining method, device, equipment and storage medium
Jia et al. Character identification for integrated circuit components on printed circuit boards using deep learning
CN113011506A (en) Texture image classification method based on depth re-fractal spectrum network
Putri et al. Artistic Style Characterization of Vincent Van Gogh’s Paintings using Extracted Features from Visible Brush Strokes
Morita et al. Inscription Segmentation Using Synthetic Inscription Images for Text Detection at Stone Monuments