TWI643137B - Object recognition method and object recognition system - Google Patents

Object recognition method and object recognition system

Info

Publication number
TWI643137B
Authority
TW
Taiwan
Prior art keywords
learning model
multimedia material
deep learning
multimedia
superimposed
Prior art date
Application number
TW106113453A
Other languages
Chinese (zh)
Other versions
TW201839665A (en)
Inventor
潘品睿
Original Assignee
潘品睿
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 潘品睿 filed Critical 潘品睿
Priority to TW106113453A priority Critical patent/TWI643137B/en
Publication of TW201839665A publication Critical patent/TW201839665A/en
Application granted granted Critical
Publication of TWI643137B publication Critical patent/TWI643137B/en


Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an object recognition method and an object recognition system. The method includes: obtaining multimedia data; inputting the multimedia data into a deep learning model to recognize an object in the multimedia data; and outputting, according to the recognized object in the multimedia data, output information corresponding to the object.

Description

Object recognition method and object recognition system

The invention relates to an object recognition method and an object recognition system that use a deep learning model for object recognition.

In general, augmented reality can be divided into two main stages: an object recognition stage, and a stage in which, according to the recognition result, the corresponding augmented reality content is superimposed on the image and displayed. The quality of the object recognition therefore has a large impact on the augmented reality experience.

In particular, the object recognition stage can itself be divided into two phases: a feature extraction phase and a classification phase. FIG. 1 is a schematic flow chart of object recognition. Referring to FIG. 1, a general object recognition method inputs an image 10 into an object recognition module 100. When the image 10 is input into the object recognition module 100, the feature extraction phase is first performed in step S101. In step S101, the object recognition module 100 performs feature extraction on the image 10 and, in step S102, generates feature vectors, in which each dimension represents some feature of the image 10. Then, in step S103, the object recognition module 100 inputs the feature vector obtained in step S102 into a classifier, which classifies the image according to this feature vector and thereby recognizes the target object in the image 10. In particular, the feature vector obtained in the feature extraction phase usually determines the quality of the object recognition result.
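
As a concrete illustration of this two-stage pipeline (not part of the patent itself), a minimal Python sketch follows; the feature extractor and classifier here are hypothetical stand-ins for steps S101 and S103.

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Steps S101/S102 (hypothetical stand-in): map an image to a feature vector.

    A coarse intensity histogram is used here purely for illustration; a real
    system would use corner statistics or, as this patent proposes, a CNN.
    """
    hist, _ = np.histogram(image, bins=16, range=(0, 255))
    return hist / hist.sum()  # normalized 16-dimensional feature vector

def classify(feature: np.ndarray, centroids: dict) -> str:
    """Step S103 (hypothetical stand-in): nearest-centroid classifier."""
    return min(centroids, key=lambda label: np.linalg.norm(feature - centroids[label]))

# Usage: in practice the centroids would be learned from labeled training images.
image = np.random.randint(0, 256, size=(300, 500), dtype=np.uint8)
centroids = {"bridge": np.full(16, 1 / 16), "castle": np.linspace(0.0, 0.125, 16)}
print(classify(extract_features(image), centroids))
```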

FIG. 2A is a schematic diagram of conventional object recognition. Referring to FIG. 2A, in conventional augmented reality, the feature extraction phase of object recognition (i.e., step S101 in FIG. 1) usually uses an algorithm based on color corners. Such an algorithm uses the concept of differentiation to find the points in the image where the color changes sharply, and uses these corner points to generate a feature vector for the image (such as the corner points 20 shown in FIG. 2A). Finally, according to this feature vector, the object is classified and its size and position are predicted.

However, an object recognition method using a corner detection algorithm has the following disadvantage: it cannot recognize the same object photographed from different angles. For example, FIG. 2B is a schematic diagram of images of the same object captured from different angles. Referring to FIG. 2B, although the images 22, 24, and 26 show the same object captured from different angles, the distribution of the corner points changes with the shooting angle, so the images may fail to be classified as the same object. For this reason, current image-recognition-based augmented reality can usually only recognize planar markers (e.g., two-dimensional barcodes or simple pictures), and has difficulty recognizing complex three-dimensional objects effectively.

In addition, an object recognition method using a corner detection algorithm has another disadvantage: it cannot recognize distinct objects that belong to the same type. For example, because a volcano at location A and a volcano at location B differ slightly in shape, the two cannot both be classified as volcanoes with the corresponding augmented reality content superimposed and displayed.

Furthermore, an object recognition method using a corner detection algorithm has the disadvantage that the probability of misrecognition grows as the number of targets to be recognized increases. In detail, as the number of recognition targets increases, the probability that two targets have similar corner-point distributions also increases. If only corner points are used as the feature vector, target A is easily mistaken for target B, which ultimately causes the wrong augmented reality content to be placed.

Finally, an object recognition method using a corner detection algorithm slows down considerably when recognizing high-resolution images. In detail, a high-resolution image contains more pixels to process and produces more corner points, which increases the time needed to match the augmented reality target against the captured image.

The invention provides an object recognition method and an object recognition system that use a deep learning model for object recognition. They can improve the accuracy and flexibility of recognizing objects in multimedia data, and can also be applied in augmented reality technology to provide a better user experience.

The invention provides an object recognition method. The method includes: obtaining multimedia data; inputting the multimedia data into a deep learning model to recognize an object in the multimedia data; and outputting, according to the recognized object in the multimedia data, output information corresponding to the object.

In an embodiment of the invention, the deep learning model includes a convolutional neural network.

In an embodiment of the invention, the convolutional neural network includes at least one convolution layer and at least one pooling layer, and the object recognition method further includes: extracting feature values of the multimedia data by means of the convolution layer and the pooling layer.

In an embodiment of the invention, the deep learning model further includes a fully connected layer or a machine learning algorithm, and the step of outputting the output information corresponding to the object includes: classifying the object according to the feature values by means of the fully connected layer or the machine learning algorithm, and obtaining object information corresponding to the object.

In an embodiment of the invention, the object information includes the center point of the object in the multimedia data, the length and the width of a bounding box used to frame the object, and the rotation angle of the object in the multimedia data.

In an embodiment of the invention, the object information includes the coordinates of a plurality of vertices of a bounding box used to frame the object.

In an embodiment of the invention, the object information includes the type of the object.

In an embodiment of the invention, the deep learning model includes a plurality of layers, and the object recognition method further includes: inputting a plurality of multimedia data to be trained, together with answers respectively corresponding to the multimedia data to be trained, into the deep learning model; and adjusting a plurality of weights of each layer in the deep learning model according to the multimedia data to be trained and the answers, so as to train the deep learning model.

In an embodiment of the invention, the deep learning model includes a loss layer during the training phase, and the step of adjusting the weights of each layer in the deep learning model according to the multimedia data to be trained and the answers to train the deep learning model includes: outputting, by the deep learning model according to the multimedia data to be trained, a plurality of output answers respectively corresponding to the multimedia data to be trained; and comparing, by the loss layer, the output answers with the answers corresponding to the multimedia data to be trained, and adjusting the weights of each layer in the deep learning model according to a loss function.

In an embodiment of the invention, before the step of obtaining the multimedia data, the object recognition method further includes: obtaining current geographic information; determining whether the current geographic information matches specific geographic information; and performing the step of obtaining the multimedia data when the current geographic information matches the specific geographic information.

In an embodiment of the invention, the step of outputting the output information corresponding to the object includes: obtaining, according to the object, a superimposed object corresponding to the object; superimposing the superimposed object on the multimedia data; and outputting the multimedia data on which the superimposed object has been superimposed.

In an embodiment of the invention, the step of superimposing the superimposed object on the multimedia data includes: rotating the superimposed object according to the rotation angle of the object in the multimedia data and superimposing the superimposed object on the multimedia data, wherein the rotation angle of the object in the multimedia data is recognized by the deep learning model.

In an embodiment of the invention, the step of superimposing the superimposed object on the multimedia data includes: superimposing the superimposed object at the position of the center point of the object in the multimedia data, wherein the center point of the object in the multimedia data is recognized by the deep learning model.

In an embodiment of the invention, the step of superimposing the superimposed object on the multimedia data includes: scaling the superimposed object according to the length and the width of the bounding box used to frame the object and superimposing the superimposed object on the multimedia data, wherein the length and the width of the bounding box of the object are recognized by the deep learning model.

In an embodiment of the invention, the multimedia data includes at least one of an image, a point cloud, voxels, and a mesh.

The invention further provides an object recognition system. The system includes an input device, a processor, and an output device. The input device obtains multimedia data. The processor inputs the multimedia data into a deep learning model to recognize an object in the multimedia data. The output device outputs, according to the recognized object in the multimedia data, output information corresponding to the object.

In an embodiment of the invention, the deep learning model includes a convolutional neural network.

In an embodiment of the invention, the convolutional neural network includes at least one convolution layer and at least one pooling layer, and the processor is further configured to extract feature values of the multimedia data by means of the convolution layer and the pooling layer.

In an embodiment of the invention, the deep learning model includes a fully connected layer or a machine learning algorithm, and in the operation of outputting the output information corresponding to the object, the processor is further configured to classify the object according to the feature values by means of the fully connected layer or the machine learning algorithm and to obtain object information corresponding to the object.

In an embodiment of the invention, the object information includes the center point of the object in the multimedia data, the length and the width of a bounding box used to frame the object, and the rotation angle of the object in the multimedia data.

In an embodiment of the invention, the object information includes the coordinates of a plurality of vertices of a bounding box used to frame the object.

In an embodiment of the invention, the object information includes the type of the object.

In an embodiment of the invention, the deep learning model includes a plurality of layers, the input device is further configured to input a plurality of multimedia data to be trained, together with answers respectively corresponding to the multimedia data to be trained, into the deep learning model, and the processor is further configured to adjust a plurality of weights of each layer in the deep learning model according to the multimedia data to be trained and the answers, so as to train the deep learning model.

In an embodiment of the invention, the deep learning model includes a loss layer. In the operation of adjusting the weights of each layer in the deep learning model according to the multimedia data to be trained and the answers to train the deep learning model, the processor is further configured to output, by means of the deep learning model and according to the multimedia data to be trained, a plurality of output answers respectively corresponding to the multimedia data to be trained, and the processor is further configured to compare, by means of the loss layer, the output answers with the answers corresponding to the multimedia data to be trained and to adjust the weights of each layer in the deep learning model according to a loss function.

In an embodiment of the invention, before the operation of obtaining the multimedia data, the processor is further configured to obtain current geographic information and to determine whether the current geographic information matches specific geographic information; the operation of obtaining the multimedia data is performed when the current geographic information matches the specific geographic information.

In an embodiment of the invention, the object recognition system further includes a storage device. In the operation of outputting the output information corresponding to the object, the processor is further configured to obtain, from the storage device according to the object, a superimposed object corresponding to the object and to superimpose the superimposed object on the multimedia data, and the output device is further configured to output the multimedia data on which the superimposed object has been superimposed.

In an embodiment of the invention, in the operation of superimposing the superimposed object on the multimedia data, the processor is further configured to rotate the superimposed object according to the rotation angle of the object in the multimedia data and to superimpose the superimposed object on the multimedia data, wherein the rotation angle of the object in the multimedia data is recognized by the deep learning model.

In an embodiment of the invention, in the operation of superimposing the superimposed object on the multimedia data, the processor is further configured to superimpose the superimposed object at the position of the center point of the object in the multimedia data, wherein the center point of the object in the multimedia data is recognized by the deep learning model.

In an embodiment of the invention, in the operation of superimposing the superimposed object on the multimedia data, the processor is further configured to scale the superimposed object according to the length and the width of the bounding box used to frame the object and to superimpose the superimposed object on the multimedia data, wherein the length and the width of the bounding box of the object are recognized by the deep learning model.

In an embodiment of the invention, the multimedia data includes at least one of an image, a point cloud, voxels, and a mesh.

Based on the above, the object recognition method and the object recognition system of the invention can improve the accuracy of recognizing objects in multimedia data, and can also be applied in augmented reality technology to provide a better user experience.

To make the above features and advantages of the invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 3 is a schematic diagram of an object recognition system according to an embodiment of the invention. Referring to FIG. 3, the object recognition system 300 includes a subsystem 302 and a subsystem 304. The subsystem 302 includes an input device 30, an output device 32, a processor 34a, and a storage device 36. The subsystem 304 includes a processor 34b.

The input device 30 is, for example, a camera using a charge-coupled device (CCD) lens or a complementary metal-oxide-semiconductor (CMOS) lens, a depth camera (e.g., a time-of-flight camera), or a stereo camera.

The output device 32 is, for example, a display device that provides a display function, such as a liquid crystal display (LCD), a light-emitting diode (LED) display, or a field emission display (FED).

The processor 34a and the processor 34b may each be a central processing unit (CPU), or another programmable general-purpose or special-purpose microprocessor, a digital signal processor (DSP), a programmable controller, an application-specific integrated circuit (ASIC), another similar device, or a combination of these devices.

The storage device 36 may be any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, a similar device, or a combination of these devices.

In this exemplary embodiment, the input device 30, the output device 32, and the storage device 36 are each connected to the processor 34a in a wired or wireless manner. In this exemplary embodiment, the subsystem 302 is, for example, a handheld electronic device, and the subsystem 304 is, for example, a server. The subsystem 302 can be connected to the subsystem 304 in a wired or wireless manner, and the subsystem 302 can hand part of its computation over to the processor 34b of the subsystem 304 to achieve cloud computing. It should be noted that the invention does not limit the actual configuration of the subsystems in the object recognition system 300. In an embodiment, the storage device 36 may also be located outside the subsystem 302.

It should be noted that the invention uses a deep learning model to recognize objects in multimedia data. The multimedia data is, for example, at least one of a two-dimensional image, a three-dimensional point cloud, voxels, and a mesh. In this exemplary embodiment, the multimedia data is a two-dimensional image.

In this exemplary embodiment, the deep learning model is implemented as a convolutional neural network (CNN). FIG. 4 is a schematic diagram of a convolutional neural network according to an embodiment of the invention. Referring to FIG. 4, in this exemplary embodiment, the convolutional neural network 400 is composed of at least one convolution layer 401, at least one pooling layer 402, and at least one fully connected layer 403. The front part of the convolutional neural network 400 usually consists of convolution layers 401 and pooling layers 402 connected in series, and usually serves as the feature extractor of the image, obtaining the feature values of the input multimedia data 40. The feature values may form a multi-dimensional array, which is generally regarded as the feature vector representing the multimedia data 40. It should be noted, however, that in another embodiment the convolution layers 401 and the pooling layers 402 may also be combined in a mixture of series and parallel connections; the invention does not limit the combination or arrangement of the convolution layers 401 and the pooling layers 402.

The rear part of the convolutional neural network 400 includes the fully connected layer 403, which classifies the objects in the multimedia data 40 according to the feature values produced by the convolution layers 401 and the pooling layers 402, and obtains object information corresponding to the recognized object.

In this exemplary embodiment, the object information includes a bounding box used to frame the recognized object, and further includes the center point of the recognized object in the multimedia data 40, the length and the width of the bounding box, and the rotation angle of the object in the multimedia data 40. In another exemplary embodiment, the object information includes the coordinates of a plurality of vertices of the bounding box in the multimedia data. In another exemplary embodiment, the object information further includes the type of the recognized object. In particular, the classification function of the fully connected layer 403 can also be replaced by a traditional machine learning algorithm. However, to obtain the object information described above (for example, the center coordinates, the length and width, and the rotation angle), a fully connected neural network is still required. The traditional machine learning methods mentioned above are, for example, support vector machines (SVM), Joint Bayesian, regression analysis, and so on. It should be noted that such traditional algorithms are often more effective at classifying objects than the fully connected layer 403. Therefore, if more accurate results are required, the fully connected layer can first be used to obtain the coordinates, size, and rotation angle of the object, and the input of the fully connected layer 403 can then be fed into a traditional algorithm for classification.
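
As a non-authoritative sketch of the architecture described above (the patent does not specify layer counts or sizes; all dimensions below are assumptions), a PyTorch version of a network whose fully connected head outputs both class scores and object information (center, width, height, rotation angle) might look like this:

```python
import torch
import torch.nn as nn

class RecognitionNet(nn.Module):
    """Sketch of network 400: conv/pool feature extractor plus fully connected heads.

    All layer sizes here are illustrative assumptions, not taken from the patent.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        # Front part: convolution layers 401 and pooling layers 402 in series.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # Rear part: fully connected layer 403 with two output heads.
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(32 * 6 * 6, 128), nn.ReLU())
        self.class_head = nn.Linear(128, num_classes)  # object type scores
        self.box_head = nn.Linear(128, 5)              # cx, cy, w, h, rotation angle

    def forward(self, x: torch.Tensor):
        h = self.fc(self.features(x))
        return self.class_head(h), self.box_head(h)

# Usage: one 224x224 RGB image in, class logits and object information out.
model = RecognitionNet(num_classes=10)
logits, box = model(torch.randn(1, 3, 224, 224))
probs = logits.softmax(dim=1)  # probability per object type
```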

It should be noted that each convolution layer 401 is composed of multiple sets of filters (kernels), and each filter is in turn composed of multiple sliding windows.

Specifically, FIG. 5 is a schematic diagram of a filter and its sliding windows according to an embodiment of the invention. Referring to FIG. 5, in this exemplary embodiment, the multimedia data 501 is first split into an image 501_R of the red channel, an image 501_G of the green channel, and an image 501_B of the blue channel. Taking the filter 503 as an example, the filter 503 includes a sliding window 503_R for performing a convolution operation with the image 501_R, a sliding window 503_G for performing a convolution operation with the image 501_G, and a sliding window 503_B for performing a convolution operation with the image 501_B. The sliding window 503_R, the sliding window 503_G, and the sliding window 503_B each contain a plurality of weights. When the multimedia data 501 is recognized, the weights in the sliding window 503_R, the sliding window 503_G, and the sliding window 503_B act on the image 501_R, the image 501_G, and the image 501_B, respectively.

FIG. 6A to FIG. 6E are schematic diagrams of how the sliding windows operate according to an embodiment of the invention. Referring to FIG. 6A to FIG. 6E, in FIG. 6A, starting from one corner of the image 501_R (in this exemplary embodiment, from the upper left corner), the value of each weight in the sliding window 503_R is multiplied by the corresponding pixel value in the block 6001 of the image 501_R, and the products of the weights and the pixel values are then summed to obtain an output value R_out.

Similarly, in FIG. 6B, starting from one corner of the image 501_G (in this exemplary embodiment, from the upper left corner), the value of each weight in the sliding window 503_G is multiplied by the corresponding pixel value in the block 6002 of the image 501_G, and the products of the weights and the pixel values are then summed to obtain an output value G_out.

Similarly, in FIG. 6C, starting from one corner of the image 501_B (in this exemplary embodiment, from the upper left corner), the value of each weight in the sliding window 503_B is multiplied by the corresponding pixel value in the block 6003 of the image 501_B, and the products of the weights and the pixel values are then summed to obtain an output value B_out.

After the output values R_out, G_out, and B_out are computed, as shown in FIG. 6D, the output values R_out, G_out, and B_out and the bias of the filter 503 are added together, the sum is input into a function (usually a sigmoid or ReLU function, denoted by the function f in FIG. 6D), and the result is finally written to the corresponding position in the two-dimensional array 600. The sliding window 503_R, the sliding window 503_G, and the sliding window 503_B are then shifted to the right by a specific number of pixels (for example, one pixel) in the image 501_R, the image 501_G, and the image 501_B, respectively, the foregoing operation is repeated, and the new output is placed to the right of the previous output, as shown in FIG. 6E.

After a horizontal pass is completed (i.e., when the sliding window 503_R, the sliding window 503_G, and the sliding window 503_B have each scanned one row of the image 501_R, the image 501_G, and the image 501_B), the sliding window 503_R, the sliding window 503_G, and the sliding window 503_B move down by a specific number of pixels, the foregoing operation is repeated from left to right, and the outputs are placed below the outputs of the previous row in the two-dimensional array 600. This continues until the sliding window 503_R, the sliding window 503_G, and the sliding window 503_B have scanned the entire image 501_R, image 501_G, and image 501_B, respectively. After the filter 503 has acted on the image 501_R, the image 501_G, and the image 501_B, its result is output as the two-dimensional array 600. Similarly, the other filters output the two-dimensional array 601 and the two-dimensional array 602. Finally, these two-dimensional arrays can be stacked together and sent to the next layer of the network, for example a pooling layer.
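
The per-channel sliding-window computation of FIG. 6A to FIG. 6E can be written out directly. The following NumPy sketch (with an assumed 3x3 window and a stride of one pixel) sums the per-channel products, adds the filter's bias, and applies the function f:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def convolve_rgb(channels, windows, bias, f=relu, stride=1):
    """One filter acting on the R, G, B channels, as in FIG. 6A-6E.

    channels: three HxW arrays (images 501_R, 501_G, 501_B).
    windows:  three kxk weight arrays (sliding windows 503_R, 503_G, 503_B).
    Returns the two-dimensional output array (array 600).
    """
    k = windows[0].shape[0]
    height, width = channels[0].shape
    out = np.zeros(((height - k) // stride + 1, (width - k) // stride + 1))
    for i in range(out.shape[0]):            # move the windows downward
        for j in range(out.shape[1]):        # move the windows to the right
            r, c = i * stride, j * stride
            total = sum(np.sum(ch[r:r+k, c:c+k] * w)  # weight * pixel, summed
                        for ch, w in zip(channels, windows))
            out[i, j] = f(total + bias)      # R_out + G_out + B_out + bias -> f
    return out

# Usage with an assumed 8x8 image and a 3x3 filter.
rng = np.random.default_rng(0)
channels = [rng.random((8, 8)) for _ in range(3)]
windows = [rng.random((3, 3)) for _ in range(3)]
print(convolve_rgb(channels, windows, bias=0.1).shape)  # (6, 6)
```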

The output of a convolution layer is usually a three-dimensional array in which many element values are close to zero. Therefore, to reduce the amount of computation, a pooling layer is usually attached after the convolution layer. It operates similarly to a convolution layer, but the sliding window of the pooling layer averages the data it covers or takes their maximum and outputs the result, which has the effect of summarizing local features in the data. The result is then usually sent to the next convolution layer. The depth of the input array of that convolution layer is no longer R, G, B as in the first layer; it may be a very deep array (its depth equals the number of filters in the first convolution layer), but the principle of operation remains the same as described above.
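
A pooling layer can be sketched the same way; here is a minimal max-pooling window (an assumed 2x2 window with stride 2) that summarizes local features as described above:

```python
import numpy as np

def max_pool(array, k=2, stride=2):
    """Sliding window that outputs the maximum of the data it covers."""
    height, width = array.shape
    out = np.zeros(((height - k) // stride + 1, (width - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = array[i*stride:i*stride+k, j*stride:j*stride+k].max()
    return out

# Usage: pooling the 6x6 convolution output from the previous sketch gives 3x3.
print(max_pool(np.arange(36.0).reshape(6, 6)))
```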

It should be noted that the combination of convolution layers and pooling layers may be referred to as a convolutional neural network, and a convolutional neural network has several notable properties. For example, a convolutional neural network can learn the weights of the sliding windows in each filter from the training data. Traditional object recognition must derive the feature vector of the input image from a specific algorithm (for example, corner detection) or from filters designed by experts. A convolutional neural network, in contrast, lets the network automatically find suitable filters by feeding it a large number of images together with their corresponding answers (i.e., classification results and object boxes) during the training phase. The feature extraction algorithms or filters that can be designed manually are usually limited in number, and omissions are inevitable. Deep learning, however, can let the computer find a large number of suitable filters by connecting many convolution layers in series and in parallel and adding a large number of filters to each convolution layer. This not only saves the time humans spend designing filters, but also discovers many filters that humans would not think of, yielding feature vectors that represent images well.

In addition, a convolutional neural network has the property of weight sharing. Specifically, in a traditional neural network, every neuron must be fully connected to the previous layer (or to the input data), which often results in an enormous amount of computation. For example, if the input image is 500*300 pixels and the first layer of the network has 700 neurons, then that layer contains 500*300*700 = 105,000,000 connections, i.e., 105,000,000 weights to adjust during training. This often ends up being more computation than the computer can afford. In a convolution layer, by contrast, every image (or input array) shares the weights of the same set of sliding windows, which greatly reduces the amount of computation compared with a fully connected network.

A convolutional neural network also has the property of local perception. Specifically, as described above, because a filter is usually smaller than the entire image (or input array), the filter usually only perceives a local region of the image. This property effectively supports translation of the target to be recognized within the image. If a 5*5 filter can reliably find a human face, then no matter where the face is in the image, the face can be detected as soon as the filter slides over it. Compared with a fully connected neural network, this approach handles translation of the target much more efficiently.

A convolutional neural network also automatically learns filters at different levels of abstraction. Specifically, although the weights in the filters of a convolution layer are initially random, suitable weights are eventually learned through training on a large amount of data. As a rule of thumb, when each convolution layer (or pooling layer) of a deep network is mapped back (by deconvolution) onto the input multimedia data (or image), one finds that the earlier layers usually detect more local or lower-level features (for example, color corners, edges, and contours), while the later layers usually detect features over a wider area or at a higher level of abstraction (for example, recognition targets such as faces, cars, and buildings). This property is difficult to achieve with manually designed filters.

Referring again to FIG. 4, immediately after the convolutional part of the network 400 comes the fully connected layer, for example a fully connected neural network or a traditional classifier (such as an SVM). Because the convolution layers usually only consider local features, the fully connected layer can take all of the local features into account together, classify the object, and predict the bounding box used to frame the object.

FIG. 7 is a schematic diagram of training a convolutional neural network according to an embodiment of the invention. Referring to FIG. 7, after the convolutional neural network 400 of the deep learning model has been designed, a large number of multimedia data to be trained 700, each labeled with its answer 703, must be input into the deep learning model 701. The deep learning model 701 includes the aforementioned convolutional neural network 400 and a loss layer immediately after it. That is, in this exemplary embodiment, the deep learning model 701 also includes a loss layer during the training phase. The loss layer defines how the error is computed, and the deep learning model 701 can adjust the weights of each layer of the network according to this error. Developers can define the error computation according to their needs. For example, if correctly predicting the object type is more important, a function can be designed so that the deep learning model produces a larger error when it predicts the object type incorrectly (for example, the fourth power of the difference between the prediction and the answer). If correctly predicting the width and height of the object is less important, a common error measure can be used for them (for example, the square of the difference between the prediction and the answer).

In this exemplary embodiment, the loss layer compares the output 702 of the convolutional neural network 400 with the answers 703 of the multimedia data 700 and computes the error. The deep learning model 701 then adjusts the weights of each layer inside its network according to this error, thereby training the deep learning model. After the weights have been adjusted to a sufficient degree, the output 702 of the convolutional neural network 400 becomes very close to the answers 703 of the input multimedia data to be trained 700; at this point, learning is said to be complete, or the network is said to have converged.

That is, in this exemplary embodiment, the deep learning model 701 attaches a loss layer (also called a penalty layer) to the end of the convolutional neural network 400 during the training phase. The loss layer compares the output answers that the convolutional neural network 400 produces from the multimedia data to be trained with the answers of the multimedia data to be trained 700, and computes the error. Using this error, the convolutional neural network then adjusts the weights of every layer in the network one by one, from back to front, by backpropagation. The error can be computed (i.e., the loss function can be chosen) as, for example, a squared difference, Softmax, and so on. In this exemplary embodiment, the loss layer is only used during the training phase and is removed after the training phase is completed.
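
A minimal training-phase sketch, continuing the assumptions of the earlier RecognitionNet example: the "loss layer" below combines a classification error with a squared error on the box values, weighting the type error more heavily as suggested above (the weight of 4.0 is an illustrative assumption), and the backward pass then adjusts every layer's weights from back to front.

```python
import torch
import torch.nn as nn

model = RecognitionNet(num_classes=10)           # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
class_loss = nn.CrossEntropyLoss()               # error on the object type
box_loss = nn.MSELoss()                          # squared difference on cx, cy, w, h, angle

def training_step(images, type_answers, box_answers, type_weight=4.0):
    """One iteration: compare outputs with the answers, backpropagate, adjust weights."""
    logits, boxes = model(images)
    # The "loss layer": a developer-chosen combination of the two errors.
    error = type_weight * class_loss(logits, type_answers) + box_loss(boxes, box_answers)
    optimizer.zero_grad()
    error.backward()       # propagate the error from back to front
    optimizer.step()       # adjust the weights of every layer
    return error.item()

# Usage with random stand-in data; real training uses labeled images and boxes.
images = torch.randn(4, 3, 224, 224)
type_answers = torch.randint(0, 10, (4,))
box_answers = torch.randn(4, 5)
print(training_step(images, type_answers, box_answers))
```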

After learning is complete, the deploy phase can begin. FIG. 8 is a schematic diagram of the deploy phase according to an embodiment of the invention. Referring to FIG. 8, once multimedia data 801 is input, the deep learning model 803 (identical to the deep learning model 701 described above) obtains the object information 805 of the object it detects. In this exemplary embodiment, the multimedia data 801 shows a Japanese castle, and the object information 805 can include the probability, as judged by the deep learning model 803, that the multimedia data 801 shows a Japanese castle, the coordinates (i.e., the X value and the Y value) of the center point of the bounding box in the multimedia data 801, the length and the width of the bounding box used to frame the Japanese castle, and the rotation angle of the Japanese castle in the multimedia data 801.

In an exemplary embodiment of the invention, the deep learning model is applied in augmented reality technology. In the example of FIG. 8 above, after the object information 805 is obtained, the corresponding superimposed object (also called augmented reality content) can be retrieved from the storage device 36 according to the obtained object information 805; the superimposed object is then rotated according to the rotation angle described above, scaled according to the length and the width of the bounding box described above, and placed at the coordinates of the center point described above.

In other words, using the output of the deep learning model (for example, the probability of the recognized object and the bounding box described above), the position, size, and rotation angle of the object to be recognized can be found in the input multimedia data. In this way, the augmented reality content (e.g., pictures, animated models, graphic and text information) can be scaled according to this size and placed at a suitable position to achieve the augmented reality effect.
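
A sketch of this placement step using OpenCV (an implementation choice not specified by the patent; the sketch assumes the box lies fully inside the frame): the overlay is scaled to the bounding box, rotated by the recognized angle, and pasted at the center point.

```python
import cv2
import numpy as np

def place_overlay(frame, overlay, cx, cy, w, h, angle_deg):
    """Scale, rotate, and paste AR content according to the recognized bounding box."""
    # Scale the overlay to the bounding box reported by the model.
    resized = cv2.resize(overlay, (w, h))
    # Rotate it around its own center by the recognized rotation angle.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(resized, m, (w, h))
    # Paste it so that its center lands on the object's center point.
    x0, y0 = cx - w // 2, cy - h // 2
    roi = frame[y0:y0 + h, x0:x0 + w]
    mask = rotated.sum(axis=2) > 0          # treat black pixels as transparent
    roi[mask] = rotated[mask]
    return frame

# Usage with stand-in images; in the patent's flow the five numbers come from the model.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
overlay = np.full((100, 100, 3), 255, dtype=np.uint8)
result = place_overlay(frame, overlay, cx=320, cy=240, w=120, h=80, angle_deg=30.0)
```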

With the above properties of the deep learning model, the disadvantages of traditional object recognition can be addressed as follows. (1) By feeding the training data with images of the target captured from various angles and classifying them into the same category, the inability of traditional augmented reality to handle different shooting angles is resolved. (2) If the training goal is to recognize the type of an object, the data of different objects of the same type are added to the training data (for example, the volcanoes at location A and location B are both put into the same training category), so that objects of the same type can be recognized. If one instead wants to distinguish different objects of the same type (for example, to distinguish the volcano at location A from the volcano at location B), the data of the different objects are divided into different categories and input into the deep learning model for training. This gives greater flexibility than traditional object recognition methods. (3) Traditional object recognition can only extract color corner points at a low level of abstraction, and comparing the distributions of these corner points easily leads to misrecognition (for example, when the corner-point distributions of a mountain and a house are similar, misjudgment is likely). A deep learning model can refine low-level feature vectors into higher-level, more abstract features, thereby avoiding such misrecognition. (4) Because a deep learning model does not judge whether two images are similar based on color corner points, a high-resolution picture can be reduced to a low-resolution image before recognition. And because the deep learning model judges whether two images are similar using feature vectors at a higher level of abstraction, reducing the image does not lower the recognition accuracy.

It should be noted that the training and deployment of the deep learning model described above can both be executed by the processor 34a or 34b of the object recognition system 300 of FIG. 3, and the object recognition system 300 of FIG. 3 can use the deep learning model to implement augmented reality.

For example, FIG. 9 is a schematic diagram of a display flow of augmented reality according to an embodiment of the invention. Referring to FIG. 9, the flow is described with the subsystem 302 (for example, a handheld device) in the object recognition system 300 of FIG. 3. In step S901, the subsystem 302 captures, for example through the input device 30, multimedia data whose content is a "bridge". Next, in step S902, the processor 34a of the subsystem 302 inputs the multimedia data whose content is a "bridge" into the deep learning model. Then, in step S903, the processor 34a of the subsystem 302 obtains from the deep learning model the category of the object in the multimedia data, namely "bridge", as well as the coordinates of the center point of the bounding box used to frame the object, the length and the width of the bounding box, and the rotation angle of the "bridge" in the multimedia data. Then, in step S904, the processor 34a of the subsystem 302 obtains, for example from the storage device 36 and according to the recognized category of the object, the superimposed object corresponding to the object. Finally, in step S905, the processor 34a of the subsystem 302 superimposes the superimposed object on the multimedia data according to the coordinates of the center point of the bounding box, the length and the width of the bounding box, and the rotation angle of the "bridge" in the multimedia data, and outputs (or displays) the multimedia data with the superimposed object through the output device 32.

It should be noted that steps S902 and S903 above can be implemented using cloud computing. For example, the multimedia data captured in step S901 can be transmitted from the subsystem 302 to the subsystem 304, and the processor 34b of the subsystem 304 performs step S902 using the deep learning model located in the subsystem 304 and returns the corresponding recognition result to the subsystem 302 in step S903. Finally, the subsystem 302 performs steps S904 and S905 to complete the augmented reality display. In this way, the deep learning model can be run using cloud computing, reducing the computational load on the device used by the user (i.e., the subsystem 302).

FIG. 10 is a schematic diagram of a display method of augmented reality according to an embodiment of the invention. The augmented reality display described above can be explained with FIG. 10. Referring to FIG. 10, in step S1001, the user uses the input device of the subsystem 302 to obtain multimedia data. In step S1003, the processor 34a of the subsystem 302 determines whether to perform cloud computing. When cloud computing is to be performed, in step S1005 the processor 34a of the subsystem 302 transmits the multimedia data to the subsystem 304 through a communication unit (not shown), the processor 34b of the subsystem 304 inputs the multimedia data into the deep learning model located in the subsystem 304 to recognize the objects in the multimedia data, and the recognition result is returned to the subsystem 302. When cloud computing is not to be performed, in step S1007 the processor 34a of the subsystem 302 directly inputs the multimedia data into the deep learning model located in the subsystem 302 to recognize the objects in the multimedia data. After the objects in the multimedia data have been recognized, in step S1009 the processor 34a of the subsystem 302 executes the augmented reality image processing module to obtain the corresponding superimposed object according to the recognized object and superimposes the superimposed object on the multimedia data. Finally, in step S1011, the processor 34a of the subsystem 302 outputs (or displays) the multimedia data with the superimposed object through the output device 32.
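
A sketch of the branch at step S1003 follows; the endpoint URL and the response format are illustrative assumptions, since the patent does not define a wire protocol.

```python
import requests  # assumed HTTP transport; the patent does not specify one

RECOGNIZE_URL = "http://server.example/recognize"  # hypothetical subsystem 304 endpoint

def recognize(image_bytes: bytes, use_cloud: bool, local_model=None) -> dict:
    """Steps S1003/S1005/S1007: run recognition in the cloud or on the device."""
    if use_cloud:
        # S1005: send the multimedia data to subsystem 304 and wait for the result.
        response = requests.post(RECOGNIZE_URL, files={"image": image_bytes}, timeout=10)
        response.raise_for_status()
        # Assumed response format, e.g. {"type": ..., "cx": ..., "cy": ..., "w": ..., "h": ..., "angle": ...}
        return response.json()
    # S1007: run the deep learning model on subsystem 302 itself.
    return local_model(image_bytes)
```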

In an exemplary embodiment of the invention, the augmented reality display can also combine the deep learning model with geographic information.

FIG. 11 is a schematic diagram of a display method that applies geographic information to augmented reality according to an embodiment of the invention.

Referring to FIG. 11, in step S1101, the processor 34a of the subsystem 302 obtains the current geographic location of the subsystem 302 through a positioning device (not shown). Next, in step S1103, the processor 34a of the subsystem 302 determines whether the current geographic location matches specific geographic information. If the current geographic location does not match the specific geographic information, the flow returns to step S1101. If the current geographic location matches the specific geographic information, step S1001 of FIG. 10 can be performed, followed by the subsequent steps in FIG. 10.
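
The check at step S1103 could be as simple as a distance threshold against a target coordinate. The following sketch uses the haversine formula; the threshold and coordinates are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 coordinates, in meters."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def matches_geofence(current, target, radius_m=200.0):
    """Step S1103: does the current location match the specific geographic information?"""
    return haversine_m(*current, *target) <= radius_m

# Usage: Sun Moon Lake as the illustrative target location.
print(matches_geofence(current=(23.857, 120.916), target=(23.857, 120.915)))
```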

It should be noted that in the prior art it is only possible to determine whether the user is near a specific geographic coordinate, and depending on the network signal the positioning error may be very large, so it cannot be known with certainty whether the user's mobile device has actually reached a specific location and captured multimedia data (i.e., an image) containing the target. In the currently popular game "Pokémon", for example, the player can play without even opening the camera of the mobile phone.

However, with the deep learning model of the present invention, the application can not only determine from the current geographic information whether the user has reached the designated location, but can further determine whether the user has used the input device to capture multimedia material (i.e., an image) containing the target object, and decide accordingly whether the augmented reality content should interact with the player. This further strengthens the user's interaction with the real environment and brings the augmented reality content to life. For example, the application may require the user not only to go to Sun Moon Lake, but also to capture the lake's water with the phone's camera, before augmented reality content is generated and interacts with the user.

In particular, combining geographic information with deep-learning-based object recognition overcomes limitations that geographic information alone cannot. For example, suppose a user in a store should be able to scan a product with a mobile device and see a corresponding augmented reality introduction or advertisement. With geographic information alone this cannot be achieved effectively, because the geographic coordinates of the products in the store are very close to one another. Moreover, traditional object recognition cannot effectively recognize three-dimensional objects, making accurate object recognition difficult to attain.

Combined with the deep learning model, the system can also determine whether a certain event has occurred and let the application react specifically to that event.

FIG. 12 is a schematic diagram of a display method that applies geographic information to augmented reality according to another embodiment of the invention.

Referring to FIG. 12, in step S1201 the processor 34a of the subsystem 302 obtains the current geographic location of the subsystem 302 through a positioning device (not shown). Next, in step S1203, the processor 34a of the subsystem 302 determines whether the current geographic location conforms to specific geographic information. If it does not, the flow returns to step S1201. If it does, then in step S1205 the user uses the input device of the subsystem 302 to acquire multimedia material. In step S1207, the processor 34a of the subsystem 302 inputs the multimedia material into the deep learning model located in the subsystem 302 to recognize the objects in the multimedia material. After the objects in the multimedia material have been recognized, in step S1209 the processor 34a of the subsystem 302 determines whether a specific object has been captured. If not, the flow returns to step S1207. If a specific object has been captured, step S1211 determines whether the object is in the state of a special event. If it is not, then in step S1213 the processor 34a of the subsystem 302 superimposes general overlay information to display general augmented reality content. If it is, then in step S1215 the processor 34a of the subsystem 302 superimposes special overlay information to display special augmented reality content.
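The conditional flow of FIG. 12 might be sketched as follows; `within_geofence` is the helper sketched after FIG. 11 above, and `is_special_event` stands in for whatever event test an application uses (a weather feed, a second classifier, and so on). None of these names come from the patent.

```python
def geo_event_ar_loop(locator, target, input_device, model, overlay_store,
                      display, target_category, is_special_event):
    """Sketch of steps S1201-S1215 of FIG. 12."""
    # S1201/S1203: poll the positioning device until the current location
    # conforms to the specific geographic information
    while not within_geofence(locator.current_position(), target):
        pass

    while True:
        frame = input_device.capture()       # S1205: acquire a frame
        objects = model.recognize(frame)     # S1207: deep learning inference
        # S1209: was the specific object captured? If not, recognize again
        hits = [o for o in objects if o.category == target_category]
        if not hits:
            continue
        obj = hits[0]
        # S1211: is the object in the state of a special event?
        if is_special_event(obj, frame):
            overlay = overlay_store.get_special(obj.category)  # S1215
        else:
            overlay = overlay_store.get_general(obj.category)  # S1213
        display.show(overlay.draw(frame, obj.box, obj.angle))
        return
```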

The above method can be applied, for example, as follows. If the application reacts to bodies of water, then in addition to the lakes, pools, and other entries in the geographic database originally built into the application, a small pond that suddenly forms after rain can also trigger the application to interact with the user. As another example, when the application captures an image of a "mountain", it may play one kind of augmented reality content; but when the volcano erupts, it may play another, special kind of augmented reality content. In this way the application can surprise the user and adapt to circumstances as they arise.

FIG. 13 is a flowchart of an object recognition method according to an embodiment of the invention. Referring to FIG. 13, in step S1301 the input device 30 acquires multimedia material. In step S1303, the processor 34a or the processor 34b inputs the multimedia material into the deep learning model to recognize the objects in the multimedia material. Finally, in step S1305, the output device 32 outputs, according to the recognized object in the multimedia material, output information corresponding to that object.
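Stripped of the augmented reality specifics, the three steps of FIG. 13 reduce to a sketch like this (again with purely illustrative names):

```python
def object_recognition(input_device, model, output_device):
    frame = input_device.capture()    # S1301: acquire multimedia material
    objects = model.recognize(frame)  # S1303: recognize objects with the
                                      #        deep learning model
    for obj in objects:               # S1305: output information that
        output_device.emit(obj)       #        corresponds to each object
```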

In summary, the present invention provides an object recognition method and an object recognition system that use a deep learning model for object recognition. They improve the accuracy of recognizing objects in multimedia material and can also be applied in augmented reality technology to provide a better user experience.

Although the present invention has been disclosed in the embodiments above, they are not intended to limit the invention. Anyone with ordinary skill in the art may make some changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention shall be defined by the appended claims.

Description of reference numerals:
10, 22, 24, 26, 501_R, 501_G, 501_B: image
100: object recognition module
S101: feature extraction step
S102: step of generating the feature vector
S103: step of the classifier classifying according to the feature vector
20: color inflection point
300: object recognition system
302, 304: subsystem
30: input device
32: output device
34a, 34b: processor
36: storage device
400: convolutional neural network
40: multimedia material
401: convolution layer
402: pooling layer
403: fully connected layer
501: multimedia material
503: filter
503_R, 503_G, 503_B: sliding window
6001, 6002, 6003: block
600, 601, 602: two-dimensional array
700: to-be-trained multimedia material
701, 803: deep learning model
702: output
703: solution
801: multimedia material
805: object information
S901: step of acquiring multimedia material through the input device
S902: step of inputting the multimedia material into the deep learning model
S903: step of obtaining, by the deep learning model, the category of the object in the multimedia material, the center-point coordinates of the bounding box used to select the object, the length and width of the bounding box, and the rotation angle of the object in the multimedia material
S904: step of obtaining the superimposed object corresponding to the object according to the recognized category of the object
S905: step of superimposing the superimposed object onto the multimedia material and outputting the multimedia material with the superimposed object
S1001: step of acquiring multimedia material
S1003: step of determining whether to perform cloud computing
S1005: step of transmitting the multimedia material, inputting it into the deep learning model of another subsystem to recognize the objects in the multimedia material, and returning the recognition result
S1007: step of inputting the multimedia material into the deep learning model to recognize the objects in the multimedia material
S1009: step of executing the augmented reality image processing module to obtain the superimposed object corresponding to the recognized object and superimposing it onto the multimedia material
S1011: step of outputting the multimedia material with the superimposed object
S1101: step of obtaining the current geographic location
S1103: step of determining whether the current geographic location conforms to specific geographic information
S1201: step of obtaining the current geographic location
S1203: step of determining whether the current geographic location conforms to specific geographic information
S1205: step of acquiring multimedia material
S1207: step of inputting the multimedia material into the deep learning model to recognize the objects in the multimedia material
S1209: step of determining whether a specific object has been captured
S1211: step of determining whether the object is in the state of a special event
S1213: step of superimposing general overlay information to display general augmented reality content
S1215: step of superimposing special overlay information to display special augmented reality content
S1301: step of acquiring multimedia material
S1303: step of inputting the multimedia material into the deep learning model to recognize the objects in the multimedia material
S1305: step of outputting, according to the recognized object in the multimedia material, output information corresponding to the object

FIG. 1 is a schematic flowchart of object recognition.
FIG. 2A is a schematic diagram of traditional object recognition.
FIG. 2B is a schematic diagram of images of the same object captured from different angles.
FIG. 3 is a schematic diagram of an object recognition system according to an embodiment of the invention.
FIG. 4 is a schematic diagram of a convolutional neural network according to an embodiment of the invention.
FIG. 5 is a schematic diagram of a filter and its sliding window according to an embodiment of the invention.
FIG. 6A to FIG. 6E are schematic diagrams of how the sliding window operates according to an embodiment of the invention.
FIG. 7 is a schematic diagram of training a convolutional neural network according to an embodiment of the invention.
FIG. 8 is a schematic diagram of the deployment stage according to an embodiment of the invention.
FIG. 9 is a schematic diagram of a display flow of augmented reality according to an embodiment of the invention.
FIG. 10 is a schematic diagram of a display method for augmented reality according to an embodiment of the invention.
FIG. 11 is a schematic diagram of a display method that applies geographic information to augmented reality according to an embodiment of the invention.
FIG. 12 is a schematic diagram of a display method that applies geographic information to augmented reality according to another embodiment of the invention.
FIG. 13 is a flowchart of an object recognition method according to an embodiment of the invention.

Claims (10)

1. An object recognition method, comprising: acquiring multimedia material; inputting the multimedia material into a deep learning model to recognize an object in the multimedia material; and outputting, according to the recognized object in the multimedia material, output information corresponding to the object, wherein the step of outputting the output information corresponding to the object comprises: obtaining, according to the object recognized by the deep learning model, a superimposed object corresponding to the object; rotating the superimposed object according to a rotation angle of the object in the multimedia material, scaling the superimposed object according to a length of a bounding box used to select the object and a width of the bounding box, and superimposing the superimposed object onto the multimedia material; and outputting the multimedia material with the superimposed object, wherein the rotation angle of the object in the multimedia material, the length of the bounding box of the object, and the width of the bounding box are recognized by the deep learning model.

2. The object recognition method of claim 1, wherein the deep learning model comprises a convolutional neural network (CNN) including at least one convolution layer and at least one pooling layer, and the method further comprises: extracting a feature value of the multimedia material by the at least one convolution layer and the at least one pooling layer.

3. The object recognition method of claim 2, wherein the deep learning model further comprises a fully connected layer or a machine learning algorithm, and the step of outputting the output information corresponding to the object comprises: classifying the object according to the feature value by the fully connected layer or the machine learning algorithm, and obtaining object information corresponding to the object.

4. The object recognition method of claim 3, wherein the object information comprises the coordinates of a plurality of vertices of the bounding box used to select the object.

5. The object recognition method of claim 3, wherein the object information comprises the type of the object.

6. The object recognition method of claim 1, wherein the deep learning model comprises a plurality of layers, and the method further comprises: inputting a plurality of to-be-trained multimedia materials, together with a plurality of solutions respectively corresponding to the to-be-trained multimedia materials, into the deep learning model; and adjusting a plurality of weights of each of the layers in the deep learning model according to the to-be-trained multimedia materials and the solutions to train the deep learning model.

7. The object recognition method of claim 6, wherein the deep learning model comprises a loss layer, and the step of adjusting the weights of each of the layers in the deep learning model according to the to-be-trained multimedia materials and the solutions to train the deep learning model comprises: outputting, by the deep learning model and according to the to-be-trained multimedia materials, a plurality of output solutions respectively corresponding to the to-be-trained multimedia materials; and comparing, by the loss layer, the output solutions with the solutions corresponding to the to-be-trained multimedia materials, and adjusting the weights of each of the layers in the deep learning model according to a loss function.

8. The object recognition method of claim 1, wherein before the step of acquiring multimedia material, the method further comprises: obtaining current geographic information; determining whether the current geographic information conforms to specific geographic information; and performing the step of acquiring the multimedia material when the current geographic information conforms to the specific geographic information.

9. The object recognition method of claim 1, wherein the multimedia material comprises at least one of an image, a point cloud, a voxel, and a mesh.

10. An object recognition system, comprising: an input device for acquiring multimedia material; a processor for inputting the multimedia material into a deep learning model to recognize an object in the multimedia material; an output device for outputting, according to the recognized object in the multimedia material, output information corresponding to the object; and a storage device, wherein in the operation of outputting the output information corresponding to the object, the processor further obtains, from the storage device, a superimposed object corresponding to the object recognized by the deep learning model; the processor further rotates the superimposed object according to a rotation angle of the object in the multimedia material, scales the superimposed object according to a length of a bounding box used to select the object and a width of the bounding box, and superimposes the superimposed object onto the multimedia material; and the output device further outputs the multimedia material with the superimposed object, wherein the rotation angle of the object in the multimedia material, the length of the bounding box of the object, and the width of the bounding box are recognized by the deep learning model.
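Claims 1 and 10 turn the model's outputs (bounding-box center, length, width, and rotation angle) into a transform of the superimposed object. Below is a rough sketch of that transform using OpenCV, under the assumption that the overlay and frame are plain BGR arrays and that the box lies fully inside the frame; a real implementation would clip at the borders and blend an alpha channel.

```python
import cv2
import numpy as np

def superimpose(frame: np.ndarray, overlay: np.ndarray,
                box, angle_deg: float) -> np.ndarray:
    """Scale, rotate, and paste the superimposed object onto the frame.

    box = (cx, cy, w, h): center coordinates, length, and width of the
    bounding box; box and angle_deg are the quantities recognized by the
    deep learning model, as recited in claims 1 and 10.
    """
    cx, cy, w, h = box
    w, h = int(w), int(h)
    # Scale the superimposed object to the bounding box's length and width
    scaled = cv2.resize(overlay, (w, h))
    # Rotate it by the object's rotation angle in the multimedia material
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(scaled, m, (w, h))
    # Paste the result centered on the bounding box
    x0, y0 = int(cx) - w // 2, int(cy) - h // 2
    frame[y0:y0 + h, x0:x0 + w] = rotated
    return frame
```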
TW106113453A 2017-04-21 2017-04-21 Object recognition method and object recognition system TWI643137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106113453A TWI643137B (en) 2017-04-21 2017-04-21 Object recognition method and object recognition system


Publications (2)

Publication Number Publication Date
TW201839665A TW201839665A (en) 2018-11-01
TWI643137B true TWI643137B (en) 2018-12-01

Family

ID=65033781

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106113453A TWI643137B (en) 2017-04-21 2017-04-21 Object recognition method and object recognition system

Country Status (1)

Country Link
TW (1) TWI643137B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI707293B (en) * 2019-02-15 2020-10-11 茂生農經股份有限公司 Automatic monitoring system for poultry or livestock behavior and automatic monitoring method thereof

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI717655B (en) * 2018-11-09 2021-02-01 財團法人資訊工業策進會 Feature determination apparatus and method adapted to multiple object sizes
US10748033B2 (en) 2018-12-11 2020-08-18 Industrial Technology Research Institute Object detection method using CNN model and object detection apparatus using the same
TWI706334B (en) * 2018-12-13 2020-10-01 鴻海精密工業股份有限公司 Storage device, electronic device and method for classifying images
CN111325225B (en) 2018-12-13 2023-03-21 富泰华工业(深圳)有限公司 Image classification method, electronic device and storage medium
TWI755912B (en) * 2020-10-28 2022-02-21 行政院原子能委員會核能研究所 An augmented reality is applied to a remote operating device and method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030063796A1 (en) * 2001-09-28 2003-04-03 Koninklijke Philips Electronics N.V. System and method of face recognition through 1/2 faces
TW200842733A (en) * 2007-04-17 2008-11-01 Univ Nat Chiao Tung Object image detection method
CN104036242A (en) * 2014-06-03 2014-09-10 北京工业大学 Object recognition method based on convolutional restricted Boltzmann machine combining Centering Trick
CN104954741A (en) * 2015-05-29 2015-09-30 东方浩联(北京)智能科技有限公司 Tramcar on-load and no-load state detecting method and system based on deep-level self-learning network
CN105956091A (en) * 2016-04-29 2016-09-21 北京小米移动软件有限公司 Extended information acquisition method and device
TW201640419A (en) * 2015-05-13 2016-11-16 盾心科技股份有限公司 Image recognition and monitoring system and its implementing method
TW201643779A (en) * 2015-06-03 2016-12-16 Mitsubishi Electric Corp Inference device and inference method
CN205942690U (en) * 2016-05-31 2017-02-08 成都九十度工业产品设计有限公司 Image geolocation system based on convolutional neural network
CN106557579A (en) * 2016-11-28 2017-04-05 中通服公众信息产业股份有限公司 A kind of vehicle model searching system and method based on convolutional neural networks


Also Published As

Publication number Publication date
TW201839665A (en) 2018-11-01


Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees