TW202129593A - Convolutional neural network detection method and system for animals - Google Patents


Publication number: TW202129593A
Application number: TW109101571A
Authority: TW (Taiwan)
Other languages: Chinese (zh)
Other versions: TWI728655B (granted publication)
Inventors: 林春宏, 詹永寬, 鄭琮翰, 陳佳鴻
Original assignee: 國立臺中科技大學
Prior art keywords: image, object information, detection, neural network, layer


Abstract

A convolutional neural network detection system and method for animals are disclosed. In the method, an image is generated by an image capturing device, and a processor then performs a first object detection and positioning process on the image to extract first object information and a first sub-image. A first object type identification process extracts a plurality of first object type probability values from the first sub-image. If the maximum of the first object type probability values is less than a first predetermined value, a second object detection and positioning process is performed on the first sub-image to extract second object information and a second sub-image, and a second object type identification process extracts a plurality of second object type probability values. A weighting calculation is then executed on the first object type probability values and the second object type probability values to generate the final object type probability values.

Description

Convolutional neural network detection method and system applied to animals

The present invention relates to a convolutional neural network detection method and system, and in particular to a convolutional neural network detection method and system applied to animals.

A traditional surveillance system either broadcasts the monitored field of view directly on a monitoring screen or records it on a hard disk as pictures or videos. Such a system cannot issue a warning before an event occurs; only after a specific event has happened does the operator start retrieving large amounts of footage and comparing it piece by piece to search for targets of interest, such as pet dogs or stray dogs. This kind of monitoring not only wastes a large amount of storage space but also places a heavy burden on the operator.

In view of the above problems, the present invention provides a convolutional neural network detection method and system applied to animals, which is used for the positioning and breed identification of pet dogs or stray dogs, so as to realize an automated monitoring system.

The present invention provides a convolutional neural network detection method applied to animals, carried out with an image capturing device and a processor. First, an image is generated by the image capturing device. The processor then performs a first object detection and positioning process on the image and extracts, from the image, first object information and a first sub-image corresponding to the first object information. The processor performs a first object type identification process on the first sub-image and extracts a plurality of first object type probability values. If the maximum of the first object type probability values is greater than or equal to a first preset value, the first object information and that maximum probability value are merged to generate first final object information. If the maximum of the first object type probability values is less than the first preset value, the processor performs a second object detection and positioning process on the first sub-image and extracts second object information and a second sub-image. The processor performs a second object type identification process on the second sub-image and extracts a plurality of second object type probability values. The processor then performs a weighted calculation on the first object type probability values with their corresponding first weight values and on the second object type probability values with their corresponding second weight values, producing a plurality of corresponding final object type probability values, and merges the first object information, the second object information and the maximum of the final object type probability values to generate second final object information.
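The overall control flow of the method can be illustrated with a short Python sketch. This is only an illustration under assumed interfaces: the helper callables `detector` and `classifier`, the threshold and the weight values are hypothetical placeholders and are not taken from the disclosed embodiment.

```python
# Illustrative sketch only; detector/classifier are hypothetical stand-ins for the
# YOLO-based detection/positioning process and the InceptionV3-based type recognizer.
def detect_animal(image, detector, classifier, first_preset=0.5, w1=0.6, w2=0.4):
    # First object detection and positioning: object info plus the cropped sub-image.
    first_info, first_sub = detector(image)

    # First object type identification: class name -> probability.
    p1 = classifier(first_sub)
    best1 = max(p1, key=p1.get)
    if p1[best1] >= first_preset:
        # Confident enough: merge the box info with the best class probability.
        return {"info": first_info, "type": best1, "prob": p1[best1]}

    # Otherwise run the second detection/positioning on the first sub-image (e.g. the head).
    second_info, second_sub = detector(first_sub)
    p2 = classifier(second_sub)

    # Weighted combination of the two sets of type probabilities.
    fused = {c: w1 * p1.get(c, 0.0) + w2 * p2.get(c, 0.0) for c in set(p1) | set(p2)}
    best = max(fused, key=fused.get)
    return {"info": (first_info, second_info), "type": best, "prob": fused[best]}

if __name__ == "__main__":
    dummy_detector = lambda img: ({"box": (0, 0, 10, 10)}, img)
    dummy_classifier = lambda img: {"husky": 0.40, "malamute": 0.35, "samoyed": 0.25}
    print(detect_animal("fake-image", dummy_detector, dummy_classifier))
```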

Preferably, the first and second object detection and positioning processes are applied to the detection and positioning of dogs, and the first and second object type identification processes are applied to the identification of dog breeds.

Preferably, the convolutional neural network detection method applied to animals further includes the following steps: the processor merges the first final object information or the second final object information into the image to generate a detection image; a storage device stores the image and the detection image; and an image output device outputs the detection image.

Preferably, the first and second object detection and positioning processes each include five max pooling layer operations, twenty-two convolution layer operations, and a corresponding leaky rectified linear activation function operation for each of these layers.

Preferably, the first and second object type identification processes each include three Inception (convolution kernel) module operations, six convolution layer operations and two max pooling layer operations.

Preferably, the first and second object type identification processes include a type identification method comprising a recall function, a precision function and a mean intersection over union function.

Preferably, the first and second object detection and positioning processes include a loss function.

Preferably, the present invention also provides a convolutional neural network detection system applied to animals. The system may include an image capturing device, a processor, a storage device and an image output device. The image capturing device generates an image. The processor is connected to the image capturing device and includes an object detector and locator and an object type recognizer. The object detector and locator receives the image generated by the image capturing device and outputs first object information and a first sub-image; it also takes the first sub-image as input and outputs second object information. The object type recognizer receives the first object information and the second object information and outputs first final object information and second final object information. The processor merges the first final object information or the second final object information into the image to generate a detection image. The storage device is connected to the processor and stores the image and the detection image. The image output device is connected to the storage device and outputs the detection image.

In summary, the convolutional neural network detection method and system of the present invention may have one or more of the following advantages:

(1) An automated identification system for animals is established, which can provide the position and breed of the animal to be identified in real time.

(2) Object detection is performed with a single neural network, so a reasonable level of performance can be achieved with fewer computing resources.

(3) Object detection is performed separately on the whole body and the head of the animal to be identified, and combined with a neural network for breed identification, which improves the identification success rate.

To help the examiners understand the technical features, content and advantages of the present invention and the effects it can achieve, the present invention is described in detail below with reference to the accompanying drawings and in the form of embodiments. The drawings are intended only for illustration and as an aid to the specification, and do not necessarily reflect the true proportions and precise configuration of the invention as implemented; therefore, the proportions and layout of the accompanying drawings should not be used to interpret or limit the scope of the present invention in actual implementation.

Please refer to Figure 1, which is a flowchart of the steps of the convolutional neural network detection method applied to animals according to an embodiment of the present invention. As shown in the figure, the method includes the following steps (S1 to S6):

Step S1: The image capturing device 100 generates an image 10 on which object detection and positioning is to be performed. The image capturing device 100 may be an electronic device such as a webcam, a digital camera, a smartphone or a dashboard camera, and its image resolution is 416 x 416.

Please refer to Figure 2, which is a schematic diagram of the image to be detected in the convolutional neural network detection method applied to animals according to an embodiment of the present invention. As shown in the figure, the image 10 captured by the webcam contains the objects to be detected and located, such as a dog 1a and a pedestrian 1b.

Step S2: The processor 200 performs the first object detection and positioning process. The image 10 generated by the image capturing device 100 can be input to the processor 200, which includes the first object detection and positioning process. After this process has been executed, the first object information 21 and the first sub-image 30 corresponding to the first object information 21 can be extracted. The details of the first object detection and positioning process are described further below.

Step S3: The processor 200 performs the first object type identification process. The first sub-image 30 extracted by the first object detection and positioning process can be input to the processor 200 for the first object type identification process; after this process has been executed, a plurality of first object type probability values can be extracted. If the maximum of the first object type probability values is greater than or equal to a first preset value, the first object information 21 and that maximum value are merged to generate the first final object information 22. Please refer to Figure 3, which shows the result of the convolutional neural network detection method applied to animals according to an embodiment of the present invention after the first object detection and positioning process and the first object type identification process. As shown in the figure, the bounding box representing the first final object information 22 is the result of merging the first object information 21 with the maximum of the first object type probability values. If the maximum of the first object type probability values is less than the first preset value, step S4 is performed. The details of the first object type identification process are described further below.

Step S4: The processor 200 performs the second object detection and positioning process. The first sub-image 30 extracted by the first object detection and positioning process can be input to the processor 200 for the second object detection and positioning process; after this process has been executed, the second object information 31 and the second sub-image 40 corresponding to the second object information 31 can be extracted. Please refer to Figure 4, which is a schematic diagram of the convolutional neural network detection method applied to animals according to an embodiment of the present invention after the second object detection and positioning process. As shown in the figure, the bounding box representing the second object information 31 is the region corresponding to the second object information 31 and the second sub-image 40 after the second object detection and positioning process has been executed. The second object detection and positioning process is similar to the first object detection and positioning process described above, and its details are described further below.

Step S5: The processor 200 performs the second object type identification process. The second sub-image 40 extracted by the second object detection and positioning process can be input to the processor 200 for the second object type identification process; after this process has been executed, a plurality of second object type probability values can be extracted.

Step S6: The processor 200 performs the weighted calculation. Each of the first object type probability values corresponds to a first weight value, and each of the second object type probability values corresponds to a second weight value. The processor 200 performs a weighted operation on these probability values and their corresponding weight values to generate a plurality of corresponding final object type probability values, and merges the first object information 21, the second object information 31 and the maximum of the final object type probability values to generate the second final object information 41. Please refer to Figure 5, which is a schematic diagram of the convolutional neural network detection method applied to animals according to an embodiment of the present invention after the weighted operation. As shown in the figure, the two bounding boxes representing the second final object information 41 are the result of merging the first object information 21, the second object information 31 and the maximum of the final object type probability values after the weighted operation. The details of the weighted calculation are described further below.

The first object detection and positioning process in step S2 is an object detection and positioning process based on the YOLO (you only look once) convolutional neural network architecture. YOLO differs from other object detection convolutional neural network architectures: most other architectures are built by combining several networks, whereas YOLO uses a single network, so the image is processed from input to the generation of object positions within one network architecture. The concept of YOLO is to divide the image into S x S cells, each of which can generate B predicted object boxes, where the information of an object box is a set of five values (x, y, h, w, confidence): (x, y) is the center position of the object box, and (h, w) are the height and width of the object box expressed as ratios relative to the whole image. The confidence is the predicted trust score of an object when an object center falls within the cell, and is calculated as follows:

$$\text{confidence} = \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}} \tag{1}$$

When no object center falls within a cell, Pr(Object) = 0; otherwise Pr(Object) = 1. In addition to the confidence, the probabilities of C classes are predicted, and the class probabilities are expressed differently in the training and testing phases. In the training phase, the class probability is expressed as follows:

$$\Pr(\text{Class}_i \mid \text{Object}) \tag{2}$$

In the testing phase, the confidence score is multiplied by the class probability, which is expressed as follows:

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \mathrm{IOU}^{\text{truth}}_{\text{pred}} \tag{3}$$

Here $\mathrm{IOU}^{\text{truth}}_{\text{pred}}$ is defined as follows:

$$\mathrm{IOU}^{\text{truth}}_{\text{pred}} = \frac{\text{area}\big(\text{box}_{\text{pred}} \cap \text{box}_{\text{truth}}\big)}{\text{area}\big(\text{box}_{\text{pred}} \cup \text{box}_{\text{truth}}\big)} \tag{4}$$

After the processor 200 performs the first object detection and positioning process on the image 10 input from the image capturing device 100, it extracts the bounding box containing the object, for example corresponding to the outline of the animal, together with the trust probability value of the object's class. The object box information can correspond to the first object information 21 of step S2 above, and the region given by the position and size (X, Y, H, W) of the object bounding box in the first object information 21 is the first sub-image 30.

One or more pieces of the first object information 21 can be generated in the first object detection and positioning process, i.e. more than one target object can be identified. The trust probability value of an object's class can be obtained as the product of three parameters: the first is the probability that an object appears in the examined region; the second is the probability that an object appears in the examined region and is judged to belong to a certain class; and the third is the ratio of the intersection area to the union area between the examined region and the object identified within it. The object information contained in the first object information 21 is determined by these five elements.

The first object type identification process in step S3 is an object type identification process generated with Google's open-source Inception V3 architecture. The first sub-image 30, produced after the processor 200 performs the first object detection and positioning process on the image 10, undergoes the first object type identification process, which extracts a plurality of first object type probability values (for example, the probability that the position given by the first object information contains an animal of a certain breed). If the maximum of these first object type probability values is greater than or equal to a first preset value (for example, 50%), it is combined with the first object information 21 to generate the first final object information 22 of step S3; that is, in a certain region of the image 10, one or more objects are determined to exist, the object is determined to be a certain animal, and the probability that this animal belongs to a specific breed is further determined to be above 50%.

The concept and method of the second object detection and positioning process in step S4 are similar to those of the first object detection and positioning process. The difference is that the first process operates on the image 10, whereas the second operates on the first sub-image 30. The first object information 21 extracted by the first process can correspond to the overall shape of the object to be identified (for example, the whole body of a dog), whereas the second object information 31 extracted by the second process can correspond to the shape of the object's head (for example, the appearance of the dog's head).

The concept and method of the second object type identification process in step S5 are similar to those of the first object type identification process. The difference is that the first process operates on the first sub-image 30, whereas the second operates on the second sub-image 40. The first object type probability values correspond to identifying the object's type from its overall appearance (for example, identifying the dog breed from the whole body), whereas the second object type probability values correspond to identifying the object's type from the appearance of its head (for example, identifying the dog breed from the head).

Describing the weighted operation mentioned in step S6 in more detail: for example, the five largest of the first object type probability values and their five corresponding first weight values are combined in a weighted operation with the five largest of the second object type probability values and their five corresponding second weight values. If a certain object type appears in both groups of probability values, its weighted result is accumulated. Finally, the object type with the largest weighted result is selected and merged with the first object information 21 and the second object information 31 to generate the second final object information 41.
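A small sketch of this weighting, assuming per-type weight values; the breed names, probabilities and weights below are invented for illustration and are not values from the embodiment.

```python
def fuse_top5(p1, p2, w1, w2, k=5):
    """Weight the k largest type probabilities of the two recognizers and pick the best type."""
    top1 = dict(sorted(p1.items(), key=lambda kv: kv[1], reverse=True)[:k])
    top2 = dict(sorted(p2.items(), key=lambda kv: kv[1], reverse=True)[:k])
    fused = {}
    for cls in set(top1) | set(top2):
        # A type appearing in both groups accumulates both weighted contributions.
        fused[cls] = w1.get(cls, 0.0) * top1.get(cls, 0.0) + w2.get(cls, 0.0) * top2.get(cls, 0.0)
    best = max(fused, key=fused.get)
    return best, fused[best]

# Whole-body (p1) and head (p2) recognition results with per-type weights (all made up).
p1 = {"shiba": 0.42, "akita": 0.30, "corgi": 0.12, "husky": 0.08, "pug": 0.05}
p2 = {"akita": 0.48, "shiba": 0.25, "husky": 0.10, "beagle": 0.09, "corgi": 0.05}
w1 = {c: 0.5 for c in p1}
w2 = {c: 0.5 for c in p2}
print(fuse_top5(p1, p2, w1, w2))   # ('akita', 0.39)
```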

More specifically, the first and second object detection and positioning processes mentioned above use a convolutional neural network generated with the YOLO architecture to extract, from the image 10 or the first sub-image 30 on which object detection and positioning is to be performed, the appearance and position of the dog or of the dog's head.

More specifically, the first and second object type identification processes mentioned above use a convolutional neural network generated with Google's open-source Inception V3 architecture to identify, from the first sub-image 30 or the second sub-image 40 on which object type identification is to be performed, the breed of each dog.

The convolutional neural network detection method applied to animals described above may further include the following steps in sequence: the processor 200 merges the first final object information 22 or the second final object information 41 into the image 10 to generate the detection image 50; the storage device 300 stores the image 10 and the detection image 50; and the image output device 400 outputs the detection image 50.

Although the YOLO architecture mentioned above has good computational performance, its accuracy still needs to be improved. Therefore, the newer Darknet-19 architecture, which contains 19 convolution layers and 5 max pooling layers as shown in Table 1, is used for feature extraction, and the classifier is pre-trained with the 1000-class ImageNet dataset.

Table 1: Darknet-19 architecture

| Layer | Operation type | Number of filters | Filter size / stride | Output size |
|---|---|---|---|---|
| 1 | Convolution | 32 | 3 x 3 | 224 x 224 |
| 2 | Max pooling | | 2 x 2 / 2 | 112 x 112 |
| 3 | Convolution | 64 | 3 x 3 | 112 x 112 |
| 4 | Max pooling | | 2 x 2 / 2 | 56 x 56 |
| 5 | Convolution | 128 | 3 x 3 | 56 x 56 |
| 6 | Convolution | 64 | 1 x 1 | 56 x 56 |
| 7 | Convolution | 128 | 3 x 3 | 56 x 56 |
| 8 | Max pooling | | 2 x 2 / 2 | 28 x 28 |
| 9 | Convolution | 256 | 3 x 3 | 28 x 28 |
| 10 | Convolution | 128 | 1 x 1 | 28 x 28 |
| 11 | Convolution | 256 | 3 x 3 | 28 x 28 |
| 12 | Max pooling | | 2 x 2 / 2 | 14 x 14 |
| 13 | Convolution | 512 | 3 x 3 | 14 x 14 |
| 14 | Convolution | 256 | 1 x 1 | 14 x 14 |
| 15 | Convolution | 512 | 3 x 3 | 14 x 14 |
| 16 | Convolution | 256 | 1 x 1 | 14 x 14 |
| 17 | Convolution | 512 | 3 x 3 | 14 x 14 |
| 18 | Max pooling | | 2 x 2 / 2 | 7 x 7 |
| 19 | Convolution | 1024 | 3 x 3 | 7 x 7 |
| 20 | Convolution | 512 | 1 x 1 | 7 x 7 |
| 21 | Convolution | 1024 | 3 x 3 | 7 x 7 |
| 22 | Convolution | 512 | 1 x 1 | 7 x 7 |
| 23 | Convolution | 1024 | 3 x 3 | 7 x 7 |
| 24 | Convolution | 1000 | 1 x 1 | 7 x 7 |
| 25 | Average pooling | | Global | 1000 |
| 26 | Classifier (Softmax) | | | |

When object detection is actually performed in this embodiment, layers 24 to 26 of Darknet-19, namely the classification convolution layer, the average pooling layer and the Softmax classifier shown in Table 1, are first removed and replaced with 3 convolution layers, and the input image resolution is changed to 416 x 416, as shown in Table 2. The overall architecture also contains two route layers and a reorganization (reorg) layer: the 26 x 26 x 512 feature map taken from an earlier layer by the first route layer is split into four 13 x 13 x 512 feature maps, which the reorg layer combines into a 13 x 13 x 2048 feature map; at the second route layer this is concatenated with the output of the 24th layer into a 13 x 13 x 3072 feature map, so that the network as a whole can perform finer-grained classification. The YOLO architecture used by the first and second object detection and positioning processes is shown in Table 2.

Table 2: YOLO architecture

| Layer | Operation type | Number of filters | Filter size / stride | Output size |
|---|---|---|---|---|
| 1 | Convolution | 32 | 3 x 3 | 416 x 416 |
| 2 | Max pooling | | 2 x 2 / 2 | 208 x 208 |
| 3 | Convolution | 64 | 3 x 3 | 208 x 208 |
| 4 | Max pooling | | 2 x 2 / 2 | 104 x 104 |
| 5 | Convolution | 128 | 3 x 3 | 104 x 104 |
| 6 | Convolution | 64 | 1 x 1 | 104 x 104 |
| 7 | Convolution | 128 | 3 x 3 | 104 x 104 |
| 8 | Max pooling | | 2 x 2 / 2 | 52 x 52 |
| 9 | Convolution | 256 | 3 x 3 | 52 x 52 |
| 10 | Convolution | 128 | 1 x 1 | 52 x 52 |
| 11 | Convolution | 256 | 3 x 3 | 52 x 52 |
| 12 | Max pooling | | 2 x 2 / 2 | 26 x 26 |
| 13 | Convolution | 512 | 3 x 3 | 26 x 26 |
| 14 | Convolution | 256 | 1 x 1 | 26 x 26 |
| 15 | Convolution | 512 | 3 x 3 | 26 x 26 |
| 16 | Convolution | 256 | 1 x 1 | 26 x 26 |
| 17 | Convolution | 512 | 3 x 3 | 26 x 26 |
| 18 | Max pooling | | 2 x 2 / 2 | 13 x 13 |
| 19 | Convolution | 1024 | 3 x 3 | 13 x 13 |
| 20 | Convolution | 512 | 1 x 1 | 13 x 13 |
| 21 | Convolution | 1024 | 3 x 3 | 13 x 13 |
| 22 | Convolution | 512 | 1 x 1 | 13 x 13 |
| 23 | Convolution | 1024 | 3 x 3 | 13 x 13 |
| 24 | Convolution | 1024 | 3 x 3 | 13 x 13 |
| 25 | Route | 16 | | |
| 26 | Reorg | | / 2 | 13 x 13 |
| 27 | Route | 26 24 | | |
| 28 | Convolution | 1024 | 3 x 3 | 13 x 13 |
| 29 | Convolution | 30 | 1 x 1 | 13 x 13 |
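The reorganization (reorg) step described above is essentially a space-to-depth rearrangement. The numpy sketch below shows a generic version of it under that assumption; it is not code from the patent, and the channel counts simply follow the figures quoted above.

```python
import numpy as np

def reorg(feature_map, stride=2):
    """Space-to-depth: (H, W, C) -> (H/stride, W/stride, C*stride*stride)."""
    h, w, c = feature_map.shape
    out = feature_map.reshape(h // stride, stride, w // stride, stride, c)
    out = out.transpose(0, 2, 1, 3, 4)
    return out.reshape(h // stride, w // stride, c * stride * stride)

x = np.zeros((26, 26, 512), dtype=np.float32)            # 26 x 26 x 512 feature map
passthrough = reorg(x)                                    # becomes 13 x 13 x 2048
deep = np.zeros((13, 13, 1024), dtype=np.float32)         # output of the later 13 x 13 layer
combined = np.concatenate([passthrough, deep], axis=-1)   # 13 x 13 x 3072
print(passthrough.shape, combined.shape)
```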

After the last convolution layer, YOLO outputs a 13 x 13 x 30 tensor, which represents YOLO's predictions. The tensor contains 13 x 13 vectors; each vector has 5 anchor boxes, and each anchor box has 6 values: the object prediction box values ($t_x$, $t_y$, $t_w$, $t_h$), the confidence score, and the class prediction. Here ($t_x$, $t_y$) is the offset of the box center coordinate, and ($t_w$, $t_h$) are the offsets of the prediction box's width and height. YOLO passes $t_x$ and $t_y$ through a sigmoid function to obtain $\sigma(t_x)$ and $\sigma(t_y)$, whose values represent the offsets in the x and y directions within the grid cell; because of the sigmoid, the center of the prediction box is constrained to lie inside the cell, which prevents it from drifting too far.
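A numpy sketch of decoding one anchor-box prediction, assuming the usual YOLOv2-style decoding in which the sigmoid-transformed offsets are added to the cell coordinates and the width/height offsets scale anchor priors through an exponential; the anchor prior sizes and input values below are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(tx, ty, tw, th, cell_x, cell_y, prior_w, prior_h, grid=13):
    """Decode one (tx, ty, tw, th) prediction of a 13 x 13 grid cell into a box
    normalized to [0, 1]; the sigmoid keeps the box center inside its cell."""
    bx = (cell_x + sigmoid(tx)) / grid
    by = (cell_y + sigmoid(ty)) / grid
    bw = prior_w * np.exp(tw) / grid
    bh = prior_h * np.exp(th) / grid
    return bx, by, bw, bh

# One made-up prediction in grid cell (6, 4) with an anchor prior of 3.5 x 2.0 cells.
print(decode_box(0.2, -0.4, 0.1, 0.3, cell_x=6, cell_y=4, prior_w=3.5, prior_h=2.0))
```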

The convolution layer operation mentioned above refers to the process in which the filters set in the convolution layer slide over the image 10 to extract its features (for example, circles, straight lines, triangles and so on). The filter sizes in Table 1 (for example, 1 x 1, 2 x 2 and 3 x 3) are given relative to the resolution of the input image (416 x 416), and the stride indicates how many units (for example, 1 or 2) the filter moves at each step, again relative to the input resolution; the convolution layer operation can be optimized through training. The number of convolution layers required and the number of filters in each layer depend on the kinds of objects to be detected in the image 10 and on the complexity or number of the features corresponding to those objects. Intuitively, more filters capture the object features more precisely, but the program complexity and amount of computation also increase substantially, so an appropriate configuration must be chosen.
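As a minimal illustration of a filter sliding over an image with a given size and stride, the following numpy sketch performs a single-channel "valid" convolution (cross-correlation); the example filter is an arbitrary edge filter chosen for demonstration.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide a single 2-D filter over a single-channel image (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for y in range(oh):
        for x in range(ow):
            patch = image[y * stride:y * stride + kh, x * stride:x * stride + kw]
            out[y, x] = np.sum(patch * kernel)      # one filter response
    return out

edge_filter = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=np.float32)  # 3 x 3, stride 1
print(conv2d(np.random.rand(8, 8).astype(np.float32), edge_filter).shape)       # (6, 6)
```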

Next, the max pooling layer operation mentioned above takes the image produced by the previous layer's operation (the convolution layer operation in Table 1), outputs the maximum value within each filter window, and, once the required stride is set, yields a compressed, reduced image.

When outputting information, each convolution layer and max pooling layer applies a leaky rectified linear activation function, which multiplies very small (negative) values by a correspondingly small weight value.
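The two operations described in the preceding paragraphs can be sketched as follows; the leaky slope of 0.1 is an assumed value, since the specification does not state it.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window (the 2 x 2 / 2 layers in Tables 1 and 2)."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow), dtype=feature_map.dtype)
    for y in range(oh):
        for x in range(ow):
            out[y, x] = feature_map[y * stride:y * stride + size,
                                    x * stride:x * stride + size].max()
    return out

def leaky_relu(z, slope=0.1):
    """Pass positive values through unchanged; scale negative values by a small weight."""
    return np.where(z > 0, z, slope * z)

fmap = np.random.randn(4, 4).astype(np.float32)
print(max_pool(fmap).shape)                 # (2, 2)
print(leaky_relu(np.array([-2.0, 0.5])))    # [-0.2  0.5]
```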

Since the distinguishing features of some dogs lie in the whole-body region while those of others lie in the head region, identification based on only certain parts is inevitably imprecise. Therefore, this disclosure uses the YOLO object detection method to detect the whole-body region image and the dog-head region image, then uses convolutional neural networks to extract features from each, and finally fuses the two feature vectors before classification. In order to extract the features in the image precisely, the InceptionV3 network architecture is adopted for feature extraction. Inception is Google's open-source convolutional network model; its architecture mainly combines convolution layers in parallel, whereas most convolutional network models combine them in series. The serial approach usually suffers from three problems: 1. too many parameters, which easily leads to overfitting; 2. high computational complexity, which makes it hard to apply; 3. vanishing gradients. Google therefore proposed the GoogLeNet architecture (InceptionV1), which uses convolution kernels of several sizes to increase the depth and width of the network and improve the performance of deep neural networks; it is built from two kinds of Inception modules composed of 1 x 1, 3 x 3 and 5 x 5 convolution kernels and a 1 x 1 pooling layer. Two auxiliary classifiers are added to the architecture; they are used only during training, mainly to avoid the vanishing-gradient problem when the network is too deep.
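The parallel-branch idea can be sketched with a naive Inception-style module in PyTorch; the branch widths and the use of a padded 3 x 3 pooling branch are illustrative choices, not the exact InceptionV1/V3 modules.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and a pooling branch, concatenated on channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, 24, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, 8, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # Every branch keeps the spatial size, so the outputs can be concatenated.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(1, 32, 35, 35)
print(NaiveInception(32)(x).shape)    # torch.Size([1, 80, 35, 35])
```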

To mitigate internal covariate shift, the usual remedy is to reduce the learning rate, but a learning rate that is too small slows convergence and lengthens training. Therefore, batch normalization (BN) was added to GoogLeNet; it is a normalization method that speeds up training and improves classification accuracy. Batch normalization is applied at every network layer, and each output is normalized. To further improve computational efficiency, factorizing convolutions are used in the InceptionV3 architecture. Factorizing convolutions address the excessive computation caused by oversized convolution kernels: a 5 x 5 convolution, for example, has 25/9 times as many parameters as a 3 x 3 convolution. Factorization reduces the number of parameters without affecting the recognition rate, and is applied in the following ways: 1. the 5 x 5 convolution kernels in the Inception module are replaced with two consecutive 3 x 3 kernels; 2. an n x n convolution is factorized into an n x 1 kernel and a 1 x n kernel; 3. a 3 x 3 convolution is factorized into a 3 x 1 kernel and a 1 x 3 kernel. Besides factorizing convolutions, when the pooling layer is used to reduce dimensionality, the size of the feature map shrinks sharply, creating a representational bottleneck during model training; reversing the order of the pooling layer and the Inception module would solve this problem, but would not reduce the number of parameters. Therefore, InceptionV3 merges the pooling layer into the Inception module.
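A quick arithmetic check of the parameter savings quoted above (per input and output channel, weights only):

```python
# Parameter counts for a single kernel (weights only, no bias).
p5x5 = 5 * 5                     # one 5 x 5 kernel: 25 weights
p3x3_twice = 2 * 3 * 3           # two stacked 3 x 3 kernels: 18 weights
p_asym = 3 * 1 + 1 * 3           # 3 x 1 followed by 1 x 3: 6 weights instead of 9

print(p5x5 / (3 * 3))            # 25/9 ≈ 2.78, the ratio quoted above
print(p5x5, p3x3_twice, p_asym)
```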

In addition to the aforementioned improvements, three convolution modules, 8 x 8, 17 x 17 and 35 x 35, were added to the architecture. Using factorizing convolutions, the n x n convolution kernels are decomposed into n x 1 and 1 x n kernel modules to increase network depth and reduce computation, and the input image resolution was changed from 224 x 224 to 299 x 299. Table 3 below shows the complete InceptionV3 architecture.

Table 3: InceptionV3 architecture

| Layer | Operation type | Filter size / stride | Input size |
|---|---|---|---|
| 1 | Convolution | 3 x 3 / 2 | 299 x 299 x 3 |
| 2 | Convolution | 3 x 3 / 1 | 149 x 149 x 32 |
| 3 | Convolution (zero-padded) | 3 x 3 / 1 | 147 x 147 x 32 |
| 4 | Pooling | 3 x 3 / 2 | 147 x 147 x 64 |
| 5 | Convolution | 3 x 3 / 1 | 73 x 73 x 64 |
| 6 | Convolution | 3 x 3 / 2 | 71 x 71 x 80 |
| 7 | Convolution | 3 x 3 / 1 | 35 x 35 x 192 |
| 8 | 3 x Inception | | 35 x 35 x 288 |
| 9 | 3 x Inception | | 17 x 17 x 768 |
| 10 | 3 x Inception | | 8 x 8 x 1280 |
| 11 | Pooling | 8 x 8 | 8 x 8 x 2048 |
| 12 | Linear | auxiliary classification | 1 x 1 x 2048 |
| 13 | Classifier (Softmax) | classification | 1 x 1 x 1000 |

The identification methods of the first and second object type identification processes described above include a recall function, a precision function and a mean intersection over union function. The recall function can be expressed as:

$$\text{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Gt_i \cap P_i \rvert}{\lvert Gt_i \rvert} \tag{5}$$

where $Gt_i$ is the ground-truth region of the i-th image whose type is to be identified, $P_i$ is the object region predicted for the i-th test image, and N is the total number of test images; the proportion of $Gt_i$ covered by its intersection with $P_i$ is the corresponding recall. The precision function can be expressed as:

$$\text{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Gt_i \cap P_i \rvert}{\lvert P_i \rvert} \tag{6}$$

Unlike the recall function, the precision function measures how many of the pixels $Gt_i$ of the image to be identified are among the correctly detected pixels $P_i$; the intersection divided by $P_i$ gives the corresponding precision. The mean intersection over union function can be expressed as:

$$\text{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Gt_i \cap P_i \rvert}{\lvert Gt_i \cup P_i \rvert} \tag{7}$$

The mean intersection over union divides the intersection of the ground-truth region $Gt_i$ and the predicted object region $P_i$ by their union. A recognition is judged successful if the probability value obtained by any one of these three measures is greater than 0.5; evaluating the recognition accuracy with all three measures gives higher credibility.
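A numpy sketch of formulas (5) to (7), assuming Gt_i and P_i are binary pixel masks; the masks below are invented for demonstration.

```python
import numpy as np

def region_metrics(gt_masks, pred_masks):
    """Per-image recall, precision and IoU on binary masks, averaged over N images,
    as in formulas (5)-(7)."""
    recalls, precisions, ious = [], [], []
    for gt, pred in zip(gt_masks, pred_masks):
        inter = np.logical_and(gt, pred).sum()
        union = np.logical_or(gt, pred).sum()
        recalls.append(inter / gt.sum())
        precisions.append(inter / pred.sum())
        ious.append(inter / union)
    return np.mean(recalls), np.mean(precisions), np.mean(ious)

gt = [np.zeros((8, 8), dtype=bool)]
pred = [np.zeros((8, 8), dtype=bool)]
gt[0][2:6, 2:6] = True          # ground-truth region
pred[0][3:7, 3:7] = True        # predicted region
print(region_metrics(gt, pred)) # (0.5625, 0.5625, 0.391...), all compared against 0.5
```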

In order for the first and second object detection and positioning processes mentioned above to extract the required information correctly, they must be trained with a corresponding loss function, listed here:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2 \\
&+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
\tag{8}
$$

In the above loss function, λcoord and λnoobj are the weighting factors for the coordinates of the box center (X, Y), for the width and height of the box (W, H), and for the box class; from the box center coordinates, box width and height, and box class, the confidence score of the box is obtained. The individual terms are defined as follows: $\hat{x}_i$ is the predicted X coordinate of the box, $\hat{y}_i$ the predicted Y coordinate, $\hat{w}_i$ the predicted width and $\hat{h}_i$ the predicted height of the box; $\mathbb{1}_{ij}^{obj}$ indicates whether there is an object in the (i, j)-th region (for example, when the image 10 or the first sub-image 30 is divided into multiple regions), taking the value 1 if an object is present and 0 otherwise; and $\mathbb{1}_{i}^{obj}$ indicates whether there is an object in the i-th object box, taking the value 1 if an object is present and 0 otherwise. The value of j is related to the parameter B, which is the number of boxes that can be detected in each region in the first or second object detection and positioning process; the value of i is related to the parameter S, which corresponds to the resolution of the image 10 or the first sub-image 30 (for example, 416 x 416 or 299 x 299).

XGBoost (eXtreme Gradient Boosting) is a classifier based on an improved extension of the Gradient Boosted Decision Tree (GBDT) algorithm and has been widely used in the past to solve supervised learning problems. A Gradient Boosted Decision Tree is a model formed from an ensemble of decision-tree prediction models; simply put, the results of multiple classification models are combined into a final classification result, and such a model can solve both regression and classification problems. The objective function of XGBoost is expressed as follows:

$$Obj = \sum_{i} l\big(y_i,\hat{y}_i\big) + \sum_{k}\Omega\big(f_k\big) \tag{9}$$

The term $\Omega(f_k)$ in XGBoost represents the penalty term. Since the goal of XGBoost is to improve performance without sacrificing accuracy, the penalty term is included to reduce the model's complexity.

Training a convolutional neural network requires a large amount of varied image data in order to train the model precisely. XGBoost does not need a large amount of data for training, but the features must first be defined manually before it can be trained. Considering that in practice it is quite difficult to collect a large number of images of dog breeds and that hardware resources are limited, the InceptionV3 model that has already been trained on the dataset is used and divided into two processing flows: feature extraction and classification. The feature extraction flow lies between the input layer and the classification layer, and the classification flow lies at the classification layer. The InceptionV3 feature extraction flow extracts a 2048-dimensional feature vector from the image, which is finally passed to XGBoost for classification.

Because the distinguishing features of some dog breeds are concentrated in the head while those of others are distributed over the body, the feature information of the head and of the body is fused before breed identification in order to identify the breed precisely. First, YOLO is used on the whole-body region image to detect the head; after the head region image is obtained, InceptionV3 is used for feature extraction to obtain a 2048-dimensional head feature vector. The 2048-dimensional whole-body feature vector and the 2048-dimensional head feature vector are then fused into a 4096-dimensional feature vector, and finally XGBoost performs the classification to obtain the breed identification result.
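A sketch of the fusion and classification step under stated assumptions: random vectors stand in for the 2048-dimensional InceptionV3 embeddings of the whole-body and head crops, and the XGBoost hyperparameters are arbitrary illustrative values.

```python
import numpy as np
from xgboost import XGBClassifier   # requires the xgboost package

rng = np.random.default_rng(0)
n_samples, n_breeds = 200, 5

# Stand-ins for InceptionV3 embeddings of the whole-body and head crops (2048-d each).
body_feats = rng.normal(size=(n_samples, 2048)).astype(np.float32)
head_feats = rng.normal(size=(n_samples, 2048)).astype(np.float32)
labels = rng.integers(0, n_breeds, size=n_samples)

fused = np.concatenate([body_feats, head_feats], axis=1)   # 4096-d fused vectors

clf = XGBClassifier(n_estimators=50, max_depth=4, learning_rate=0.1)
clf.fit(fused[:150], labels[:150])
print(clf.predict(fused[150:155]))   # predicted breed indices for a few held-out samples
```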

The image recognition described above can be used for dog registration and management, whose purpose is to obtain information about a dog, for example whether it has an owner, when and where it was vaccinated, and whether it has been neutered. Compared with identification through a microchip or nose print, obtaining an image is relatively easy, and the resolution does not need to be very high; it is only required that the facial features are clear, for example that at least both eyes appear in the face.

Since dogs, unlike humans, cannot be asked to stay still when photographed, their photos are often tilted, and recognition on such photos is not very accurate. Therefore, before face recognition, YOLO is first used to detect the positions of the eyes; the centers of the two eyes are joined by a straight line and the slope m of this line is computed. Since m = tan(θ), the arctangent function can be used to obtain the angle θ with respect to the x-axis (horizontal axis), and once θ is known the face is rotated so that the line through the eyes is parallel to the horizontal axis. In addition, identity recognition is used to obtain the dog's identity information. If image data that is not registered in the database appears, the classifier cannot be used directly, because when the classifier encounters an image it has never seen, it will still assign it to the most similar class. Therefore, InceptionV3 is used for feature extraction, and after the feature vector is obtained, a similarity computation is performed and compared with a threshold to decide whether the images are similar. The similarity is expressed as follows:
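The angle computation can be sketched as follows; the eye coordinates are made up, and only the angle (not the image rotation itself) is shown.

```python
import math

def eye_alignment_angle(left_eye, right_eye):
    """Angle (degrees) between the line through both eye centers and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    m = dy / dx                      # slope of the line through the two eyes
    return math.degrees(math.atan(m))

theta = eye_alignment_angle((120, 200), (220, 180))
print(theta)   # about -11.3 degrees; rotate the face by this angle to level the eyes
```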

$$\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \tag{10}$$
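A sketch of the similarity comparison, assuming the similarity of formula (10) is the cosine of the angle between the two feature vectors; the threshold of 0.9 is an arbitrary illustrative value.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_dog(feat_query, feat_registered, threshold=0.9):
    """Treat the query as a registered dog only when the similarity exceeds the threshold."""
    return cosine_similarity(feat_query, feat_registered) >= threshold

rng = np.random.default_rng(1)
registered = rng.normal(size=2048)
print(same_dog(registered + 0.05 * rng.normal(size=2048), registered))   # True: near-identical
print(same_dog(rng.normal(size=2048), registered))                       # False: unrelated
```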

Refer to Figure 6, which is a schematic diagram of the convolutional neural network detection system 5 applied to animals according to an embodiment of the present invention. As shown in the figure, it includes the image capturing device 100, the processor 200, the storage device 300 and the image output device 400. This convolutional neural network detection system 5 applied to animals can execute the convolutional neural network detection method applied to animals described above (steps S1 to S6); in other words, the system 5 has components corresponding to the execution of steps S1 to S6.

應用於動物的卷積神經網路偵測系統5之影像擷取裝置100係產生影像10,此即對應至第1圖之步驟S1。The image capturing device 100 of the convolutional neural network detection system 5 applied to animals generates an image 10, which corresponds to step S1 in FIG. 1.

第6圖中之處理器200,可包含物件偵測及定位器201及物件種類辨識器202等子元件。處理器200即可對應進行第1圖之步驟S2至S6。The processor 200 in FIG. 6 may include sub-components such as an object detection and locator 201 and an object type recognizer 202. The processor 200 can correspondingly perform steps S2 to S6 in FIG. 1.

更具體地說,物件偵測及定位器201係接收影像10,進行第一物件偵測及定位程序之後,萃取出第一物件資訊21及對應的第一子影像30。物件偵測及定位器201係接收第一子影像30,進行第二物件偵測及定位程序之後,萃取出第二物件資訊31及對應的第二子影像40。More specifically, the object detection and locator 201 receives the image 10, and after performing the first object detection and positioning process, extracts the first object information 21 and the corresponding first sub-image 30. The object detection and locator 201 receives the first sub-image 30, and after performing the second object detection and positioning process, extracts the second object information 31 and the corresponding second sub-image 40.

更具體地說，物件種類辨識器202係接收第一子影像30，萃取出第一最終物件資訊22。物件種類辨識器202係接收第二子影像40，萃取出複數個第二最終物件資訊41。More specifically, the object type recognizer 202 receives the first sub-image 30 and extracts the first final object information 22. The object type recognizer 202 likewise receives the second sub-image 40 and extracts a plurality of second final object information 41.

更具體地說,處理器200可以合併第一最終物件資訊22或第二最終物件資訊41至影像10中,輸出偵測影像50。More specifically, the processor 200 can merge the first final object information 22 or the second final object information 41 into the image 10 to output the detection image 50.

更具體地說,儲存裝置300係連接處理器200,儲存裝置300可以儲存上述影像10及偵測影像50。影像輸出裝置400係連接儲存裝置300,影像輸出裝置400可以輸出偵測影像50。More specifically, the storage device 300 is connected to the processor 200, and the storage device 300 can store the above-mentioned image 10 and the detected image 50. The image output device 400 is connected to the storage device 300, and the image output device 400 can output the detected image 50.
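To make the data flow among the components of FIG. 6 easier to follow, a structural sketch is given below; the class names, method names and the draw_overlay helper are invented for illustration and are not part of this disclosure.

from dataclasses import dataclass

@dataclass
class ObjectInfo:
    box: tuple            # (x, y, w, h) position of the detected dog
    label: str = ""       # object type decided by the recognizer
    score: float = 0.0    # probability value of that type

def draw_overlay(image, info):
    # Placeholder: a real system would draw info.box and info.label onto the image.
    return image

class DetectionSystem:
    def __init__(self, detector, recognizer, storage, display):
        self.detector = detector      # object detection and locator 201
        self.recognizer = recognizer  # object type recognizer 202
        self.storage = storage        # storage device 300
        self.display = display        # image output device 400

    def process(self, image):
        info, sub_image = self.detector.detect(image)           # detection and positioning
        final_info = self.recognizer.identify(sub_image, info)  # object type identification
        detection_image = draw_overlay(image, final_info)       # merge the final object information
        self.storage.save(image, detection_image)
        self.display.show(detection_image)
        return detection_image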

以上所述僅為舉例性,而非為限制性者。任何未脫離本發明之精神與範疇,而對其進行之等效修改或變更,均應包含於後附之申請專利範圍中。The above descriptions are merely illustrative and not restrictive. Any equivalent modifications or alterations that do not depart from the spirit and scope of the present invention should be included in the scope of the appended patent application.

1a:狗 / 1a: Dog
1b:行人 / 1b: Pedestrian
5:卷積神經網路偵測系統 / 5: Convolutional neural network detection system
10:影像 / 10: Image
21:第一物件資訊 / 21: First object information
22:第一最終物件資訊 / 22: First final object information
30:第一子影像 / 30: First sub-image
31:第二物件資訊 / 31: Second object information
40:第二子影像 / 40: Second sub-image
41:第二最終物件資訊 / 41: Second final object information
50:偵測影像 / 50: Detection image
S1~S6:步驟 / S1~S6: Steps
100:影像擷取裝置 / 100: Image capture device
200:處理器 / 200: Processor
201:物件偵測及定位器 / 201: Object detection and locator
202:物件種類辨識器 / 202: Object type recognizer
300:儲存裝置 / 300: Storage device
400:顯示裝置 / 400: Display device

第1圖係根據本發明實施例之應用於動物的卷積神經網路偵測方法之步驟流程圖。 第2圖係根據本發明的實施例的應用於動物的卷積神經網路偵測方法之欲偵測影像示意圖。 第3圖係根據本發明的實施例的應用於動物的卷積神經網路偵測方法之進行第一物件偵測及定位程序及第一物件種類辨識程序之後的示圖。 第4圖係根據本發明的實施例的應用於動物的卷積神經網路偵測方法之進行第二物件偵測及定位程序之後的示意圖。 第5圖係根據本發明的實施例的應用於動物的卷積神經網路偵測方法之進行加權運算後的示意圖。 第6圖係根據本發明的實施例的應用於動物的卷積神經網路偵測系統之示意圖。Figure 1 is a flow chart of the steps of the convolutional neural network detection method applied to animals according to an embodiment of the present invention. FIG. 2 is a schematic diagram of an image to be detected in the convolutional neural network detection method applied to animals according to an embodiment of the present invention. FIG. 3 is a diagram after the first object detection and positioning process and the first object type identification process of the convolutional neural network detection method applied to animals according to the embodiment of the present invention. FIG. 4 is a schematic diagram after the second object detection and positioning process of the convolutional neural network detection method applied to animals according to the embodiment of the present invention. FIG. 5 is a schematic diagram of the convolutional neural network detection method applied to animals after weighting operation according to an embodiment of the present invention. Figure 6 is a schematic diagram of a convolutional neural network detection system applied to animals according to an embodiment of the present invention.

S1~S6:步驟S1~S6: steps

Claims (8)

一種應用於動物的卷積神經網路偵測方法，其包含下列步驟：藉由一影像擷取裝置產生一影像；藉由一處理器對該影像進行一第一物件偵測及定位程序，從該影像萃取出一第一物件資訊及對應該第一物件資訊之一第一子影像；藉由該處理器對該第一子影像進行一第一物件種類辨識程序，萃取出複數個第一物件種類機率值，若該複數個第一物件種類機率值之最大值大於等於一第一預設值，則合併該第一物件資訊及該第一物件種類機率值之最大值，產生一第一最終物件資訊；若該複數個第一物件種類機率值之最大值小於該第一預設值，則藉由該處理器對該第一子影像進行一第二物件偵測及定位程序，從該第一子影像萃取出一第二物件資訊及一第二子影像；藉由該處理器對該第二子影像進行一第二物件種類辨識程序，萃取出複數個第二物件種類機率值；以及藉由該處理器對該複數個第一物件種類機率值及對應的複數個第一權重值，以及該複數個第二物件種類機率值及對應的複數個第二權重值進行一加權計算，產生對應的複數個最終物件種類機率值，且合併該第一物件資訊、該第二物件資訊及該複數個最終物件種類機率值之最大值，產生一第二最終物件資訊。A convolutional neural network detection method applied to animals, comprising the following steps: generating an image by an image capturing device; performing, by a processor, a first object detection and positioning process on the image to extract from the image a first object information and a first sub-image corresponding to the first object information; performing, by the processor, a first object type identification process on the first sub-image to extract a plurality of first object type probability values, and, if the maximum of the plurality of first object type probability values is greater than or equal to a first preset value, combining the first object information and the maximum of the first object type probability values to generate a first final object information; if the maximum of the plurality of first object type probability values is less than the first preset value, performing, by the processor, a second object detection and positioning process on the first sub-image to extract from the first sub-image a second object information and a second sub-image; performing, by the processor, a second object type identification process on the second sub-image to extract a plurality of second object type probability values; and performing, by the processor, a weighted calculation on the plurality of first object type probability values with their corresponding plurality of first weight values and on the plurality of second object type probability values with their corresponding plurality of second weight values to generate a corresponding plurality of final object type probability values, and combining the first object information, the second object information and the maximum of the plurality of final object type probability values to generate a second final object information.

如申請專利範圍第1項所述之應用於動物的卷積神經網路偵測方法，其中該第一物件偵測及定位程序及該第二物件偵測及定位程序應用於狗的偵測及定位，且該第一物件種類辨識程序及該第二物件種類辨識程序應用於狗的種類辨識。The convolutional neural network detection method applied to animals as described in claim 1, wherein the first object detection and positioning process and the second object detection and positioning process are applied to the detection and positioning of dogs, and the first object type identification process and the second object type identification process are applied to the identification of dog breeds.

如申請專利範圍第1項所述之應用於動物的卷積神經網路偵測方法，其進一步包含以下步驟：藉由該處理器，合併該第一最終物件資訊或該第二最終物件資訊至該影像，產生一偵測影像；藉由一儲存裝置儲存上述該影像及該偵測影像；以及藉由一影像輸出裝置，輸出該偵測影像。The convolutional neural network detection method applied to animals as described in claim 1, further comprising the following steps: combining, by the processor, the first final object information or the second final object information into the image to generate a detection image; storing the image and the detection image by a storage device; and outputting the detection image by an image output device.
如申請專利範圍第1項所述之應用於動物的卷積神經網路偵測方法，其中該第一物件偵測及定位程序及該第二物件偵測及定位程序包含五層最大值池化層運算、二十二層卷積層運算及各自對應之一帶洩漏線性整流激活函數運算。The convolutional neural network detection method applied to animals as described in claim 1, wherein the first object detection and positioning process and the second object detection and positioning process comprise five max-pooling layer operations, twenty-two convolutional layer operations, and a corresponding leaky rectified linear unit (leaky ReLU) activation function operation for each.

如申請專利範圍第1項所述之應用於動物的卷積神經網路偵測方法，其中該第一物件種類辨識程序及該第二物件種類辨識程序包含三層卷積核運算、六層卷積層運算及兩層最大值池化層運算。The convolutional neural network detection method applied to animals as described in claim 1, wherein the first object type identification process and the second object type identification process comprise three convolution kernel operations, six convolutional layer operations, and two max-pooling layer operations.

如申請專利範圍第3項所述之應用於動物的卷積神經網路偵測方法，其中該第一物件種類辨識程序及該第二物件種類辨識程序包含一種類辨識方法，該種類辨識方法包括一召回率函數、一精確率函數及一平均交並比函數。The convolutional neural network detection method applied to animals as described in claim 3, wherein the first object type identification process and the second object type identification process comprise a type identification method, and the type identification method includes a recall function, a precision function, and a mean intersection-over-union function.

如申請專利範圍第1項所述之應用於動物的卷積神經網路偵測方法，其中該第一物件偵測及定位程序及該第二物件偵測及定位程序包含一損失函數。The convolutional neural network detection method applied to animals as described in claim 1, wherein the first object detection and positioning process and the second object detection and positioning process comprise a loss function.

一種應用於動物的卷積神經網路偵測系統，其包含：一影像擷取裝置，係產生一影像；一處理器，係連接於該影像擷取裝置，其包含：一物件偵測及定位器，係接收該影像擷取裝置產生之該影像，輸出一第一物件資訊及一第一子影像，且輸入該第一子影像，輸出一第二物件資訊；一物件種類辨識器，係接收該第一物件資訊及該第二物件資訊，且輸出一第一最終物件資訊及一第二最終物件資訊；其中，藉由該處理器合併該第一最終物件資訊或該第二最終物件資訊至該影像中，輸出一偵測影像；一儲存裝置，係連接該處理器，儲存該影像及該偵測影像；以及一影像輸出裝置，係連接該儲存裝置，輸出該偵測影像。A convolutional neural network detection system applied to animals, comprising: an image capturing device that generates an image; a processor connected to the image capturing device and comprising: an object detection and locator that receives the image generated by the image capturing device, outputs a first object information and a first sub-image, and, taking the first sub-image as input, outputs a second object information; and an object type recognizer that receives the first object information and the second object information and outputs a first final object information and a second final object information; wherein the processor combines the first final object information or the second final object information into the image to output a detection image; a storage device connected to the processor to store the image and the detection image; and an image output device connected to the storage device to output the detection image.
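As a worked illustration of the weighting calculation recited in claim 1, the sketch below combines the first and second object type probability values with their weight values; the numerical probabilities and the equal weights are hypothetical, since the disclosure does not fix their values here.

import numpy as np

def weighted_fusion(p1, p2, w1, w2):
    # p1 / p2: first and second object type probability values; w1 / w2: their weight values.
    fused = w1 * p1 + w2 * p2
    return fused / fused.sum()  # optional renormalization so the result sums to 1

p1 = np.array([0.40, 0.35, 0.25])  # first object type probability values (illustrative)
p2 = np.array([0.20, 0.60, 0.20])  # second object type probability values (illustrative)
w1 = np.full(3, 0.5)               # first weight values (assumed equal here)
w2 = np.full(3, 0.5)               # second weight values (assumed equal here)

final_probs = weighted_fusion(p1, p2, w1, w2)
best_type = int(np.argmax(final_probs))  # its maximum enters the second final object information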
TW109101571A 2020-01-16 2020-01-16 Convolutional neural network detection method and system for animals TWI728655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109101571A TWI728655B (en) 2020-01-16 2020-01-16 Convolutional neural network detection method and system for animals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109101571A TWI728655B (en) 2020-01-16 2020-01-16 Convolutional neural network detection method and system for animals

Publications (2)

Publication Number Publication Date
TWI728655B TWI728655B (en) 2021-05-21
TW202129593A true TW202129593A (en) 2021-08-01

Family

ID=77036693

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109101571A TWI728655B (en) 2020-01-16 2020-01-16 Convolutional neural network detection method and system for animals

Country Status (1)

Country Link
TW (1) TWI728655B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI799181B (en) * 2022-03-10 2023-04-11 國立臺中科技大學 Method of establishing integrate network model to generate complete 3d point clouds from sparse 3d point clouds and segment parts

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI649698B (en) * 2017-12-21 2019-02-01 財團法人工業技術研究院 Object detection device, object detection method, and computer readable medium
CN110348270B (en) * 2018-04-03 2023-06-09 扬智科技股份有限公司 Image object identification method and image object identification system
TWI681343B (en) * 2018-12-14 2020-01-01 瑞昱半導體股份有限公司 Object tracking system, object tracking method, and non-transitory computer readable medium

Also Published As

Publication number Publication date
TWI728655B (en) 2021-05-21

Similar Documents

Publication Publication Date Title
Luo et al. Traffic sign recognition using a multi-task convolutional neural network
Cong et al. Video anomaly search in crowded scenes via spatio-temporal motion context
US8165397B2 (en) Identifying descriptor for person or object in an image
US20210264144A1 (en) Human pose analysis system and method
KR101781358B1 (en) Personal Identification System And Method By Face Recognition In Digital Image
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN114783003B (en) Pedestrian re-identification method and device based on local feature attention
CN111783576A (en) Pedestrian re-identification method based on improved YOLOv3 network and feature fusion
CN111814690B (en) Target re-identification method, device and computer readable storage medium
CN114677633A (en) Multi-component feature fusion-based pedestrian detection multi-target tracking system and method
Yang et al. Increaco: incrementally learned automatic check-out with photorealistic exemplar augmentation
Andiani et al. Face recognition for work attendance using multitask convolutional neural network (MTCNN) and pre-trained facenet
Gawande et al. Real-time deep learning approach for pedestrian detection and suspicious activity recognition
TWI728655B (en) Convolutional neural network detection method and system for animals
CN111539257B (en) Person re-identification method, device and storage medium
CN111027434B (en) Training method and device of pedestrian recognition model and electronic equipment
Gerhardt et al. Neural network-based traffic sign recognition in 360° images for semi-automatic road maintenance inventory
CN111967289A (en) Uncooperative human face in-vivo detection method and computer storage medium
US20240087352A1 (en) System for identifying companion animal and method therefor
CN113947780A (en) Sika deer face recognition method based on improved convolutional neural network
Chahyati et al. Man woman detection in surveillance images
CN111401286A (en) Pedestrian retrieval method based on component weight generation network
al Atrash et al. Detecting and Counting People's Faces in Images Using Convolutional Neural Networks
CN112699846B (en) Specific character and specific behavior combined retrieval method and device with identity consistency check function
Ingale et al. Deep Learning for Crowd Image Classification for Images Captured Under Varying Climatic and Lighting Condition