TWI780409B - Method and system for training object detection model

Info

Publication number: TWI780409B
Application number: TW109105277A
Authority: TW
Inventors: 陳逸夫; 柳恆崧; 王才沛
Original assignee: 中華電信股份有限公司
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2022-10-11
Also published as: TW202133037A

Abstract

The invention provides a method and a system for training an object detection model. The method includes: obtaining a target object image and a labeling result, wherein the target object image includes a target object and the labeling result includes a label corresponding to the target object; obtaining a specific image of a specific scene captured by a fixed camera; obtaining, based on the specific image, a background image corresponding to the specific image; synthesizing the target object and the background image into a training image; and training a target object detection model dedicated to the fixed camera based on the training image and the labeling result corresponding to the target object.

Description

Method and system for training object detection model

The present invention relates to a model training technique, and in particular to a method and system for training an object detection model.

With the rapid development of cameras, network technology, and artificial intelligence, applications of intelligent video surveillance have grown substantially. In surveillance systems, detecting specific types of objects (such as people and vehicles) is a core function. Existing methods for detecting specific types of objects in images, including those based on deep neural networks, learn the features of the target object types by training on large amounts of labeled data.

In general, labeled data contains only positive samples (the types of objects to be detected), while negative samples are obtained automatically from the backgrounds of the training images. However, the diversity of the negative samples is limited by the training data. For highly variable images, background is therefore often misjudged as a target-type object, which leads to false detections.

In addition, to handle extremely diverse negative samples, models with good recognition performance tend to be large and complex and require more computing resources. The practical limitation is that the camera device cannot bear the computational load of the detection model, so the image data must be transmitted over the network to a server for detection, which burdens both network bandwidth and server computation.

Generally speaking, the difficulties in using the existing technology stem in part from the requirement that the object detection model be general-purpose, that is, that the same model be usable across diverse images and scenes. In video surveillance, each camera is responsible for its own scene, and the false detections that tend to arise differ from scene to scene. For a camera responsible for only a single scene (usually from a single viewing angle), however, the false detections produced in its images are highly repetitive.

Therefore, if each camera has its own detection model, and that model has been trained on the camera's scene to avoid the false detections of that scene, a simpler model can achieve the required detection performance. This also makes it more feasible to place the object detection function on the camera side rather than the server side (the concept of edge computing), reducing the demands on server computing power and network bandwidth.

In view of this, the present invention provides a method and system for training an object detection model, which can be used to solve the above technical problems.

The present invention provides a method for training an object detection model, including: obtaining a target object image and a labeling result, wherein the target object image includes at least one target object and the labeling result includes a label corresponding to each target object; obtaining at least one first specific image of a first specific scene captured by a first fixed camera; obtaining, according to the at least one first specific image, a first background image corresponding to the at least one first specific image; synthesizing the at least one target object and the first background image into a first training image; and training a first target object detection model dedicated to the first fixed camera based on the first training image and the labeling result corresponding to each target object.

The present invention provides a system for training an object detection model, which includes a data subsystem and a training subsystem. The data subsystem is configured to: obtain a target object image, wherein the target object image includes at least one target object and a labeling result corresponding to each target object; obtain at least one first specific image of a first specific scene captured by a first fixed camera; obtain, according to the at least one first specific image, a first background image corresponding to the at least one first specific image; and synthesize the at least one target object and the first background image into a first training image. The training subsystem trains a first target object detection model dedicated to the first fixed camera based on the first training image and the labeling result corresponding to each target object.

Please refer to FIG. 1, which is a schematic diagram of a system for training an object detection model according to an embodiment of the present invention. As shown in FIG. 1, the system 10 includes a detection subsystem 200, a training subsystem 300, and a data subsystem 400. In different embodiments, the detection subsystem 200, the training subsystem 300, and the data subsystem 400 may be implemented as independent apparatuses/devices (such as personal computers, servers, or workstations) or integrated into a single apparatus/device, but the present invention is not limited thereto.

In the embodiments of the present invention, the detection subsystem 200, the training subsystem 300, and the data subsystem 400 can cooperate to implement the method for training an object detection model proposed by the present invention. In brief, the proposed method generates new training data from general-purpose positive data and from negative data provided by a particular fixed camera, and uses this training data to train a target object detection model dedicated to that fixed camera. The relevant details are described below.

For ease of description, the first fixed camera 100 in FIG. 1 is taken as an example below. In the embodiments of the present invention, the first fixed camera 100 is, for example, a camera that is fixedly installed at a specific location and configured to capture a first specific scene 199 based on a fixed imaging range.

Please refer to FIG. 2, which is a flowchart of a method for training an object detection model according to an embodiment of the present invention. The method of this embodiment can be executed by the system 10 in FIG. 1, and the details of each step in FIG. 2 are described below with reference to the components shown in FIG. 1.

First, in step S210, the data subsystem 400 can obtain the target object image and the labeling result. For ease of understanding, FIG. 3 is used below for further illustration.

Please refer to FIG. 3, which is a schematic diagram of a target object image and a labeling result according to an embodiment of the present invention. In this embodiment, the target object image 310 may, for example, include only the target objects 310a~310e (such as people) and no other image components (such as background). In addition, the labeling result 320 corresponding to the target object image 310 may include labels 320a~320e corresponding to the target objects 310a~310e.

In the embodiments of the present invention, the target objects 310a~310e in the target object image 310 can be understood as the positive data mentioned above, but the present invention is not limited thereto. In other embodiments, besides being combined with the negative data corresponding to the first fixed camera 100 to train the first target object detection model M1 dedicated to the first fixed camera 100, the target object image 310 (and the labeling result 320) can also be used to train target object detection models dedicated to other fixed cameras. In other words, the target object image 310 (and the labeling result 320) can be used broadly to train the target object detection models of multiple fixed cameras; the relevant details are described later.

Next, in step S220, the data subsystem 400 can obtain the first specific image of the first specific scene 199 captured by the first fixed camera 100. Then, in step S230, the data subsystem 400 can obtain, according to the first specific image, a first background image corresponding to the first specific image. For ease of understanding, FIG. 4 is used below for further illustration.

Please refer to FIG. 4, which is a schematic diagram of a first specific image and its corresponding first background image according to an embodiment of the present invention. In this embodiment, the first specific image 410 is, for example, an image of the first specific scene 199 captured by the first fixed camera 100. In different embodiments, the first fixed camera 100 may periodically or aperiodically transmit captured images to the data subsystem 400 as first specific images 410, so as to reduce the associated bandwidth requirement, but is not limited thereto.

After obtaining multiple first specific images 410 corresponding to the same scene (i.e., the first specific scene 199), the data subsystem 400 can obtain the first background image 420 corresponding to the one or more first specific images 410 through, for example, image averaging, background separation, or image processing techniques with similar effects. As shown in FIG. 4, the obtained first background image 420 includes only the background objects in the first specific scene 199 (such as objects 420a~420c) and does not include foreground objects and/or target objects (such as people), but is not limited thereto. In the embodiments of the present invention, the first background image 420 can be understood as the negative data mentioned above, but the present invention is not limited thereto.
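As a rough illustration of how a background image might be derived from several frames of the same fixed scene, the sketch below takes a per-pixel temporal median. The median is a swapped-in example of a technique "with similar effects" to image averaging or background separation, not the specific method of this disclosure. The function name `estimate_background` and the use of nested lists of grayscale values in place of real image arrays are assumptions made to keep the example self-contained.

```python
from statistics import median


def estimate_background(frames):
    """Return the per-pixel temporal median of same-sized grayscale frames.

    frames: list of images, each a list of rows of pixel values
    (a hypothetical stand-in for real image arrays).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    height, width = len(frames[0]), len(frames[0][0])
    # A pixel covered by a passing object in only a minority of frames
    # keeps its background value, because the median ignores outliers.
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]
```

With an odd number of frames the result keeps the original pixel type; a transient foreground object that covers a pixel in fewer than half of the frames does not appear in the output.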

Next, in step S240, the data subsystem 400 can synthesize the target objects 310a~310e and the first background image 420 into a first training image, which can be understood as containing both positive data and negative data. Then, in step S250, the training subsystem 300 can train the first target object detection model M1 dedicated to the first fixed camera 100 based on the first training image and the labeling result 320 corresponding to the target objects 310a~310e. For ease of understanding, FIG. 5 is used below for further illustration.

Please refer to FIG. 5, which is a schematic diagram of the first training image and the corresponding labeling result, based on FIG. 3 and FIG. 4. In this embodiment, the first training image 510 is obtained, for example, by the data subsystem 400 inserting the target objects 310a~310e into the first background image 420, but is not limited thereto. In some embodiments, when the data subsystem 400 synthesizes the target objects 310a~310e with the first background image 420, it can also apply data augmentation to the target objects 310a~310e, such as scaling, rotation, translation, color adjustment, and partial cropping.
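The synthesis step can be pictured as pasting cut-out objects onto the background and recording where each one landed. The following is a minimal sketch under stated assumptions: images are nested lists of grayscale values, each object carries a binary mask, and only translation (random placement) is shown out of the augmentations listed above. `TargetObject`, `synthesize_training_image`, and the annotation layout are hypothetical names chosen for illustration, not structures defined by this disclosure.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class TargetObject:
    """A cut-out target object (hypothetical structure, not from the source)."""
    pixels: list  # rows of grayscale values
    mask: list    # rows of 0/1; 1 marks pixels that belong to the object
    label: str    # class label, e.g. "person"


def synthesize_training_image(background, objects, rng):
    """Paste each object at a random position; return (image, annotations)."""
    image = copy.deepcopy(background)  # keep the background reusable
    bg_h, bg_w = len(image), len(image[0])
    annotations = []
    for obj in objects:
        h, w = len(obj.pixels), len(obj.pixels[0])
        if h > bg_h or w > bg_w:
            raise ValueError("object is larger than the background")
        # Translation augmentation: pick a random top-left corner that fits.
        top = rng.randint(0, bg_h - h)
        left = rng.randint(0, bg_w - w)
        for dy in range(h):
            for dx in range(w):
                if obj.mask[dy][dx]:
                    image[top + dy][left + dx] = obj.pixels[dy][dx]
        # The label travels with the object; the box follows from the paste.
        annotations.append({"label": obj.label, "box": (left, top, w, h)})
    return image, annotations


# Usage: one 2x2 "person" pasted onto an empty 4x6 background.
background = [[0] * 6 for _ in range(4)]
person = TargetObject(pixels=[[9, 9], [9, 9]], mask=[[1, 0], [1, 1]], label="person")
image, annotations = synthesize_training_image(background, [person], random.Random(0))
```

Because the paste position is chosen by the program, the bounding box for each object is known without any manual labeling of the camera's scene, which is the property the method relies on.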

In this way, the data subsystem 400 can generate training data dedicated to the first fixed camera 100, and the training subsystem 300 can use this data, together with the corresponding labeling result 320, to train the first target object detection model M1 dedicated to the first fixed camera 100.

Furthermore, unlike conventional, more general-purpose object detection models, the first target object detection model M1 trained by the method of the present invention is dedicated to detecting target objects (such as people) appearing in the first specific scene 199. It can therefore effectively improve detection performance and reduce the probability of false detection.

In one embodiment, to reduce the bandwidth requirement, the detection subsystem 200 may also be deployed in the first fixed camera 100, but is not limited thereto. After the training subsystem 300 finishes training the first target object detection model M1, it can transmit the model to the detection subsystem 200. The detection subsystem 200 can then obtain a first image IM1 of the first specific scene 199 captured by the first fixed camera 100 and use the first target object detection model M1 to detect target objects (such as people) appearing in the first image IM1. In some embodiments, after detecting the target objects in the first image IM1, the detection subsystem 200 can generate a corresponding detection result, which may include information such as the category, size, position, outline, and confidence value of each target object, but is not limited thereto.

In other embodiments, the first image IM1 can also serve as one of the first specific images 410 mentioned above, so as to further refine the first background image 420. The data subsystem 400 can then synthesize the refined first background image 420 with target objects from other target object images into additional training images, which the training subsystem 300 uses to further train the first target object detection model M1 and obtain better detection performance, but the present invention is not limited thereto.
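One way to picture this refinement is a running average that blends each new frame into the stored background while leaving untouched any pixels the detector has flagged as target objects. This is only an illustrative sketch: the blending rule, the `alpha` weight, the box format `(left, top, width, height)`, and the function name `refine_background` are all assumptions, since the text does not specify how the background is refined.

```python
def refine_background(background, frame, detected_boxes, alpha=0.1):
    """Blend a new frame into the background, skipping detected objects.

    detected_boxes: (left, top, width, height) tuples from the detector
    (hypothetical format). alpha: weight given to the new frame.
    """
    boxes = list(detected_boxes)

    def is_foreground(x, y):
        # True when the pixel lies inside any detected bounding box.
        return any(
            left <= x < left + w and top <= y < top + h
            for left, top, w, h in boxes
        )

    return [
        [
            b if is_foreground(x, y) else (1 - alpha) * b + alpha * f
            for x, (b, f) in enumerate(zip(bg_row, frame_row))
        ]
        for y, (bg_row, frame_row) in enumerate(zip(background, frame))
    ]
```

Skipping the detected regions keeps people or vehicles in the new frame from bleeding into the background, while gradual scene changes such as lighting are still absorbed.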

In addition, as mentioned above, the system 10 of the present invention can also train target object detection models dedicated to other fixed cameras based on the target object image 310. For example, the data subsystem 400 may be configured to: obtain a second specific image of a second specific scene captured by a second fixed camera; obtain, according to the second specific image, a second background image corresponding to the second specific image; and synthesize the target objects and the second background image into a second training image. The training subsystem 300 can then train a second target object detection model dedicated to the second fixed camera based on the second training image and the labeling result corresponding to each target object. For the details of these technical means, refer to the descriptions of the previous embodiments; they are not repeated here.
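The reuse of one labeled target object set across cameras can be sketched as a simple loop in which only the background changes. The names `train_camera_models`, `fake_synthesize`, and `fake_train` below are hypothetical, and the two callables are stand-ins so the example runs; a real system would composite images and fit an actual detector in their place.

```python
def train_camera_models(target_objects, backgrounds_by_camera, synthesize, train):
    """Train one model per camera from the same labeled target objects."""
    models = {}
    for camera_id, background in backgrounds_by_camera.items():
        # Only the background differs per camera; the positives are shared.
        training_image, annotations = synthesize(background, target_objects)
        models[camera_id] = train(training_image, annotations)
    return models


# Stand-in callables (assumptions) so the sketch is runnable end to end.
def fake_synthesize(background, target_objects):
    labels = [obj["label"] for obj in target_objects]
    return (background, len(target_objects)), labels


def fake_train(training_image, annotations):
    return {"trained_on": training_image[0], "labels": annotations}


models = train_camera_models(
    [{"label": "person"}],
    {"camera_1": "lobby background", "camera_2": "parking lot background"},
    fake_synthesize,
    fake_train,
)
```

Each entry in `models` corresponds to one fixed camera, so adding a camera requires only its background image and no new manual labels.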

In summary, the system and method proposed by the present invention can build a corresponding detection model for each scene without manually labeling data for the different scenes. The approach is easy to use, effectively improves detection performance, and reduces the probability of false detection.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field may make minor changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.

100: first fixed camera
199: first specific scene
200: detection subsystem
300: training subsystem
310: target object image
310a~310e: target objects
320: labeling result
320a~320e: labels
400: data subsystem
410: first specific image
420: first background image
420a~420c: objects
510: first training image
IM1: first image
M1: first target object detection model
S210~S250: steps

FIG. 1 is a schematic diagram of a system for training an object detection model according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for training an object detection model according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a target object image and a labeling result according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a first specific image and its corresponding first background image according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of the first training image and the corresponding labeling result, based on FIG. 3 and FIG. 4.

Claims

A method for training an object detection model, comprising: obtaining a target object image and a labeling result, wherein the target object image includes at least one target object, and the labeling result includes a label corresponding to each of the at least one target object; obtaining at least one first specific image of a first specific scene captured by a first fixed camera based on a fixed imaging range; obtaining, according to the at least one first specific image, a first background image corresponding to the at least one first specific image; synthesizing the at least one target object and the first background image into a first training image; training a first target object detection model dedicated to the first fixed camera based on the first training image and the labeling result corresponding to each of the at least one target object; and training, based on the target object image, a second target object detection model dedicated to a second fixed camera, the steps of which include: obtaining at least one second specific image of a second specific scene captured by the second fixed camera; obtaining, according to the at least one second specific image, a second background image corresponding to the at least one second specific image; and synthesizing the at least one target object and the second background image into a second training image.

The method as claimed in claim 1, wherein the target object image only includes the at least one target object.

The method as claimed in claim 1, further comprising: obtaining a first image of the first specific scene captured by the first fixed camera; and detecting, using the first target object detection model, the at least one target object appearing in the first image.

The method as claimed in claim 1, wherein the step of training, based on the target object image, the second target object detection model dedicated to the second fixed camera further includes: training the second target object detection model dedicated to the second fixed camera based on the second training image and the labeling result corresponding to each of the at least one target object.

A system for training an object detection model, comprising: a data subsystem configured to: obtain a target object image and a labeling result, wherein the target object image includes at least one target object, and the labeling result includes a label corresponding to each of the at least one target object; obtain at least one first specific image of a first specific scene captured by a first fixed camera based on a fixed imaging range; obtain, according to the at least one first specific image, a first background image corresponding to the at least one first specific image; and synthesize the at least one target object and the first background image into a first training image; and a training subsystem configured to train a first target object detection model dedicated to the first fixed camera based on the first training image and the labeling result corresponding to each of the at least one target object, and to train, based on the target object image, a second target object detection model dedicated to a second fixed camera, the steps of which include: obtaining at least one second specific image of a second specific scene captured by the second fixed camera; obtaining, according to the at least one second specific image, a second background image corresponding to the at least one second specific image; and synthesizing the at least one target object and the second background image into a second training image.

The system as claimed in claim 5, wherein the target object image only includes the at least one target object.

The system as claimed in claim 5, further comprising a detection subsystem configured to: obtain a first image of the first specific scene captured by the first fixed camera; and detect, using the first target object detection model, the at least one target object appearing in the first image.

The system as claimed in claim 5, wherein the training subsystem is further configured to train the second target object detection model dedicated to the second fixed camera based on the second training image and the labeling result corresponding to each of the at least one target object.