TWI759534B - Activity recognition system based on machine vision for multiple objects - Google Patents
- Publication number
- TWI759534B TW107130040A
- Authority
- TW
- Taiwan
- Prior art keywords
- visual
- activity
- information
- objects
- unit
- Prior art date
Landscapes
- Image Analysis (AREA)
Description
The present invention relates to a system for activity recognition, and in particular to a vision-based activity recognition system.
Vision-based activity recognition systems use visual information to identify the activities it may contain. An activity is not the action of a single object in a single frame; it is constituted by one or more objects and the action relationships they exhibit, individually or among one another. For a single object such as a person, the action may be recognized as running; with multi-object detection, a whole group of people running can be recognized as the activity of a marathon; if detected background objects place the runner outside a prison wall, the activity can be recognized as a prison escape; and if the detected person holds a gun while another detected object, a police officer, runs in pursuit, the activity can be recognized as chasing a fugitive. The prior art mostly recognizes a single action of a specific tracked object rather than detecting activities defined by the relationships between the scene and its objects, and mere action recognition offers limited industrial benefit. Moreover, the prior art's focus on tracking one particular object within a region of the visual data prevents it from reaching activity-level discrimination. Finally, the prior art is largely based on time-ordered video: the video is cut into image sequences, successive images are compared, and the relative relationships of the tracked object's features are locked onto before action recognition is performed. This tracking computation, that is, feature tracking and matching across time-ordered image sequences, consumes substantial computation beyond the recognition operation itself, making it difficult to track large numbers of objects with limited computing resources and creating performance bottlenecks and high resource thresholds.
The present invention therefore proposes an activity recognition system, which can also be applied to action recognition. From the image information it generates a holistic semantic description of every object in the scene and of the relationships among them; with this description, recognition results can be produced not merely at the action level but at the activity level, substantially raising the value and applicability of machine vision to industry. Its fusion analysis is a statistical aggregation of recognition results from static images: it does not necessarily require video, and it requires no additional object-feature tracking between time-ordered image sequences, greatly improving recognition efficiency while retaining the capacity to recognize large numbers of objects in an image, achieving a more efficient, lower-cost, and more capable overall result. For future machine intelligence to recognize relatively complex human activities and take corresponding feedback and action, this is a key development of high industrial value.
After a visual sensing device (such as a camera) receives image information (which may be a single static image), the image information is sent to the object recognition unit. The object recognition unit applies suitable algorithms (such as machine learning or rule-based computation) to determine the type of each object in the frame and its position within the frame.
After processing, the correspondence between objects can be derived (information such as front/back, left/right, above/below, and relative distance ratios), yielding the scene information of the image. Meanwhile, the pose recognition unit may receive the raw image information from the visual sensing device and/or part of the object image information interpreted by the object recognition unit, in order to recognize the pose information of each object. For a human body, suitable algorithms (such as machine learning or rule-based computation) produce a pose interpretation such as standing, lying, sitting, or reclining; for a vehicle, the pose information may be a door being open, a window being rolled down, and so on. The pose recognition unit thus computes the pose state of every object in the visual information. The fusion analysis unit integrates the object type information and/or relative spatial information produced by the object recognition unit with the per-object state information produced by the pose recognition unit. The situation of each object and the relationships among them then become explicit; this is the scene information, which is semantic data. The fusion analysis unit then runs a recognition algorithm (such as machine learning or rule-based computation) on the scene information to produce an activity interpretation result, which can be put to a variety of valuable uses. The system is therefore able to discriminate the entire scene rather than a single action of a single object, yielding activity-level interpretations.
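The fusion step described above can be sketched in a few lines. This is an illustrative sketch only: the `Detection` record, the `relate` function, and the relation vocabulary are assumptions for the example, not structures specified by the patent, which describes only object types, poses, coarse directions, and relative distance ratios.

```python
# Sketch: fusing object types, poses, and pairwise spatial relations
# into semantic (non-image) scene information. All names are invented.
from dataclasses import dataclass
import math

@dataclass
class Detection:
    label: str    # object type from the object recognition unit
    pose: str     # pose state from the pose recognition unit
    cx: float     # box centre, normalised to [0, 1] image coordinates
    cy: float

def relate(a: Detection, b: Detection) -> dict:
    """Describe b relative to a: a coarse direction plus a distance
    expressed as a fraction of the image diagonal."""
    dx, dy = b.cx - a.cx, b.cy - a.cy
    direction = ("right" if dx > 0 else "left") if abs(dx) >= abs(dy) \
                else ("below" if dy > 0 else "above")
    distance = math.hypot(dx, dy) / math.sqrt(2)  # fraction of diagonal
    return {"subject": a.label, "object": b.label,
            "relation": direction, "distance": round(distance, 3)}

def scene_description(detections):
    """Semantic scene information: each object's pose plus every
    pairwise spatial relation between objects."""
    objects = [{"label": d.label, "pose": d.pose} for d in detections]
    relations = [relate(a, b) for i, a in enumerate(detections)
                 for b in detections[i + 1:]]
    return {"objects": objects, "relations": relations}

scene = scene_description([
    Detection("person", "running", 0.30, 0.60),
    Detection("police_officer", "running", 0.10, 0.60),
])
print(scene["relations"][0]["relation"])  # -> left
```

A downstream recognizer would consume only this semantic dictionary, never the pixels, which is what lets the later stages run without image-level tracking.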
After the activity interpretation result is produced, the system may also include an event judgment unit configured to issue a specific event notification when a given activity and/or action occurs. The notification is transmitted to the device or module responsible for the corresponding response, such as raising an alarm, sending a warning text message, or activating a relevant actuator or device. Furthermore, the system does not necessarily require a series of video frames: a single static image suffices for interpretation. If video is available, it can be used to increase the accuracy of activity interpretation, still without tracking-and-matching computation. The fusion analysis unit takes the object type information and/or relative spatial information produced from the time-ordered image sequence, together with each object's state information (that is, the scene information), feeds it into the relevant algorithm once or in batches per image sequence, and tallies the activity interpretation results across sequences. No object-feature position tracking is needed per sequence; simple statistics over the per-sequence results yield the most suitable overall interpretation. For example, if a video is cut into a number of static images whose scene information the fusion analysis unit interprets into the same or a different number of activity results, those results can be aggregated by any of various statistical methods, for instance a simple majority rule: if 100 activity interpretations are produced, 90 of them marathon and 10 parade, the outcome is marathon. The system thus reduces the computational cost of object tracking and reserves computing resources for the recognition computation itself.
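The per-frame tallying outlined above amounts to a majority vote. A minimal sketch, assuming each frame's scene information has already been interpreted into one activity label (the function name and label strings are invented for illustration):

```python
# Majority-rule fusion over per-frame activity interpretations:
# no cross-frame feature tracking, just counting labels.
from collections import Counter

def fuse_activity(frame_labels):
    """Return the most frequent per-frame activity interpretation,
    or None if no interpretations were produced."""
    if not frame_labels:
        return None
    return Counter(frame_labels).most_common(1)[0][0]

# The description's own example: 90 of 100 frames read as a marathon.
labels = ["marathon"] * 90 + ["parade"] * 10
print(fuse_activity(labels))  # -> marathon
```

Because the vote runs on labels rather than pixels, its cost is negligible next to the per-frame recognition itself, which is the efficiency argument the description makes.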
In addition, the system can incorporate other visual sensors (such as infrared or depth cameras) or non-visual sensors (such as gyroscopes mounted on objects), bringing the sensed data into the fusion analysis unit, which applies additional algorithms to it. The fusion analysis unit may also be an independent server and/or cloud system, with the relevant scene data exchanged over wired networks, wireless networks, or other radio-wave or light-wave transmission, for more efficient or more secure computation. It should be understood, however, that while the detailed description and specific examples indicate preferred embodiments of the invention, they are given by way of illustration only, since various changes and modifications within the scope of the invention will become apparent to those skilled in the art from this detailed description. It should therefore be understood that the invention is not limited to the particular components of the devices described or the steps of the methods described, as such devices and methods may vary. It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. It must be noted that, as used in this specification and the appended claims, the articles "a" and "the" are intended to mean that one or more of the elements are present unless the context clearly dictates otherwise. Thus, for example, reference to "a unit" or "the unit" can include several devices. Furthermore, the words "comprise", "include", "contain", and similar wording do not exclude other elements or steps.
In summary, the present invention is a visual activity recognition system based on multiple objects. It recognizes object types and relative spatial relationships, recognizes and aggregates the pose of each object, generates semantic scene information, and thereby performs activity-level recognition, enabling machine intelligence to recognize relatively complex human activities and to take corresponding feedback and action, which is a critical development. These characteristics not only add activity recognition to the industry's existing technology but also broaden the applications achievable with existing image recognition, such as reconnaissance, exploration, security, production monitoring, and home healthcare, giving the invention high industrial applicability. Its fusion analysis not only operates on static images alone but can also produce results from video using simple, non-tracking statistics, which the prior art has not achieved; the invention is therefore novel and inventive. In conclusion, this application is an invention possessing novelty, inventive step, and industrial applicability, meets the requirements for an invention patent, and is accordingly filed in accordance with the law. The foregoing, however, describes only preferred embodiments of the invention and does not limit the scope of the patent application; all equivalent changes or modifications made in accordance with the spirit of the invention shall remain within the scope covered by this patent.
The present invention will now be described more fully with reference to the accompanying drawings, in which presently preferred embodiments of the invention are shown. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; these embodiments are provided for thoroughness and completeness, and to convey the scope of the invention fully to those skilled in the art. Fig. 1 illustrates a visual activity recognition system 10 that can be based on multiple objects. The system includes a digital network camera 110, whose image data are passed over a transmission interface to the object recognition unit 120; the transmission interface is a wired network, a wireless network, a bus, or another light-wave or radio-wave connection. The object recognition unit 120 recognizes the objects in the image information and produces an object recognition data structure such as that of Fig. 2: a graph in the sense of graph theory, whose nodes contain the object types and spatial positions and are connected by edges. Each edge carries attributes expressing the relationship (such as front/back, left/right, above/below) and may carry distance information computed as a proportion of the static image (for example, one hundredth means the two objects are separated by one percent of the image). This data structure, together with the image information, is passed to the fusion analysis unit 140 over a transmission interface of the same kind as that linking the digital network camera 110 and the object recognition unit 120. The pose recognition unit 130 receives the image information from the digital network camera 110 and/or the object recognition data structure from the object recognition unit 120, computes the pose information of each object, and passes it over a transmission interface to the fusion analysis unit 140. After the fusion analysis unit 140 integrates the data structure of Fig. 2 with the pose information of each object, the scene information is produced, which is semantic rather than image data. The scene information can then be recognized with relevant algorithms (semantic algorithms such as seq2seq or RNNs) to obtain the activity recognition result. This result is transmitted over a transmission interface to the event judgment unit 20, which holds a rule database; if the recognition result satisfies a condition in the rule database for triggering an event, the event judgment unit 20 initiates the corresponding action.
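The Fig. 2 graph and the rule database of event judgment unit 20 can be sketched as plain data structures. The dictionary layout and rule format below are assumptions for illustration; the patent specifies only nodes holding object type and position, attributed edges carrying a direction and a distance expressed as a fraction of the image, and a rule database that triggers an action when an interpreted activity matches a condition.

```python
# Illustrative encoding of the object-recognition graph of Fig. 2.
# Nodes carry object types; edges carry a relationship attribute and a
# distance as a fraction of the image (0.01 = 1% of the image).
scene_graph = {
    "nodes": {0: {"type": "person"}, 1: {"type": "prison_wall"}},
    "edges": [{"from": 0, "to": 1, "relation": "in_front_of",
               "distance": 0.01}],
}

# Illustrative rule database for event judgment unit 20, mapping a
# recognized activity to the corresponding action to initiate.
rule_db = {
    "prison_escape": "raise_alarm",
    "fugitive_pursuit": "notify_police",
}

def judge_event(activity: str):
    """Return the configured action if the activity triggers a rule,
    otherwise None (no event notification is issued)."""
    return rule_db.get(activity)

print(judge_event("prison_escape"))  # -> raise_alarm
```

In a deployment, `judge_event` would sit between the fusion analysis unit's output and the actuator or notification module, so that only rule-matching activities produce event notifications.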
[Fig. 1] Architecture diagram of a visual activity recognition system according to the present invention.
[Fig. 2] Example of the object-type and spatial-relationship data structure produced by the object recognition unit of the present invention.
10‧‧‧Visual activity recognition system
110‧‧‧Visual sensing device (such as a digital network camera)
120‧‧‧Object recognition unit
130‧‧‧Pose recognition unit
140‧‧‧Fusion analysis unit
20‧‧‧Event judgment unit
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW107130040A TWI759534B (en) | 2018-08-28 | 2018-08-28 | Activity recognition system based on machine vision for multiple objects |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202009872A TW202009872A (en) | 2020-03-01 |
TWI759534B true TWI759534B (en) | 2022-04-01 |
Family
ID=70766725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW107130040A TWI759534B (en) | 2018-08-28 | 2018-08-28 | Activity recognition system based on machine vision for multiple objects |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI759534B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW200841737A (en) * | 2006-09-22 | 2008-10-16 | Objectvideo Inc | Video analytics for banking business process monitoring |
CN107911653A (en) * | 2017-11-16 | 2018-04-13 | 王磊 | The module of intelligent video monitoring in institute, system, method and storage medium |
US9977572B2 (en) * | 2014-04-01 | 2018-05-22 | Hallmark Cards, Incorporated | Augmented reality appearance enhancement |
- 2018-08-28: Application TW107130040A filed in TW; granted as TWI759534B (active)
Also Published As
Publication number | Publication date |
---|---|
TW202009872A (en) | 2020-03-01 |