TW201828158A

TW201828158A - Method of video object tracking and apparatus thereof

Info

Publication number: TW201828158A
Application number: TW107101732A
Authority: TW
Inventors: 余三思
Original assignee: 大陸商騰訊科技（深圳）有限公司
Priority date: 2017-01-17
Filing date: 2018-01-17
Publication date: 2018-08-01
Also published as: CN106845385A; WO2018133666A1; TWI677825B

Abstract

A method of video object tracking and an apparatus are provided. The method comprises following steps. Video streams are acquired. The face area is recognized according to the face detection algorithm, and a first to-be-tracked target corresponding to the first video frame is obtained. The first to-be-tracked target is extracted through the face features to obtain the first face features, and the first face features are stored in a feature library corresponding to the target to be tracked. The face region is recognized in the current video frame for obtaining a current to-be-tracked target corresponding to the current video frame. The current to-be-tracked target is extract through the face features to obtain second face features. The current to-be-tracked target is matched with the first to-be-tracked target according to the second face feature and the feature library, so that the first to-be-tracked target is tracked from the first video frame. The feature library is updated according to the extracted updated face features in the tracking process.

Description

Video target tracking method and device

本申請涉及電腦技術領域，特別是涉及一種視頻目標跟蹤方法和裝置。本申請要求於2017年1月17日提交中國專利局、申請號為201710032132.6，發明名稱為“視頻目標跟蹤的方法和裝置”的中國專利申請的優先權，其全部內容通過引用結合在本申請中。 The present application relates to the field of computer technology, and in particular, to a video target tracking method and device. This application claims priority from a Chinese patent application filed on January 17, 2017, with application number 201710032132.6, and the invention name is "Video Target Tracking Method and Device", the entire contents of which are incorporated herein by reference. .

目標跟蹤技術一直以來都是電腦視覺和影像處理領域的重點之一，被廣泛應用在智慧監控、智慧交通、視覺導航、人機交互、國防偵察等領域。 Target tracking technology has always been one of the focuses in the field of computer vision and image processing, and is widely used in intelligent monitoring, intelligent transportation, visual navigation, human-machine interaction, and defense reconnaissance.

目標跟蹤演算法通常使用一種或數種簡單的傳統特徵匹配演算法來區分目標，如利用圖像本身的顏色、形狀等特徵。 Target tracking algorithms usually use one or several simple traditional feature matching algorithms to distinguish targets, such as using the color, shape and other features of the image itself.

本申請實施例提供一種視頻目標跟蹤方法和裝置，能夠提高跟蹤的連續性和穩健(robust)性。 The embodiments of the present application provide a video target tracking method and device, which can improve tracking continuity and robustness.

本申請實施例提供一種視頻目標跟蹤的方法，應用於終端或伺服器，所述方法包括：獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標；對所述第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入所述第一待跟蹤目標對應的特徵庫；在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標，對所述當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵，根據所述第二人臉特徵和所述特徵庫將所述當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從所述第一視頻幀開始跟蹤所述第一待跟蹤目標，在跟蹤過程中根據提取的更新的人臉特徵更新所述特徵庫。 An embodiment of the present application provides a method for video target tracking, which is applied to a terminal or a server. The method includes: acquiring a video stream, identifying a face area according to a face detection algorithm, and obtaining a first to-be-tracked corresponding to a first video frame. A target; a first facial feature is obtained by extracting a facial feature based on a deep neural network on the first target to be tracked, and the first facial feature is stored in a feature library corresponding to the first target to be tracked ; Identify the face area according to the face detection algorithm in the current video frame, obtain the current target to be tracked corresponding to the current video frame, and obtain a second face from the deep neural network-based face feature extraction for the current target to be tracked Feature, matching the current target to be tracked with the first target to be tracked according to the second face feature and the feature library to track the first target to be tracked from the first video frame, During the tracking process, the feature library is updated according to the extracted updated face features.

本申請實施例還提供一種視頻目標跟蹤裝置，所述裝置包括：處理器以及與所述處理器相連接的記憶體，所述記憶體中儲存有可由所述處理器執行的機器可讀指令模組；所述機器可讀指令模組包括：檢測模組，用於獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標；人臉特徵提取模組，用於對所述第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入所述第一待跟蹤目標對應的特徵庫；所述檢測模組還用於在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標；所述人臉特徵提取模組還用於對所述當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵；跟蹤模組，用於根據所述第二人臉特徵和所述特徵庫將所述當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從所述第一視頻幀開始跟蹤所述第一待跟蹤目標；學習模組，用於在跟蹤過程中根據提取的更新的人臉特徵更新所述特徵庫。 An embodiment of the present application further provides a video target tracking device. The device includes a processor and a memory connected to the processor. The memory stores a machine-readable instruction module executable by the processor. The machine-readable instruction module includes: a detection module for obtaining a video stream, identifying a face area according to a face detection algorithm, and obtaining a first target to be tracked corresponding to a first video frame; and extracting facial features A module for extracting a first facial feature from the first neural network-based facial feature extraction for the first target to be tracked, and storing the first facial feature into the first target to be tracked Feature library; the detection module is further configured to identify a face area according to a face detection algorithm in the current video frame to obtain a current target to be tracked corresponding to the current video frame; the face feature extraction module is further configured to A second facial feature is obtained by extracting a facial feature based on a deep neural network from the current target to be tracked; a tracking module is configured to convert the target according to the second facial feature and the feature database. Feature matching between the former target to be tracked and the first target to be tracked to track the first target to be tracked from the first video frame; a learning module for tracking the updated facial features based on the extracted features Update the feature library.

本申請實施例還提供一種非易失性電腦可讀儲存介質，所述儲存介質中儲存有機器可讀指令，所述機器可讀指令可以由處理器執行以完成以下操作：獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標；對所述第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入所述第一待跟蹤目標對應的特徵庫；在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標，對所述當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵，根據所述第二人臉特徵和所述特徵庫將所述當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從所述第一視頻幀開始跟蹤所述第一待跟蹤目標，在跟蹤過程中根據提取的更新的人臉特徵更新所述特徵庫。 An embodiment of the present application further provides a non-volatile computer-readable storage medium. The storage medium stores machine-readable instructions, and the machine-readable instructions can be executed by a processor to complete the following operations: acquiring a video stream, according to The face detection algorithm recognizes a face region, and obtains a first target to be tracked corresponding to the first video frame. The first target to be tracked is obtained through facial feature extraction based on a deep neural network, and Storing the first face feature into a feature library corresponding to the first target to be tracked; identifying a face area according to a face detection algorithm at the current video frame, obtaining the current target to be tracked corresponding to the current video frame, The current target to be tracked obtains a second face feature through facial feature extraction based on a deep neural network, and the current target to be tracked and the first target to be tracked are performed according to the second face feature and the feature library. Feature matching to track the first target to be tracked from the first video frame, and update the features according to the extracted updated face features during the tracking process Library.

110‧‧‧終端 110‧‧‧Terminal

120‧‧‧伺服器 120‧‧‧Server

130‧‧‧視頻採集裝置 130‧‧‧video capture device

140‧‧‧網路 140‧‧‧Internet

1101‧‧‧系統匯流排 1101‧‧‧System Bus

1102‧‧‧處理器 1102‧‧‧Processor

1103‧‧‧圖形處理單元 1103‧‧‧Graphics Processing Unit

1104‧‧‧儲存介質 1104‧‧‧Storage media

1105‧‧‧記憶體 1105‧‧‧Memory

1106‧‧‧網路介面 1106‧‧‧Interface

1107‧‧‧顯示螢幕 1107‧‧‧display

1108‧‧‧輸入裝置 1108‧‧‧Input device

11041‧‧‧作業系統 11041‧‧‧Operating System

11042‧‧‧第一視頻目標跟蹤裝置 11042‧‧‧The first video target tracking device

1201‧‧‧系統匯流排 1201‧‧‧System Bus

1202‧‧‧處理器 1202‧‧‧Processor

1203‧‧‧儲存介質 1203‧‧‧Storage media

1204‧‧‧記憶體 1204‧‧‧Memory

1205‧‧‧網路介面 1205‧‧‧Interface

12031‧‧‧作業系統 12031‧‧‧ Operating System

12032‧‧‧資料庫 12032‧‧‧Database

12033‧‧‧第二視頻目標跟蹤裝置 12033‧‧‧Second video target tracking device

310‧‧‧跟蹤模組 310‧‧‧Tracking Module

320‧‧‧檢測模組 320‧‧‧Detection Module

330‧‧‧學習模組 330‧‧‧Learning Module

410‧‧‧檢測模組 410‧‧‧testing module

411‧‧‧圖像特徵提取單元 411‧‧‧Image Feature Extraction Unit

412‧‧‧身份匹配單元 412‧‧‧Identity matching unit

413‧‧‧第一跟蹤目標確定單元 413‧‧‧First tracking target determination unit

414‧‧‧第一推薦單元 414‧‧‧The first recommended unit

415‧‧‧第二推薦單元 415‧‧‧Second recommended unit

416‧‧‧第二跟蹤目標確定單元 416‧‧‧Second tracking target determination unit

420‧‧‧人臉特徵提取模組 420‧‧‧Face feature extraction module

430‧‧‧跟蹤模組 430‧‧‧Tracking Module

440‧‧‧學習模組 440‧‧‧Learning Module

450‧‧‧特徵身份處理模組 450‧‧‧ Feature Identity Processing Module

510‧‧‧處理器 510‧‧‧ processor

520‧‧‧記憶體 520‧‧‧Memory

521‧‧‧檢測模組 521‧‧‧Detection Module

522‧‧‧人臉特徵提取模組 522‧‧‧Face feature extraction module

523‧‧‧跟蹤模組 523‧‧‧Tracking Module

524‧‧‧學習模組 524‧‧‧Learning Module

525‧‧‧特徵身份處理模組 525‧‧‧ Feature Identity Processing Module

530‧‧‧介面 530‧‧‧Interface

AdaBoost‧‧‧疊代演算法 AdaBoost‧‧‧ Iterative Algorithm

conv‧‧‧卷積層 conv‧‧‧ convolution layer

FC‧‧‧表示完全連接層 FC‧‧‧ means fully connected layer

I(x,y,t)‧‧‧圖元 I (x, y, t) ‧‧‧ primitive

(dx,dy)‧‧‧距離 (dx, dy) ‧‧‧distance

dt‧‧‧時間 dt‧‧‧time

LRN‧‧‧區域響應歸一化層 LRN‧‧‧Regional Response Normalization Layer

max pool‧‧‧最大池化層 max pool‧‧‧Maximum pooling layer

NPD‧‧‧歸一化的圖元差異特徵 NPD‧‧‧normalized feature differences

ROM‧‧‧唯讀儲存記憶體 ROM‧‧‧Read Only Memory

RAM‧‧‧隨機儲存記憶體 RAM‧‧‧ random storage memory

S210、S220、S230‧‧‧步驟 S210, S220, S230‧‧‧ steps

S231、S232、S233、S234、S235、S236、S237‧‧‧步驟 S231, S232, S233, S234, S235, S236, S237‧‧‧ steps

TLD‧‧‧單目標長時間跟蹤 TLD‧‧‧ single target long-term tracking

VGG‧‧‧視覺化幾何群 VGG‧‧‧Visualized Geometric Group

VGG-S‧‧‧人臉特徵提取演算法 VGG-S‧‧‧Face Feature Extraction Algorithm

為了更清楚地說明本發明實施例中的技術方案，下面將對實施例描述中所需要使用的附圖作簡單地介紹：圖1為本申請一個實施例中視頻目標跟蹤方法的應用環境圖；圖2為本申請一個實施例中圖1中終端的內部結構圖；圖3為本申請一個實施例中圖1中伺服器的內部結構圖；圖4為本申請一個實施例中視頻目標跟蹤方法的流程圖；圖5為本申請一個實施例中得到當前待跟蹤目標的流程圖；圖6為本申請一個實施例中更新特徵庫的流程圖；圖7為本申請一個實施例中視頻目標跟蹤演算法與範本匹配演算法匹配對比示意圖；圖8為本申請一個實施例中得到當前待跟蹤目標的另一流程圖；圖9為本申請一個實施例中視頻目標跟蹤方法對應的目標跟蹤系統示意圖；圖10為本申請一個實施例中視頻目標跟蹤演算法得到的視頻跟蹤結果示意圖；圖11為本申請一個實施例中TLD跟蹤演算法得到的視頻跟蹤結果示意圖；圖12為本申請一個實施例中視頻目標跟蹤裝置的結構示意圖；圖13為本申請一個實施例中視頻目標跟蹤裝置的另一結構示意圖；圖14為本申請一個實施例中視頻目標跟蹤裝置的另一結構示意圖；圖15為本申請一個實施例中視頻目標跟蹤裝置的另一結構示意圖；圖16為本申請一個實施例中視頻目標跟蹤裝置的另一結構示意圖。 In order to explain the technical solution in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below: FIG. 1 is an application environment diagram of a video target tracking method in an embodiment of the present application; FIG. 2 is an internal structure diagram of the terminal in FIG. 1 in an embodiment of the present application; FIG. 3 is an internal structure diagram of the server in FIG. 1 in an embodiment of the present application; FIG. 4 is a video target tracking method in an embodiment of the present application Fig. 5 is a flowchart of obtaining a current target to be tracked in one embodiment of the application; Fig. 6 is a flowchart of updating a feature database in one embodiment of the application; Fig. 7 is a video target tracking in one embodiment of the application Schematic comparison between algorithm and template matching algorithm; Figure 8 is another flowchart of obtaining the current target to be tracked in one embodiment of the application; Figure 9 is a schematic diagram of the target tracking system corresponding to the video target tracking method in one embodiment of the application ; FIG. 10 is a schematic diagram of a video tracking result obtained by a video target tracking algorithm in an embodiment of the present application; FIG. 11 is a TLD following Schematic diagram of video tracking results obtained by the tracking algorithm; FIG. 12 is a schematic diagram of the structure of a video target tracking device in an embodiment of the application; FIG. 13 is another schematic diagram of the structure of a video target tracking device in an embodiment of the application; Another schematic diagram of the structure of the video target tracking device in one embodiment of the application; FIG. 15 is another schematic diagram of the structure of the video target tracking device in one embodiment of the application; FIG. 16 is another schematic diagram of the video target tracking device in one embodiment of the application Schematic.

請參照圖式，其中相同的元件符號代表相同的元件或是相似的元件，本發明的原理是以實施在適當的運算環境中來舉例說明。以下的說明是基於所例示的本發明具體實施例，其不應被視為限制本發明未在此詳述的其它具體實施例。 Please refer to the drawings, wherein the same component symbols represent the same components or similar components. The principle of the present invention is exemplified by being implemented in a suitable computing environment. The following description is based on the exemplified specific embodiments of the present invention, which should not be construed as limiting other specific embodiments of the present invention which are not described in detail herein.

圖1為本申請一個實施例中視頻目標跟蹤方法運行的應用環境圖。如圖1所示，該應用環境包括終端110、伺服器120、以及視頻採集裝置130，其中，終端110、伺服器120、視頻採集裝置130通過網路140進行通信。 FIG. 1 is an application environment diagram of a video target tracking method running in an embodiment of the present application. As shown in FIG. 1, the application environment includes a terminal 110, a server 120, and a video capture device 130. The terminal 110, the server 120, and the video capture device 130 communicate through a network 140.

在本申請一些實施例中，終端110可為智慧手機、平板電腦、筆記型電腦、臺式電腦等，但並不局限於此。視頻採集裝置130可為攝像頭，佈置在建築物入口處等位置。網路140可以是有線網路也可以是無線網路。在本申請一些實施例中，視頻採集裝置130可將採集的視頻流發送至終端110或伺服器120，終端110或伺服器120可對視頻流進行目標跟蹤。在本申請另一些實施例中，視頻採集裝置130也可直接對視頻流進行目標跟蹤，並將跟蹤結果發送至終端110進行顯示。 In some embodiments of the present application, the terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc., but is not limited thereto. The video capture device 130 may be a camera, which is arranged at a position such as an entrance of a building. The network 140 may be a wired network or a wireless network. In some embodiments of the present application, the video capture device 130 may send the collected video stream to the terminal 110 or the server 120, and the terminal 110 or the server 120 may perform target tracking on the video stream. In other embodiments of the present application, the video capture device 130 may also directly perform target tracking on the video stream, and send the tracking result to the terminal 110 for display.

在本申請一個實施例中，圖1中的終端110的內部結構如圖2所示，該終端110包括通過系統匯流排1101連接的處理器1102、圖形處理單元1103、儲存介質1104、記憶體1105、網路介面1106、顯示螢幕1107和輸入裝置1108。其中，終端110的儲存介質1104儲存有作業系統11041以及第一視頻目標跟蹤裝置11042，該裝置11042用於實現一種適用於終端110的視頻目標跟蹤方法。處理器1102用於提供計算和控制能力，支撐整個終端110的運行。終端110中的圖形處理單元1103用於至少提供顯示介面的繪製能力。記憶體1105為儲存介質1104中的第一視頻目標跟蹤裝置11042的運行提供環境。網路介面1106用於與視頻採集裝置130進行網路通信，如接收視頻採集裝置130採集的視頻流等。顯示螢幕1107用於顯示跟蹤結果等。輸入裝置1108用於接收使用者輸入的命令或資料等。對於帶觸控式的終端110，顯示螢幕1107和輸入裝置1108可為觸控式。圖2中示出的結構，僅僅是與本申請方案相關的部分結構的框圖，並不構成對本申請方案所應用於其上的終端110的限定，具體的終端110可以包括比圖2中所示更多或更少的部件，或者組合某些部件，或者具有不同的部件佈置。 In an embodiment of the present application, the internal structure of the terminal 110 in FIG. 1 is shown in FIG. 2. The terminal 110 includes a processor 1102, a graphics processing unit 1103, a storage medium 1104, and a memory 1105 connected through a system bus 1101. , Network interface 1106, display screen 1107, and input device 1108. The storage medium 1104 of the terminal 110 stores an operating system 11041 and a first video target tracking device 11042. The device 11042 is configured to implement a video target tracking method suitable for the terminal 110. The processor 1102 is used to provide computing and control capabilities to support the operation of the entire terminal 110. The graphics processing unit 1103 in the terminal 110 is configured to provide at least a drawing capability of a display interface. The memory 1105 provides an environment for the operation of the first video object tracking device 11042 in the storage medium 1104. The network interface 1106 is configured to perform network communication with the video capture device 130, such as receiving a video stream collected by the video capture device 130. A display screen 1107 is used to display tracking results and the like. The input device 1108 is used to receive commands or data input by a user. For the touch-sensitive terminal 110, the display screen 1107 and the input device 1108 may be touch-sensitive. The structure shown in FIG. 2 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the terminal 110 to which the solution of the present application is applied. The specific terminal 110 may include Show more or fewer components, or combine certain components, or have different component arrangements.

在本申請一個實施例中，圖1中伺服器120的內部結構如圖3 所示，該伺服器120包括通過系統匯流排1201連接的處理器1202、儲存介質1203、記憶體1204和網路介面1205。其中，該伺服器120的儲存介質1203儲存有作業系統12031、資料庫12032、第二視頻目標跟蹤裝置12033。資料庫12032用於儲存資料。第二視頻目標跟蹤裝置12033用於實現一種適用於伺服器120的視頻目標跟蹤方法。該伺服器120的處理器1202用於提供計算和控制能力，支撐整個伺服器120的運行。該伺服器120的記憶體1204為儲存介質1203中的第二視頻目標跟蹤裝置12033的運行提供環境。該伺服器120的網路介面1205用於與外部的視頻採集裝置130通過網路連接通信，比如接收視頻採集裝置130發送的視頻流等。 In an embodiment of the present application, the internal structure of the server 120 in FIG. 1 is shown in FIG. 3. The server 120 includes a processor 1202, a storage medium 1203, a memory 1204, and a network interface connected through a system bus 1201. 1205. The storage medium 1203 of the server 120 stores an operating system 12031, a database 12032, and a second video target tracking device 12033. The database 12032 is used for storing data. The second video target tracking device 12033 is configured to implement a video target tracking method suitable for the server 120. The processor 1202 of the server 120 is used to provide computing and control capabilities to support the operation of the entire server 120. The memory 1204 of the server 120 provides an environment for the operation of the second video target tracking device 12033 in the storage medium 1203. The network interface 1205 of the server 120 is used to communicate with an external video capture device 130 through a network connection, such as receiving a video stream sent by the video capture device 130.

如圖4所示，在本申請一個實施例中，提供了一種視頻目標跟蹤方法，其應用于上述應用環境中的終端110、伺服器120或視頻採集裝置130，該方法可由本申請任一實施例提供的視頻目標跟蹤裝置執行，包括如下步驟： As shown in FIG. 4, in an embodiment of the present application, a video target tracking method is provided, which is applied to the terminal 110, the server 120, or the video capture device 130 in the above application environment, and the method may be implemented by any one of the present application. The video target tracking device provided by the example is executed and includes the following steps:

步驟S210，獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標。 Step S210: Obtain a video stream, identify a face region according to a face detection algorithm, and obtain a first target to be tracked corresponding to a first video frame.

具體的，視頻流可由分佈在建築物入口處的視頻採集裝置採集得到。如果視頻目標跟蹤方法應用於視頻採集裝置，則可直接從視頻採集裝置的記憶體中獲得到視頻流。如果視頻目標跟蹤方法應用於終端或伺服器，則視頻採集裝置可即時將採集到的視頻流發送至終端或伺服器。 Specifically, the video stream can be acquired by a video acquisition device distributed at the entrance of the building. If the video target tracking method is applied to a video capture device, the video stream can be obtained directly from the memory of the video capture device. If the video target tracking method is applied to a terminal or a server, the video capture device can immediately send the captured video stream to the terminal or the server.

人臉檢測是指對於任意一幅給定的圖像，採用一定的策略對其進行搜索以確定其中是否含有人臉，如果是，則返回人臉的位置、大小和姿態。在本申請一些實施例中，可通過推薦框的方式顯示人臉區域(如圖10中所示的矩形框)，得到第一視頻幀對應的第一待跟蹤目標。通過不斷地對視頻流進行人臉檢測，直到檢測到有人臉出現，將人臉區域確定為第一待跟蹤目標。由於一幀中可能檢測到多個人臉，故第一待跟蹤目標可能為多個。如果有多個第一待跟蹤目標，則可通過不同的標識資訊標識不同的人臉區域，如通過不同顏色的推薦框標識不同的人臉區域。人臉檢測演算法可根據需要自訂，如採用NPD(Normalized Pixel Difference，歸一化的圖元差異特徵)人臉檢測演算法，或將NPD人臉檢測演算法與其它演算法結合以提高確定待跟蹤目標的準確性。 Face detection refers to searching for a given image using a certain strategy to determine whether it contains a face, and if so, returning the position, size and pose of the face. In some embodiments of the present application, a face area (such as a rectangular frame shown in FIG. 10) may be displayed in the form of a recommended frame to obtain a first target to be tracked corresponding to the first video frame. Face detection is performed on the video stream until a human face is detected, and the face area is determined as the first target to be tracked. Since multiple faces may be detected in one frame, there may be multiple first targets to be tracked. If there are multiple first targets to be tracked, different face areas may be identified through different identification information, for example, different face areas may be identified through different color recommendation frames. Face detection algorithms can be customized as needed, such as using NPD (Normalized Pixel Difference) face detection algorithms, or combining NPD face detection algorithms with other algorithms to improve determination The accuracy of the target to be tracked.

步驟S220，對第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入第一待跟蹤目標對應的特徵庫。 In step S220, the first facial feature is extracted from the facial feature based on the deep neural network for the first target to be tracked, and the first facial feature is stored in a feature database corresponding to the first target to be tracked.

具體的，深度神經網路是一種深度學習下的機器學習模型。深度學習是機器學習的分支，是使用包含複雜結構或由多重非線性變換構成的多個處理層對資料進行高層抽象的演算法。深度神經網路可採用視覺化幾何群(VGG,Visual Geometry Group)網路結構，通過VGG網路結構比通過特徵匹配演算法進行區分目標的召回率和準確率高。 Specifically, the deep neural network is a machine learning model under deep learning. Deep learning is a branch of machine learning. It is a high-level abstraction algorithm that uses multiple processing layers that contain complex structures or consist of multiple non-linear transformations. Deep neural networks can use the Visual Geometry Group (VGG) network structure. The VGG network structure has a higher recall and accuracy rate than distinguishing targets by feature matching algorithms.

為第一待跟蹤目標分配一個目標標識並建立特徵庫，為所述目標標識和特徵庫建立關聯關係並保存所述關聯關係。當第一待跟蹤目標為多個時，可為每個第一待跟蹤目標分配目標標識並建立特徵庫，為每個第一待跟蹤目標和其對應的第一人臉特徵建立關聯關係，將所述關聯關係以及第一人臉特徵儲存至該第一待跟蹤目標對應的特徵庫。通過引用人臉特徵進行特徵匹配，可解決目標跟蹤演算法由於沒有較好地利用人臉特徵，故而頻繁出現跟錯、跟偏和跟丟後無法重新正確找回跟蹤目標的問題。 Assign a target identifier to the first target to be tracked and establish a feature database, establish an association relationship between the target identifier and the feature database, and save the association relationship. When there are multiple first to-be-tracked targets, a target identifier can be assigned to each first to-be-tracked target and a feature database can be established, and an association relationship can be established between each first to-be-tracked target and its corresponding first face feature. The association relationship and the first face feature are stored in a feature library corresponding to the first target to be tracked. By referencing face features for feature matching, the problem that the target tracking algorithm does not make good use of the face features, so frequent tracking errors, follow-up errors, and follow-up losses cannot be retrieved correctly after the tracking target is solved.

步驟S230，在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標，對當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵，根據第二人臉特徵和特徵庫將當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從第一視頻幀開始跟蹤第一待跟蹤目標，在跟蹤過程中根據提取的更新的人臉特徵更新特徵庫。 Step S230: Identify the face area according to the face detection algorithm in the current video frame, obtain the current target to be tracked corresponding to the current video frame, and obtain a second face by extracting facial features based on the deep neural network for the current target to be tracked. Feature, matching the current target to be tracked with the first target to be tracked according to the second face feature and the feature database to track the first target to be tracked from the first video frame, and according to the extracted updated person during the tracking process Face features update feature library.

具體的，將第二人臉特徵與特徵庫中第一待跟蹤目標對應的各個第一人臉特徵進行特徵匹配。特徵匹配的具體演算法可自訂，如可直接計算人臉特徵對應的向量的歐式距離，根據歐式距離判斷是否能匹配成功。如果第二人臉特徵與第一人臉特徵匹配成功，則確定當前待跟蹤目標為第一待跟蹤目標的連續運動目標。如果當前待跟蹤目標有多個，則每個當前待跟蹤目標群組成當前待跟蹤目標集合，將當前待跟蹤目標集合中的各個當前待跟蹤目標對應的第二人臉特徵分別與特徵庫中各個歷史待跟蹤目標對應的人臉特徵進行匹配，如果匹配成功，則將歷史待跟蹤目標的目標標識作為當前待跟蹤目標的目標標識，當前待跟蹤目標的位置為歷史待跟蹤目標運動後的位置。 Specifically, feature matching is performed on the second face feature and each first face feature corresponding to the first target to be tracked in the feature database. The specific algorithm for feature matching can be customized. For example, the Euclidean distance of the vector corresponding to the facial features can be directly calculated, and whether the matching can be determined based on the Euclidean distance. If the second facial feature matches the first facial feature successfully, it is determined that the current target to be tracked is a continuous moving target of the first target to be tracked. If there are multiple current targets to be tracked, each current target group to be tracked constitutes the current target set to be tracked, and the second face features corresponding to each current target to be tracked in the current target set to be tracked are respectively stored in the feature database. The facial features corresponding to each historical target to be tracked are matched. If the matching is successful, the target identifier of the historical target to be tracked is used as the target identifier of the current target to be tracked, and the position of the current target to be tracked is the position of the historical target to be tracked after movement. .

在本申請一些實施例中，可在跟蹤過程中根據提取的更新的人臉特徵更新特徵庫，如在光照連續變化或側臉時，會得到第一待跟蹤目標在其它幀的更新的人臉特徵，如果該更新的人臉特徵與第一人臉特徵存在差異，可將存在差異的更新的人臉特徵加入第一待跟蹤目標對應的特徵庫，並為該更新的人臉特徵與第一待跟蹤目標的目標標識建立關聯關係，並將所述關聯關係儲存在特徵庫中，從而在第一待跟蹤目標在其它幀中存在更大角度的側臉或更大光強的光線變化時，可將當前待跟蹤目標對應的第二人臉特徵與第一待跟蹤目標的更新的人臉特徵進行特徵匹配，比直接與第一人臉特徵進行特徵匹配時的差異小，從而加大特徵匹配成功的概率，減小目標跟蹤過程對跟蹤目標的變化、傾斜、遮蓋、光照變化的敏感度，提高跟蹤的連續性和穩健(robust)性。且通過特徵庫可保存大量第一待跟蹤目標在不同幀對應的人臉特徵，在第一待跟蹤目標消失後又出現的情況下，可利用第一待跟蹤目標對應的特徵庫中之前已保存的第一待跟蹤目標消失前的人臉特徵進行特徵匹配，從而對間斷出現的目標達到良好的跟蹤效果。更新特徵庫是通過跟蹤和檢測來更新一個正負樣本庫，相當於一個半線上的跟蹤演算法，相比于完全離線的跟蹤演算法有更好的召回率，相比于完全線上的跟蹤演算法則能表現出更高的準確率。 In some embodiments of the present application, the feature database may be updated during the tracking process based on the extracted updated face features. For example, when the lighting continuously changes or the face is updated, updated faces of the first target to be tracked in other frames may be obtained. Feature, if the updated face feature is different from the first face feature, the updated face feature with the difference may be added to a feature library corresponding to the first target to be tracked, and the updated face feature and the first face feature may be different. Establish an association relationship between the target identifiers of the target to be tracked, and store the association relationship in a feature database, so that when the first target to be tracked has a larger-angle side face or a larger light intensity change in other frames, Feature matching between the second face feature corresponding to the current target to be tracked and the updated face feature of the first target to be tracked is smaller than when the feature is directly matched with the first face feature, thereby increasing feature matching The probability of success reduces the sensitivity of the target tracking process to changes in the tracking target, tilt, cover, and light changes, and improves the continuity and robustness of the tracking. Moreover, a large number of facial features corresponding to the first target to be tracked in different frames can be saved through the feature database. In the case that the first target to be tracked disappears and then appears, the feature library corresponding to the first target to be tracked can be used to save Feature matching is performed on the facial features before the first target to be tracked disappears, so as to achieve a good tracking effect on the intermittently appearing targets. Updating the feature database is to update a positive and negative sample database through tracking and detection, which is equivalent to a half-line tracking algorithm. Compared with a completely offline tracking algorithm, it has a better recall rate. Compared to a completely online tracking algorithm, Can show higher accuracy.

在本申請實施例中，通過獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標，對第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，將所述第一人臉特徵加入特徵庫，在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標，對當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵，根據第二人臉特徵和所述特徵庫將當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從第一視頻幀開始跟蹤第一待跟蹤目標，在跟蹤過程中根據提取的更新的人臉特徵更新特徵庫，通過引用基於深度神經網路的人臉特徵進行特徵匹配，可解決目標跟蹤演算法由於沒有較好地利用人臉特徵，頻繁出現跟錯、跟偏和跟丟後無法重新正確找回跟蹤目標的問題，從而節省了終端或伺服器設備的資源，提升了終端或伺服器的處理器的處理速度。同時，特徵庫在跟蹤過程中不斷更新，可保存待跟蹤目標在不同狀態下對應的不同人臉特徵，從而提高人臉特徵匹配的成功率，減小目標跟蹤過程對跟蹤目標的變化、傾斜、遮蓋、光照變化的敏感度，提高跟蹤的連續性和穩健(robust)性，進而提升了終端或伺服器的處理器的處理速度。 In the embodiment of the present application, by obtaining a video stream and identifying a face region according to a face detection algorithm, a first target to be tracked corresponding to a first video frame is obtained, and a person based on a deep neural network is used to pass the first target to the tracked person. Face features are extracted to obtain the first face feature, and the first face feature is added to the feature database. The face area is identified according to the face detection algorithm in the current video frame, and the current target to be tracked corresponding to the current video frame is obtained. The target to be tracked obtains a second face feature through facial feature extraction based on a deep neural network, and the current target to be tracked is matched with the first target to be tracked according to the second face feature and the feature library, so that A video frame starts to track the first target to be tracked. During the tracking process, the feature database is updated based on the extracted updated face features, and feature matching is performed by referring to deep neural network-based face features. Make good use of facial features, frequently failing to retrieve the tracking target correctly after following errors, follow-ups, and follow-ups. It saves the resources of the terminal or server equipment, and improves the processing speed of the processor of the terminal or server. At the same time, the feature database is continuously updated during the tracking process, which can save different facial features corresponding to the target to be tracked in different states, thereby improving the success rate of facial feature matching, reducing the changes, tilt, The sensitivity of cover and light changes improves the continuity and robustness of the tracking, thereby increasing the processing speed of the processor of the terminal or server.

在本申請一個實施例中，上述方法還包括：根據每個待跟蹤目標的人臉狀態通過人臉識別演算法識別得到每個待跟蹤目標對應的人臉身份資訊，通過圖像特徵提取演算法得到人臉身份資訊對應的目標特徵。 In an embodiment of the present application, the method further includes: identifying the facial identity information corresponding to each target to be tracked through a face recognition algorithm according to the face state of each target to be tracked, and extracting the algorithm through image features The target features corresponding to the facial identity information are obtained.

在本申請一些實施例中，人臉狀態是指人臉的偏轉角度狀態。當檢測到人臉為正臉時，可通過人臉識別演算法識別得到對應的人臉身份資訊。人臉身份資訊用於描述人臉對應的身份。人臉識別是指將提取的人臉圖像的特徵資料與資料庫中儲存的特徵範本比如人臉特徵範本進行搜索匹配，根據相似程度確定人臉身份資訊。如在對進入企業的員工進行人臉識別時，在資料庫中提前儲存了企業中各個員工的特徵範本，例如人臉特徵範本，從而通過將當前提取的人臉圖像的特徵資料與資料庫中儲存的人臉特徵範本比對得到員工的人臉身份資訊。人臉身份資訊的具體內容可根據需要自訂，如員工名字、工號、所屬部門等。 In some embodiments of the present application, the state of the face refers to the state of the deflection angle of the face. When a human face is detected as a positive face, the corresponding face identity information can be obtained through recognition by a face recognition algorithm. Face identity information is used to describe the identity corresponding to the face. Face recognition refers to searching and matching feature information of the extracted face image with a feature template stored in a database, such as a face feature template, and determining face identity information based on similarity. For example, when performing face recognition on employees who enter the enterprise, a feature template of each employee in the enterprise, such as a face feature template, is stored in advance in the database, so that the feature data and database of the currently extracted face image are stored in the database. The facial feature templates stored in the comparison are used to obtain the employee's facial identity information. The specific content of the face identity information can be customized according to needs, such as employee name, job number, and department.

圖像特徵提取演算法是根據圖像本身的特徵，如顏色特徵、紋理特徵、形狀特徵、空間關係特徵等提取特徵資料，得到目標特徵，其中，所述目標特徵是提取得到的所有特徵資料的集合。為目標特徵與人臉身份資訊建立關聯關係，如衣服顏色、衣服紋理、人體形狀，身高比例等特徵，並將關聯關係儲存在資料庫中。這樣，當人臉存在偏轉、遮蓋時，可通過其它的目標特徵進行身份的識別和確定人臉區域。在本申請一個實施例中，如圖5所示，步驟S230中在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標的步驟包括： The image feature extraction algorithm extracts feature data based on the features of the image itself, such as color features, texture features, shape features, and spatial relationship features, to obtain the target features, where the target features are all the feature data extracted. set. Establish associations between target features and face identity information, such as clothing color, clothing texture, body shape, height ratio, and other characteristics, and store the associations in the database. In this way, when the face is deflected and covered, other target features can be used to identify the identity and determine the face area. In an embodiment of the present application, as shown in FIG. 5, in step S230, the face region is identified according to the face detection algorithm in the current video frame, and the steps of obtaining the current target to be tracked corresponding to the current video frame include:

步驟S231，判斷當前視頻幀根據人臉檢測演算法是否識別到人臉區域，如果沒有識別到人臉區域，則根據圖像特徵提取演算法獲取當前視頻幀對應的當前圖像特徵。 In step S231, it is determined whether the current video frame recognizes a face area based on the face detection algorithm. If no face area is recognized, the current video frame corresponding to the current video frame is obtained according to the image feature extraction algorithm.

具體的，如果根據人臉檢測演算法在當前視頻幀中沒有識別到人臉區域，也有可能是由於人臉偏側導致檢測失敗，此時需要根據圖像特徵提取演算法獲取當前視頻幀對應的當前圖像特徵。 Specifically, if the facial area is not recognized in the current video frame according to the face detection algorithm, the detection may also be due to the side of the face. At this time, the algorithm corresponding to the current image frame needs to be obtained to obtain the corresponding video frame. The current image feature.

步驟S232，將當前圖像特徵與目標特徵對比得到匹配的目標人臉身份資訊，根據目標人臉身份資訊得到當前視頻幀對應的當前待跟蹤目標。 Step S232: Compare the current image features with the target features to obtain matching target face identity information, and obtain the current target to be tracked corresponding to the current video frame according to the target face identity information.

具體的，由於之前已經將目標特徵與人臉身份資訊關聯，此時可將當前圖像特徵與目標特徵對比，計算相似度，如果相似度超過閾值，則匹配成功，可獲取匹配的目標特徵對應的目標人臉身份資訊，從而根據目標人臉身份資訊得到當前視頻幀對應的當前待跟蹤目標。然後，通過人臉身份資訊將當前待跟蹤目標與第一待跟蹤目標進行匹配，從而實現對第一待跟蹤目標的跟蹤。 Specifically, since the target feature has been associated with face identity information before, the current image feature can be compared with the target feature to calculate the similarity. If the similarity exceeds the threshold, the matching is successful and the corresponding target feature correspondence can be obtained. Target face identity information of, so as to obtain the current target to be tracked corresponding to the current video frame according to the target face identity information. Then, the current target to be tracked is matched with the first target to be tracked through the face identity information, so that the first target to be tracked is tracked.

本申請實施例中，將人臉身份資訊引入目標跟蹤，在人臉識別的同時結合圖像特徵，在人臉檢測演算法無法識別人臉區域時也能達到對目標的跟蹤，進一步提高跟蹤的連續性和穩健(robust)性。 In the embodiment of the present application, face identity information is introduced into target tracking, and image features are combined with face recognition, and the target tracking can be achieved when the face detection algorithm cannot identify the face area, which further improves the tracking accuracy. Continuity and robustness.

在本申請一個實施例中，步驟S220可包括：獲取第一待跟蹤目標對應的第一人臉身份資訊，建立第一人臉身份資訊對應的第一人臉特徵集合，將第一人臉特徵加入所述第一人臉特徵集合並將所述第一人臉特徵集合儲存至第一待跟蹤目標對應的特徵庫。 In an embodiment of the present application, step S220 may include: acquiring first facial identity information corresponding to the first target to be tracked, establishing a first facial feature set corresponding to the first facial identity information, and integrating the first facial feature Adding the first face feature set and storing the first face feature set to a feature library corresponding to the first target to be tracked.

具體的，可對第一待跟蹤目標進行人臉識別得到第一待跟蹤目標對應的第一人臉身份資訊。第一人臉特徵集合用於儲存第一待跟蹤目標在運動過程中不同狀態下的第一人臉特徵，不同狀態包括不同角度、不同光照、不同遮蓋範圍等。將人臉特徵提取後得到的第一人臉特徵加入第一人臉特徵集合，並為所述第一人臉特徵集合與第一人臉身份資訊建立關聯關係，將所述關聯關係以及第一人臉特徵集合儲存至第一待跟蹤目標對應的特徵庫。 Specifically, face recognition may be performed on the first target to be tracked to obtain first face identity information corresponding to the first target to be tracked. The first face feature set is used to store the first face feature of the first target to be tracked in different states during the movement, and the different states include different angles, different illuminations, different coverage ranges, and the like. The first facial feature obtained after the extraction of facial features is added to the first facial feature set, and an association relationship is established between the first facial feature set and the first facial identity information, and the association relationship and the first The facial feature set is stored in a feature library corresponding to the first target to be tracked.

在本申請一個實施例中，如圖6所示，步驟S230中在跟蹤過程中根據提取的更新的人臉特徵更新特徵庫的步驟可包括： In an embodiment of the present application, as shown in FIG. 6, the step of updating the feature database according to the extracted updated face features in the tracking process in step S230 may include:

步驟S233，獲取當前待跟蹤目標對應的當前人臉身份資訊，從特徵庫獲取當前人臉身份資訊對應的第一人臉特徵集合。 Step S233: Acquire the current facial identity information corresponding to the current target to be tracked, and acquire the first facial feature set corresponding to the current facial identity information from the feature database.

具體的，在一個實施例中，可通過對當前待跟蹤目標進行人臉識別得到當前待跟蹤目標對應的當前人臉身份資訊。在另外一個實施例中，也可通過對當前待跟蹤目標應用圖像特徵提取演算法得到當前待跟蹤目標對應的當前圖像特徵，再將當前圖像特徵與目標特徵進行匹配，將匹配的目標特徵對應的人臉身份資訊作為當前人臉身份資訊，從而在當前待跟蹤目標無法識別到人臉區域時也能得到當前人臉身份資訊。根據人臉身份資訊與人臉特徵集合的關聯對應關係，得到當前人臉身份資訊對應的第一人臉特徵集合，表明當前待跟蹤目標與第一待跟蹤目標是同一目標。 Specifically, in one embodiment, the current face identity information corresponding to the current target to be tracked may be obtained by performing face recognition on the current target to be tracked. In another embodiment, the current image feature corresponding to the current target to be tracked can also be obtained by applying an image feature extraction algorithm to the current target to be tracked, and then the current image feature and the target feature are matched to match the matched target. The facial identity information corresponding to the feature is used as the current facial identity information, so that the current facial identity information can also be obtained when the target area to be tracked cannot recognize the facial area. According to the association between the facial identity information and the facial feature set, the first facial feature set corresponding to the current facial identity information is obtained, indicating that the current target to be tracked and the first target to be tracked are the same target.

步驟S234，計算第一人臉特徵集合中的第一人臉特徵與第二人臉特徵的差異量，如果差異量超過預設閾值，則在第一人臉特徵集合中增加第二人臉特徵。 Step S234: Calculate the difference between the first face feature and the second face feature in the first face feature set. If the difference exceeds a preset threshold, add a second face feature to the first face feature set .

具體的，可自訂演算法計算第二人臉特徵與第一人臉特徵集合中的第一人臉特徵的差異量。如果第一人臉特徵集合中的第一人臉特徵為多個，則分別計算第二人臉特徵與每個第一人臉特徵的差異量，得到多個差異量。差異量表明了第二人臉特徵與特徵庫中已經保存的同一跟蹤目標的人臉特徵之間的差異，差異越大表明跟蹤目標的人臉狀態變化越大。如果差異量超過預設閾值，則在第一人臉特徵集合中增加第二人臉特徵，增加的第二人臉特徵可用於後續進行的特徵匹配。在人臉特徵集合中儲存的人臉特徵越多，就越能表徵同一跟蹤目標在不同狀態下的特徵，只要其中任何一個特徵能在特徵匹配時匹配成功，就認為當前待跟蹤目標與第一待跟蹤目標的匹配成功，從而加大了匹配成功的概率，減小目標跟蹤過程對跟蹤目標的變化、傾斜、遮蓋、光照變化的敏感度，提高跟蹤的連續性和穩健(robust)性。 Specifically, a custom algorithm may be used to calculate the difference between the second face feature and the first face feature in the first face feature set. If there are multiple first facial features in the first facial feature set, the difference between the second facial feature and each first facial feature is calculated separately to obtain multiple differences. The amount of difference indicates the difference between the second face feature and the face feature of the same tracking target that has been saved in the feature database. A larger difference indicates a greater change in the face state of the tracking target. If the difference exceeds a preset threshold, a second face feature is added to the first face feature set, and the added second face feature can be used for subsequent feature matching. The more face features stored in the face feature set, the more it can represent the features of the same tracking target in different states. As long as any one of the features can be successfully matched when the features are matched, the current target to be tracked is considered to be the first The matching of the target to be tracked is successful, thereby increasing the probability of successful matching, reducing the sensitivity of the target tracking process to changes in the tracking target, tilt, cover, and lighting changes, and improving the continuity and robustness of tracking.

在本申請一個實施例中，步驟S220可包括：對第一待跟蹤目標通過深度神經網路進行人臉特徵提取得到第一特徵向量。 In an embodiment of the present application, step S220 may include: performing facial feature extraction on the first target to be tracked through a deep neural network to obtain a first feature vector.

具體的，對深度神經網路進行訓練後得到人臉特徵提取模型，輸入第一待跟蹤目標對應的圖元值，則得到第一特徵向量，第一特徵向量的維度由人臉特徵提取模型決定。 Specifically, a facial feature extraction model is obtained after training a deep neural network, and the first feature vector is obtained by inputting the primitive value corresponding to the target to be tracked, and the dimension of the first feature vector is determined by the facial feature extraction model. .

步驟S230包括：對當前待跟蹤目標通過深度神經網路進行人臉特徵提取得到第二特徵向量，計算第一特徵向量與第二特徵向量的歐氏距離，如果歐氏距離小於預設閾值，則確定第一待跟蹤目標與當前待跟蹤目標特徵匹配成功。 Step S230 includes: performing facial feature extraction on the current target to be tracked through a deep neural network to obtain a second feature vector, and calculating the Euclidean distance between the first feature vector and the second feature vector. If the Euclidean distance is less than a preset threshold, then It is determined that the characteristics of the first target to be tracked and the current target to be tracked are successfully matched.

具體的，輸入當前待跟蹤目標對應的圖元值至上述人臉特徵提取模型，則可得到第二特徵向量。第一特徵向量與第二特徵向量的歐氏距離代表了當前待跟蹤目標與第一待跟蹤目標的相似度。如果歐氏距離小於預設閾值，則確定當前待跟蹤目標與第一待跟蹤目標特徵匹配成功，表明當前待跟蹤目標與第一待跟蹤目標是同一目標，達到跟蹤目的。 Specifically, by inputting the primitive value corresponding to the current target to be tracked into the aforementioned facial feature extraction model, a second feature vector can be obtained. The Euclidean distance between the first feature vector and the second feature vector represents the similarity between the current target to be tracked and the first target to be tracked. If the Euclidean distance is less than a preset threshold, it is determined that the characteristics of the current target to be tracked and the first target to be tracked are successfully matched, indicating that the current target to be tracked is the same target as the first target to be tracked, and the tracking purpose is achieved.

在本申請一個實施例中，深度神經網路的網路結構可以為11層網路層，包括堆疊式的卷積神積網路和完全連接層，堆疊式的卷積神積網路由多個卷積層和maxpool層組成，具體網路結構為：conv3-64*2+LRN+max pool In an embodiment of the present application, the network structure of the deep neural network may be an 11-layer network layer, including a stacked convolutional neural network and a fully connected layer. The stacked convolutional neural network routes multiple Convolution layer and maxpool layer, the specific network structure is: conv 3-64 * 2 + LRN + max pool

conv3-128+max pool conv 3-128 + max pool

conv3-256*2+max pool conv 3-256 * 2 + max pool

conv3-512*2+max pool conv 3-512 * 2 + max pool

FC2048 FC 2048

FC1024,其中conv3表示半徑為3的卷積層，LRN表示LRN層，”maxpool”表示最大池化層(max pool)，FC表示完全連接層。在本申請一個實施例中，LRN(Local Response Normalization)層定義為一區域響應歸一化層，其用於進行歸一化的處理。在本申請一個實施例中，最大池化層定義為圖案被不重疊的分割成若干個同樣大小的小塊(pooling size)，每個小塊內只取最大的數字，再捨棄其他節點後，保持原有的平面結構。 FC 1024, where conv 3 represents a convolution layer with a radius of 3, LRN represents the LRN layer, "maxpool" represents the max pooling layer, and FC represents a fully connected layer. In an embodiment of the present application, the LRN (Local Response Normalization) layer is defined as a local response normalization layer, which is used for normalization processing. In an embodiment of the present application, the maximum pooling layer is defined as a pattern that is non-overlapping and divided into several small blocks of the same size (pooling size). Only the largest number is taken in each small block, and other nodes are discarded. Keep the original plane structure.

具體的，此網路結構為簡化的深度神經網路VGG網路結構，其中64*2表示2個64組，LRN層是一種幫助訓練的無參數層，FC2048表示輸出為2048維度向量的完全連接層，最後一個完全連接層FC1024的輸出為特徵提取得到的人臉特徵，是1024維向量。通過簡化的VGG網路結構得到的優化後的人臉特徵在測試集的隨機塊匹配上的表現遠優於TLD(Tracking-Learning-Detection，單目標長時間跟蹤)中的匹配模組的匹配表現，且大大提高了人臉特徵提取的效率，達到跟蹤演算法所要求的即時性。在本申請一個實施例中，可控制待跟蹤目標的解析度為112*112圖元，以減少計算複雜度。圖7為此VGG網路結構對應的人臉特徵提取演算法VGG-S與範本匹配演算法(match template)的匹配比對示意圖。如圖7所示，橫坐標代表召回率，縱坐標代表準確率，可見此VGG網路結構對應的人臉特徵提取演算法在進行特徵匹配時有更好的準確率，提高了目標跟蹤的正確率。 Specifically, this network structure is a simplified deep neural network VGG network structure, where 64 * 2 represents 2 64 groups, the LRN layer is a parameterless layer to help training, and FC2048 represents a full connection of 2048-dimensional vectors. Layer, the output of the last fully connected layer FC1024 is the face feature obtained by feature extraction, which is a 1024-dimensional vector. The optimized facial features obtained through the simplified VGG network structure perform much better on the random block matching of the test set than the matching performance of the matching module in the TLD (Tracking-Learning-Detection). , And greatly improve the efficiency of facial feature extraction, and achieve the instantaneousness required by the tracking algorithm. In one embodiment of the present application, the resolution of the target to be tracked can be controlled to be 112 * 112 graphics primitives to reduce the computational complexity. FIG. 7 is a schematic diagram of the matching comparison between the facial feature extraction algorithm VGG-S and the template matching algorithm (match template) corresponding to the VGG network structure. As shown in Figure 7, the abscissa represents the recall rate and the ordinate represents the accuracy rate. It can be seen that the facial feature extraction algorithm corresponding to this VGG network structure has a better accuracy rate when performing feature matching, which improves the accuracy of target tracking rate.

在本申請一個實施例中，步驟S230中在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標的步驟可包括：基於歸一化的圖元差異特徵和人體半身識別演算法在當前視頻幀中識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標。 In an embodiment of the present application, in step S230, the step of identifying a face region based on a face detection algorithm in the current video frame to obtain a current target to be tracked corresponding to the current video frame may include: based on normalized feature differences of the primitives And the human body half body recognition algorithm recognizes the face area in the current video frame, and obtains the current target to be tracked corresponding to the current video frame.

具體的，基於歸一化的圖元差異特徵(Normalized Pixel Difference，NPD)進行人臉檢測，將得到的返回值作為人臉區域推薦框，如可基於NPD特徵使用AdaBoost(疊代演算法)構造強分類器用以識別和區分人臉。人體半身識別演算法可根據需要定義，可進行上半身檢測，根據上半身檢測篩選人臉區域推薦框，可過濾掉部分識別錯誤的人臉區域推薦框，極大地提高了人臉區域檢測的召回率和準確率，提升了目標跟蹤的整體表現。 Specifically, face detection is performed based on a normalized pixel difference feature (NPD), and the obtained return value is used as a face region recommendation frame. For example, AdaBoost (iterative algorithm) can be used to construct the NPD feature A strong classifier is used to recognize and distinguish faces. The human body half body recognition algorithm can be defined as required, and upper body detection can be performed. The face area recommendation frame can be filtered based on the upper body detection, which can filter out some of the incorrectly recognized face area recommendation frames, which greatly improves the recall rate and Accuracy improves the overall performance of target tracking.

在本申請一個實施例中，如圖8所示，步驟S230中在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標的步驟可包括： In an embodiment of the present application, as shown in FIG. 8, in step S230, the face region is identified according to the face detection algorithm in the current video frame, and the step of obtaining the current target to be tracked corresponding to the current video frame may include:

步驟S235，基於歸一化的圖元差異特徵識別人臉區域，在當前視頻幀得到第一推薦區域。 In step S235, a face region is identified based on the normalized feature differences of the primitives, and a first recommended region is obtained in the current video frame.

步驟S236，根據光流分析演算法計算得到所述第一待跟蹤目標在當前視頻幀對應的第二推薦區域。 In step S236, a second recommended region corresponding to the first target to be tracked in the current video frame is obtained according to an optical flow analysis algorithm.

具體的，光流分析演算法假設一個圖元I(x,y,t)在第一幀的光強度，它移動了(dx,dy)的距離到下一幀，用了dt時間。因為圖元點是一樣的，光強度也沒有發生變化。根據歷史第一待跟蹤目標的運動軌跡採用光流分析原理計算得到第一待跟蹤目標對應的向量速度模型，向向量速度模型輸入當前視頻幀和當前視頻幀的前一幀以及第一待跟蹤目標在前一幀的位置，可得到第一待跟蹤目標在當前視頻幀對應的第二推薦區域，即第一待跟蹤目標在當前視頻幀可能出現的位置。 Specifically, the optical flow analysis algorithm assumes the light intensity of a picture element I ( x , y , t ) in the first frame, which moves a distance of ( dx , dy ) to the next frame, using dt time. Because the primitive points are the same, the light intensity has not changed. According to the historical trajectory of the first target to be tracked, the vector velocity model corresponding to the first target to be tracked is calculated using the optical flow analysis principle. The current video frame and the previous frame of the current video frame and the first target to be tracked are input to the vector speed model At the position of the previous frame, a second recommended area corresponding to the first target to be tracked in the current video frame, that is, the position where the first target to be tracked may appear in the current video frame.

步驟S237，根據第一推薦區域和第二推薦區域得到當前待跟蹤目標。 Step S237: Obtain the current target to be tracked according to the first recommendation area and the second recommendation area.

具體的，根據光流分析演算法得出的第二推薦區域為第一待跟蹤目標基於歷史運動速度可能運動至的區域，可根據第二推薦區域的位置排除與第二推薦區域位置距離超過預設範圍的第一推薦區域，從而得到當前待跟蹤目標。也可將第一推薦區域和第二推薦區域全部作為當前待跟蹤目標。如果第一待跟蹤目標為多個，則每個第一待跟蹤目標分別有對應的第二推薦區域。 Specifically, the second recommended area obtained according to the optical flow analysis algorithm is an area to which the first target to be tracked may move based on the historical motion speed, and the distance between the position of the second recommended area and the distance of the second recommended area may exceed the predetermined Set the first recommendation area of the range to get the current target to be tracked. The first recommendation area and the second recommendation area may also be used as the current target to be tracked. If there are multiple first targets to be tracked, each first target to be tracked has a corresponding second recommendation area.

本實施例中，將歸一化的圖元差異特徵與光流分析演算法結合得到當前待跟蹤目標，因為先驗資訊的加入使得後續進行特徵匹配時準確率提高。 In this embodiment, the normalized feature differences of the picture elements are combined with the optical flow analysis algorithm to obtain the current target to be tracked, because the addition of a priori information makes the accuracy rate increase in subsequent feature matching.

在一個實施例中，步驟S237可包括：根據幀間相關性進行運動預測得到預期運動範圍，根據預期運動範圍篩選第一推薦區域和第二推薦區域得到當前待跟蹤目標。 In one embodiment, step S237 may include: performing motion prediction according to the inter-frame correlation to obtain an expected motion range, and filtering the first recommendation region and the second recommendation region according to the expected motion range to obtain a current target to be tracked.

具體的，幀間相關性利用歷史位置資訊和運動軌跡來預測目標在下一幀或數幀內的位置，相當於利用先驗資訊來調整NPD演算法的可信度。將預期運動範圍外的第一推薦區域和第二推薦區域過濾掉，得到當前待跟蹤目標，減少了後續計算特徵匹配的匹配數量，提高了匹配效率和準確率。 Specifically, inter-frame correlation uses historical position information and motion trajectories to predict the position of the target in the next frame or frames, which is equivalent to using prior information to adjust the credibility of the NPD algorithm. The first recommendation area and the second recommendation area outside the expected motion range are filtered to obtain the current target to be tracked, which reduces the number of matching of subsequent feature matching calculations, and improves the matching efficiency and accuracy.

在本申請一個實施例中，視頻目標跟蹤方法可通過如圖9所示的三個模組完成視頻目標跟蹤，包括跟蹤模組310、檢測模組320、以及學習模組330。具體地，獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標，從第一待跟蹤目標所在的視頻幀開始跟蹤，跟蹤模組310對第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵加入特徵庫，學習模組330根據跟蹤情況更新特徵庫，檢測模組320不斷從當前視頻幀中查找更好的當前待跟蹤目標，以防跟錯和跟丟，跟蹤模組310根據更新的特徵庫將當前待跟蹤目標和第一待跟蹤目標進行匹配，以跟蹤第一待跟蹤目標。 In an embodiment of the present application, the video target tracking method may complete video target tracking through three modules as shown in FIG. 9, including a tracking module 310, a detection module 320, and a learning module 330. Specifically, a video stream is obtained, a face area is identified according to a face detection algorithm, a first target to be tracked corresponding to the first video frame is obtained, and tracking is started from the video frame where the first target to be tracked is located. A target to be tracked obtains a first face feature by extracting a face feature based on a deep neural network, and adds the first face feature to a feature database. The learning module 330 updates the feature database according to the tracking situation, and the detection module 320 Constantly find better current targets to be tracked from the current video frame to prevent errors and follow-ups. The tracking module 310 matches the current target to be tracked with the first target to be tracked according to the updated feature library to track the first Target to be tracked.

在本申請一個實施例中，採用上述視頻目標跟蹤方法得到的跟蹤區域示意圖可如圖10所示，採用TLD跟蹤演算法得到的跟蹤區域示意圖可如圖11所示。通過對比可以發現，在人臉偏側時，本申請實施例提出的視頻目標跟蹤方法的跟蹤區域比TLD跟蹤演算法的跟蹤區域更為精確，且TLD跟蹤演算法在人臉完全偏轉時會出現跟蹤失敗的現象，而本申請實施例提出的視頻目標跟蹤方法在人臉完全偏轉時仍然能夠跟蹤成功。正確率和召回率相比於TLD跟蹤演算法均有提升，具體資料如下：無人頭檢測版本：準確率提升5個百分點左右，錯誤率降低100%，目標跟蹤丟失率下降25%。 In an embodiment of the present application, a schematic diagram of a tracking area obtained by using the foregoing video target tracking method may be shown in FIG. 10, and a schematic diagram of a tracking area obtained by using a TLD tracking algorithm may be shown in FIG. 11. By comparison, it can be found that when the face is sideways, the tracking area of the video target tracking method proposed in the embodiment of the present application is more accurate than the tracking area of the TLD tracking algorithm, and the TLD tracking algorithm will appear when the face is completely deflected The tracking failure phenomenon, and the video target tracking method proposed in the embodiment of the present application can still track success when the human face is completely deflected. The accuracy rate and recall rate are improved compared to the TLD tracking algorithm. The specific information is as follows: No head detection version: The accuracy rate is increased by about 5 percentage points, the error rate is reduced by 100%, and the target tracking loss rate is reduced by 25%.

有人頭檢測版本：準確率提升1個百分點左右，錯誤率降低100%，目標跟蹤丟失率下降15%。 Human head detection version: the accuracy rate is increased by about 1 percentage point, the error rate is reduced by 100%, and the target tracking loss rate is reduced by 15%.

在性能方面，例如在640*480的解析度下，3.5G主頻的CPU和NvidiaGeforceGtx 775m顯示卡型號的機器，單幀處理時間在40ms左右，幀率在25FPS(幀/每秒)以上。 In terms of performance, for example, at a resolution of 640 * 480, a CPU with a 3.5G frequency and a Nvidia GeforceGtx 775m graphics card model machine, the single frame processing time is about 40ms, and the frame rate is above 25FPS (frames per second).

上述視頻目標跟蹤方法比傳統方法更精準，給後續的人員人流統計、身份識別和行為分析等需求提供了可能和便利，性能上的良好表現也滿足了線上處理的需求，提高了監控分析系統的準確性、拓展性和適用性，進而提高了硬體處理器的處理速度，提高了處理器的處理性能。 The above video target tracking method is more accurate than the traditional method, which provides the possibility and convenience for the follow-up personnel flow statistics, identity recognition and behavior analysis. The good performance also meets the needs of online processing and improves the monitoring and analysis system. Accuracy, scalability, and applicability have further improved the processing speed of the hardware processor and the processing performance of the processor.

在本申請一個實施例中，如圖12所示，提供了一種視頻目標跟蹤裝置，該裝置可包括：檢測模組410，用於獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標。 In an embodiment of the present application, as shown in FIG. 12, a video target tracking device is provided. The device may include a detection module 410 for acquiring a video stream, identifying a face area according to a face detection algorithm, and obtaining The first target to be tracked corresponding to the first video frame.

人臉特徵提取模組420，用於對所述第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入所述第一待跟蹤目標對應的特徵庫。 A face feature extraction module 420 is configured to obtain a first face feature by extracting a face feature based on a deep neural network on the first target to be tracked, and store the first face feature in the first A feature library corresponding to the target to be tracked.

檢測模組410還用於在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標。 The detection module 410 is further configured to identify a face area according to a face detection algorithm in a current video frame, and obtain a current target to be tracked corresponding to the current video frame.

人臉特徵提取模組420還用於對當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵。 The facial feature extraction module 420 is further configured to obtain a second facial feature by extracting a facial feature based on a deep neural network for the current target to be tracked.

跟蹤模組430，用於根據第二人臉特徵和所述特徵庫將當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從第一視頻幀開始跟蹤第一待跟蹤目標。 The tracking module 430 is configured to perform feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library, so as to track the first target to be tracked from the first video frame.

學習模組440，用於在跟蹤過程中根據提取的更新的人臉特徵更新所述特徵庫。 A learning module 440 is configured to update the feature database according to the extracted updated face features during the tracking process.

在本申請一個實施例中，如圖13所示，該裝置還包括：特徵身份處理模組450，用於根據待跟蹤目標的人臉狀態通過人臉識別演算法識別得到對應的人臉身份資訊，根據圖像特徵提取演算法得到人臉身份資訊對應的目標特徵，並為所述目標特徵和人臉身份資訊建立關聯關係。 In an embodiment of the present application, as shown in FIG. 13, the device further includes a feature identity processing module 450 configured to obtain corresponding face identity information through a face recognition algorithm according to the face state of the target to be tracked. According to the image feature extraction algorithm, a target feature corresponding to the facial identity information is obtained, and an association relationship is established between the target feature and the facial identity information.

檢測模組410可包括：圖像特徵提取單元411，用於判斷在當前視頻幀根據人臉檢測演算法是否識別到人臉區域，如果沒有識別到人臉區域，則根據圖像特徵提取演算法獲取當前視頻幀對應的當前圖像特徵。 The detection module 410 may include: an image feature extraction unit 411, configured to determine whether a face area is recognized according to a face detection algorithm in the current video frame; if no face area is identified, an algorithm is extracted based on the image features Get the current image feature corresponding to the current video frame.

身份匹配單元412，用於基於所述關聯關係，將當前圖像特徵與目標特徵對比得到匹配的目標人臉身份資訊。 An identity matching unit 412 is configured to compare current image features with target features to obtain matching target face identity information based on the association relationship.

第一跟蹤目標確定單元413，用於根據目標人臉身份資訊得到當前視頻幀對應的當前待跟蹤目標。 The first tracking target determining unit 413 is configured to obtain a current target to be tracked corresponding to the current video frame according to the target face identity information.

在本申請一個實施例中，人臉特徵提取模組420還用於獲取第一待跟蹤目標對應的第一人臉身份資訊，建立第一人臉身份資訊對應的第一人臉特徵集合，將第一人臉特徵加入第一人臉特徵集合並將所述第一人臉特徵集合儲存至所述特徵庫。 In an embodiment of the present application, the facial feature extraction module 420 is further configured to obtain first facial identity information corresponding to the first target to be tracked, establish a first facial feature set corresponding to the first facial identity information, and The first face feature is added to the first face feature set and the first face feature set is stored in the feature database.

學習模組440還用於獲取當前待跟蹤目標對應的當前人臉身份資訊，從特徵庫獲取當前人臉身份資訊對應的第一人臉特徵集合，計算第一人臉特徵集合中的第一人臉特徵與第二人臉特徵的差異量，如果差異量超過預設閾值，則在第一人臉特徵集合中增加第二人臉特徵。 The learning module 440 is also used to obtain the current face identity information corresponding to the current target to be tracked, obtain the first face feature set corresponding to the current face identity information from the feature library, and calculate the first person in the first face feature set. The amount of difference between the face feature and the second face feature. If the difference exceeds a preset threshold, a second face feature is added to the first face feature set.

在本申請一個實施例中，檢測模組410還用於基於歸一化的圖元差異特徵和人體半身識別演算法在當前視頻幀中識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標。 In an embodiment of the present application, the detection module 410 is further configured to identify a face region in a current video frame based on a normalized feature difference of a picture element and a human body half-length recognition algorithm to obtain a current target to be tracked corresponding to the current video frame. .

在本申請一個實施例中，如圖14所示，檢測模組410可包括：第一推薦單元414，用於基於歸一化的圖元差異特徵識別人臉區域，在當前視頻幀得到第一推薦區域。 In an embodiment of the present application, as shown in FIG. 14, the detection module 410 may include a first recommendation unit 414 for identifying a face region based on the normalized feature differences of the primitives, and obtaining the first region in the current video frame. Recommended area.

第二推薦單元415，根據光流分析演算法計算得到第一待跟蹤目標在當前視頻幀對應的第二推薦區域。 The second recommendation unit 415 obtains, according to the optical flow analysis algorithm, a second recommendation region corresponding to the first target to be tracked in the current video frame.

第二跟蹤目標確定單元416，用於根據第一推薦區域和第二推薦區域得到當前待跟蹤目標。 The second tracking target determination unit 416 is configured to obtain a current target to be tracked according to the first recommendation area and the second recommendation area.

在本申請一個實施例中，第二跟蹤目標確定單元416還用於根據幀間相關性進行運動預測得到預期運動範圍，根據預期運動範圍篩選第一推薦區域和第二推薦區域得到當前待跟蹤目標。 In an embodiment of the present application, the second tracking target determination unit 416 is further configured to perform motion prediction according to the inter-frame correlation to obtain an expected motion range, and filter the first recommended region and the second recommended region to obtain the current target to be tracked according to the expected motion range. .

在本申請一個實施例中，深度神經網路的網路結構為11層網路層，包括堆疊式的卷積神積網路和完全連接層，堆疊式的卷積神積網路由多個卷積層和maxpool層組成，具體網路結構為：conv3-64*2+LRN+max pool In an embodiment of the present application, the network structure of the deep neural network is an 11-layer network layer, including a stacked convolutional neural network and a fully connected layer. The stacked convolutional neural network routes multiple volumes. It consists of multi-layer and maxpool layer. The specific network structure is: conv 3-64 * 2 + LRN + max pool

conv3-128+max pool conv 3-128 + max pool

conv3-256*2+max pool conv 3-256 * 2 + max pool

conv3-512*2+max pool conv 3-512 * 2 + max pool

FC2048 FC 2048

FC1024,其中conv3表示半徑為3的卷積層，LRN表示LRN層，maxpool表示最大池化層，FC表示完全連接層。 FC 1024, where conv 3 represents a convolution layer with a radius of 3, LRN represents the LRN layer, maxpool represents the maximum pooling layer, and FC represents a fully connected layer.

在本申請一個實施例中，人臉特徵提取模組420還用於對第一待跟蹤目標通過深度神經網路進行人臉特徵提取得到第一特徵向量，對當前待跟蹤目標通過深度神經網路進行人臉特徵提取得到第二特徵向量。 In an embodiment of the present application, the facial feature extraction module 420 is further configured to perform facial feature extraction on the first target to be tracked through a deep neural network to obtain a first feature vector, and to perform a deep neural network on the current target to be tracked. Face feature extraction is performed to obtain a second feature vector.

跟蹤模組430還用於計算第一特徵向量與第二特徵向量的歐氏距離，如果所述歐氏距離小於預設閾值，則確定所述第一待跟蹤目標與當前待跟蹤目標特徵匹配成功。 The tracking module 430 is further configured to calculate the Euclidean distance between the first feature vector and the second feature vector. If the Euclidean distance is less than a preset threshold, it is determined that the features of the first target to be tracked and the current target to be tracked are successfully matched. .

圖15是本申請實施例提供的視頻目標跟蹤裝置的另一結構示意圖。如圖15所示，該視頻目標跟蹤裝置包括：處理器510，與所述處理器510相連接的記憶體520，以及用於發送和接收資料的介面530。所述記憶體520中儲存有可由所述處理器510執行的機器可讀指令模組，所述所述機器可讀指令模組包括：檢測模組521，用於獲取視頻流，根據人臉檢測演算法識別人臉區域，得到第一視頻幀對應的第一待跟蹤目標。 FIG. 15 is another schematic structural diagram of a video target tracking device according to an embodiment of the present application. As shown in FIG. 15, the video target tracking device includes a processor 510, a memory 520 connected to the processor 510, and an interface 530 for sending and receiving data. The memory 520 stores a machine-readable instruction module that can be executed by the processor 510. The machine-readable instruction module includes a detection module 521 for acquiring a video stream, and detecting the video stream based on a face. The algorithm recognizes the face area and obtains the first target to be tracked corresponding to the first video frame.

人臉特徵提取模組522，用於對所述第一待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第一人臉特徵，並將所述第一人臉特徵存入所述第一待跟蹤目標對應的特徵庫。 A face feature extraction module 522 is configured to obtain the first face feature by extracting the face feature based on a deep neural network from the first target to be tracked, and store the first face feature into the first A feature library corresponding to the target to be tracked.

檢測模組521還用於在當前視頻幀根據人臉檢測演算法識別人臉區域，得到當前視頻幀對應的當前待跟蹤目標。 The detection module 521 is further configured to identify a face area according to a face detection algorithm in a current video frame, and obtain a current target to be tracked corresponding to the current video frame.

人臉特徵提取模組522還用於對當前待跟蹤目標通過基於深度神經網路的人臉特徵提取得到第二人臉特徵。 The facial feature extraction module 522 is further configured to obtain a second facial feature by using a deep neural network-based facial feature extraction for the current target to be tracked.

跟蹤模組523，用於根據第二人臉特徵和所述特徵庫將當前待跟蹤目標與第一待跟蹤目標進行特徵匹配，以從第一視頻幀開始跟蹤第一待跟蹤目標。 The tracking module 523 is configured to perform feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library, so as to track the first target to be tracked from the first video frame.

學習模組524，用於在跟蹤過程中根據提取的更新的人臉特徵更新所述特徵庫。 A learning module 524 is configured to update the feature database according to the extracted updated face features during the tracking process.

在本申請一個實施例中，如圖16所示，所述機器可讀指令模組還可包括：特徵身份處理模組525，用於根據待跟蹤目標的人臉狀態通過人臉識別演算法識別得到對應的人臉身份資訊，根據圖像特徵提取演算法得到人臉身份資訊對應的目標特徵，並為所述目標特徵和人臉身份資訊建立關聯關係。 In an embodiment of the present application, as shown in FIG. 16, the machine-readable instruction module may further include: a feature identity processing module 525, configured to recognize by a face recognition algorithm according to a face state of a target to be tracked. The corresponding face identity information is obtained, and the target feature corresponding to the face identity information is obtained according to the image feature extraction algorithm, and an association relationship is established between the target feature and the face identity information.

在本申請實施例中，上述檢測模組521、人臉特徵提取模組522、跟蹤模組523、學習模組524以及特徵身份處理模組525的具體功能和實現方式可參照前述的模組410至450的相關描述，在此不再贅述。 In the embodiment of the present application, the specific functions and implementation methods of the detection module 521, face feature extraction module 522, tracking module 523, learning module 524, and feature identity processing module 525 can refer to the aforementioned module 410. Relevant descriptions up to 450 are not repeated here.

本領域普通技術人員可以理解實現上述實施例方法中的全部或部分流程，是可以通過電腦程式來指令相關的硬體來完成，所述程式可儲存于一非易失性電腦可讀取儲存介質中，如本申請實施例中，該程式可儲存於電腦系統的儲存介質中，並被該電腦系統中的至少一個處理器執行，以實現包括如上述各方法的實施例的流程。其中，所述儲存介質可為磁碟、光碟、唯讀儲存記憶體(Read-Only Memory，ROM)或隨機儲存記憶體(Random Access Memory，RAM)等。 A person of ordinary skill in the art can understand that all or part of the processes in the method of the above embodiments can be implemented by using a computer program to instruct related hardware. The program can be stored in a non-volatile computer-readable storage medium. In the embodiment of the present application, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system to implement the processes including the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM).

通過以上的實施例的描述，本領域的技術人員可以清楚地瞭解到本申請實施例可借助軟體加必需的通用硬體平臺的方式來實現，即通過機器可讀指令來指令相關的硬體來實現，當然也可以通過硬體，但很多情況下前者是更佳的實施方式。基於這樣的理解，本申請實施例的技術方案本質上或者說對現有技術做出貢獻的部分可以以軟體產品的形式體現出來，該電腦軟體產品儲存在一個儲存介質中，包括若干指令用以使得一台終端設備(例如是手機，個人電腦，伺服器，或者網路設備等)執行本申請各個實施例所述的方法。 Through the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of the present application can be implemented by means of software plus necessary universal hardware platform, that is, the relevant hardware is instructed by machine-readable instructions. Implementation can of course also be done in hardware, but in many cases the former is a better implementation. Based on such an understanding, the technical solutions of the embodiments of the present application that are essentially or contribute to the existing technology can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make A terminal device (for example, a mobile phone, a personal computer, a server, or a network device) executes the methods described in the embodiments of the present application.

以上所述實施例的各技術特徵可以進行任意的組合，為使描述簡潔，未對上述實施例中的各個技術特徵所有可能的組合都進行描述，然而，只要這些技術特徵的組合不存在矛盾，都應當認為是本說明書記載的範圍。 The technical features of the embodiments described above can be arbitrarily combined. In order to make the description concise, all possible combinations of the technical features in the above embodiments have not been described. However, as long as there is no contradiction in the combination of these technical features, It should be considered as the scope described in this specification.

雖然本申請已用較佳實施例揭露如上，然其並非用以限定本申請，本申請所屬技術領域中具有通常知識者，在不脫離本申請之精神和範圍內，當可作各種之更動與潤飾，因此本申請之保護範圍當視後附之申請專利範圍所界定者為準。 Although the present application has been disclosed as above with preferred embodiments, it is not intended to limit the present application. Those with ordinary knowledge in the technical field to which this application belongs may make various changes and modifications without departing from the spirit and scope of this application. Retouching, so the protection scope of this application shall be determined by the scope of the appended patent application.

Claims

A video target tracking method applied to a terminal or a server. The method includes: acquiring a video stream, identifying a face area according to a face detection algorithm, and obtaining a first target to be tracked corresponding to a first video frame; A target to be tracked obtains a first face feature through facial feature extraction based on a deep neural network, and the first face feature is stored in a feature library corresponding to the first target to be tracked; The face detection algorithm recognizes a face region, obtains a current target to be tracked corresponding to the current video frame, and obtains a second face feature for the current target to be tracked based on facial feature extraction based on a deep neural network. Two face features and the feature library perform feature matching between the current target to be tracked and the first target to be tracked to track the first target to be tracked from the first video frame, and are extracted according to the tracking process. The updated face features update the feature library.

According to the method described in item 1 of the patent application scope, the method further comprises: identifying the corresponding face identity information through a face recognition algorithm according to the face state of the target to be tracked, and extracting the algorithm to obtain Describe the target feature corresponding to the face identity information, and establish an association relationship between the target feature and the face identity information; and identify the face area according to the face detection algorithm in the current video frame to obtain the current target corresponding to the current video frame. The step of tracking the target includes: judging whether a face area is recognized according to a face detection algorithm in the current video frame, and if no face area is identified, obtaining a current image feature corresponding to the current video frame according to an image feature extraction algorithm ; Comparing the current image features with the target features to obtain matching target face identity information based on the association relationship; and obtaining the current target to be tracked corresponding to the current video frame based on the target face identity information.

The method according to item 1 of the scope of patent application, wherein the first face feature is extracted from the deep neural network-based face feature to the first target to be tracked, and the first face feature is obtained. The step of storing the feature database corresponding to the first target to be tracked includes: obtaining first face identity information corresponding to the first target to be tracked; establishing a first face feature set corresponding to the first face identity information, Adding the first facial feature to the first facial feature set and storing the first facial feature set to the feature database; and updating the updated facial feature based on the extracted updated facial feature during the tracking process The steps of the feature database include: obtaining current face identity information corresponding to the current target to be tracked; obtaining a first face feature set corresponding to the current face identity information from the feature database; and calculating the first face feature The amount of difference between the first face feature in the set and the second face feature. If the difference exceeds a preset threshold, the second face feature is added to the first face feature set.

According to the method described in item 1 of the scope of patent application, the step of identifying a face region according to a face detection algorithm in a current video frame to obtain a current target to be tracked corresponding to the current video frame includes: based on a normalized primitive The difference feature and the human body half body recognition algorithm identify the face area in the current video frame, and obtain the current target to be tracked corresponding to the current video frame.

According to the method described in item 1 of the scope of patent application, the step of identifying a face region according to a face detection algorithm in a current video frame to obtain a current target to be tracked corresponding to the current video frame includes: based on a normalized primitive Differential features identify the face area, and obtain a first recommended area in the current video frame; calculate a second recommended area corresponding to the first target to be tracked in the current video frame according to the optical flow analysis algorithm; and according to the first recommended area And the second recommendation area to obtain the current target to be tracked.

According to the method described in claim 5, the step of obtaining the current target to be tracked according to the first recommendation area and the second recommendation area includes: performing motion prediction based on inter-frame correlation to obtain an expected motion. Range, and filtering the first recommendation region and the second recommendation region according to the expected motion range to obtain the current target to be tracked.

According to the method described in any of claims 1 to 6, the network structure of the deep neural network is an 11-layer network layer, including a stacked convolutional neural network and a fully connected layer. The stacked convolutional neural network consists of multiple convolutional layers and maxpool layers. The specific network structure is: conv 3-64 * 2 + LRN + max pool conv 3-128 + max pool conv 3-256 * 2 + max pool conv 3-512 * 2 + max pool conv 3-512 * 2 + max pool FC 2048 FC 1024, where conv 3 represents a convolution layer with a radius of 3, LRN represents the LRN layer, maxpool represents the maxpool layer, and FC represents a fully connected layer .

According to the method described in any one of claims 1 to 6, the first target feature is obtained by extracting a facial feature based on a deep neural network on the first target to be tracked, and the first The step of storing facial features into a feature library corresponding to the first target to be tracked includes: performing facial feature extraction on the first target to be tracked to obtain a first feature vector through a deep neural network; The target to be tracked obtains a second face feature by extracting a facial feature based on a deep neural network, and performs feature matching between the current target to be tracked and the first target to be tracked according to the second face feature and the feature library. The step of tracking the first target to be tracked from the first video frame includes: performing facial feature extraction on the current target to be tracked through a deep neural network to obtain a second feature vector; and calculating the first feature vector. The Euclidean distance between the feature vector and the second feature vector. If the Euclidean distance is less than a preset threshold, it is determined that the features of the first target to be tracked and the current target to be tracked are successfully matched.

A video target tracking device includes a processor and a memory connected to the processor. The memory stores a machine-readable instruction module executable by the processor; the machine may The read instruction module includes: a detection module for obtaining a video stream, identifying a face area according to a face detection algorithm, and obtaining a first target to be tracked corresponding to a first video frame; a face feature extraction module for detecting The first target to be tracked obtains a first face feature by extracting a facial feature based on a deep neural network, and stores the first face feature in a feature library corresponding to the first target to be tracked; The detection module is further configured to identify a face area according to a face detection algorithm in the current video frame to obtain a current target to be tracked corresponding to the current video frame; the face feature extraction module is further configured to perform a target detection on the current target A second facial feature is obtained through facial feature extraction based on a deep neural network; a tracking module is configured to compare the current target to be tracked with the first facial feature according to the second facial feature and the feature library. Feature matching is performed on a target to be tracked to track the first target to be tracked from the first video frame; a learning module is used to update the feature database according to the extracted updated face features during the tracking process.

According to the device described in claim 9 of the patent application scope, the device further includes: a feature identity processing module, which is used to obtain corresponding face identity information through a face recognition algorithm according to the face state of the target to be tracked. An image feature extraction algorithm obtains a target feature corresponding to the face identity information, and establishes an association relationship between the target feature and the face identity information; the detection module includes an image feature extraction unit for determining whether Whether the current video frame recognizes a facial area based on the face detection algorithm. If no facial area is recognized, the algorithm extracts the current image features corresponding to the current video frame according to the image feature extraction algorithm; the identity matching unit is used to The association relationship compares the current image feature with the target feature to obtain matching target face identity information; a first tracking target determination unit is configured to obtain a corresponding video frame corresponding to the current video frame according to the target face identity information; The target currently being tracked.

According to the device described in item 9 of the scope of patent application, the face feature extraction module is further configured to obtain first face identity information corresponding to a first target to be tracked, and establish a first face identity information corresponding to the first face identity information. A face feature set, adding the first face feature to the first face feature set and storing the first face feature set to the feature database; the learning module is further configured to obtain the current The current face identity information corresponding to the target to be tracked, obtain a first face feature set corresponding to the current face identity information from the feature database, and calculate a first face feature and a first face feature set in the first face feature set. A difference amount of the second face feature, and if the difference amount exceeds a preset threshold, adding the second face feature to the first face feature set.

According to the device described in item 9 of the scope of patent application, the detection module is further configured to identify a face region in a current video frame based on a normalized feature difference of a picture element and a human half-length recognition algorithm to obtain a current video frame correspondence. The current target to track.

According to the device described in claim 9 of the patent application scope, the detection module includes: a first recommendation unit for identifying a face region based on the normalized feature differences of the picture elements, and obtaining the first recommendation region in the current video frame; A second recommendation unit calculates and obtains, according to an optical flow analysis algorithm, a second recommendation area corresponding to the first target to be tracked in the current video frame; a second tracking target determination unit is configured to, according to the first recommendation area and the The second recommendation area obtains the current target to be tracked.

According to the device described in claim 13 of the patent application scope, the second tracking target determination unit is further configured to perform motion prediction based on inter-frame correlation to obtain an expected motion range, and filter the first recommended region and The second recommendation region obtains the current target to be tracked.

According to the device described in any of claims 9 to 14, the facial feature extraction module is further configured to perform facial feature extraction on the first target to be tracked through a deep neural network to obtain a first feature vector. , Performing facial feature extraction on the current target to be tracked through a deep neural network to obtain a second feature vector; the tracking module is further configured to calculate an Euclidean distance between the first feature vector and the second feature vector, if If the Euclidean distance is less than a preset threshold, it is determined that the features of the first target to be tracked and the characteristics of the current target to be tracked are successfully matched.

A non-transitory computer-readable storage medium storing machine-readable instructions stored in the storage medium. The machine-readable instructions can be executed by a processor to complete the following operations: acquiring a video stream and identifying the video stream based on a face detection algorithm. A face region to obtain a first target to be tracked corresponding to a first video frame; to obtain a first face feature from a deep neural network-based face feature extraction for the first target to be tracked, and Face features are stored in a feature library corresponding to the first target to be tracked; the face region is identified according to a face detection algorithm in the current video frame to obtain the current target to be tracked corresponding to the current video frame, and the current target to be tracked is passed A second facial feature is obtained by extracting a facial feature based on a deep neural network, and the current target to be tracked and the first target to be tracked are subjected to feature matching according to the second facial feature and the feature library to obtain The first video frame starts tracking the first target to be tracked, and the feature database is updated according to the extracted updated face features during the tracking process.

As the non-volatile computer-readable storage medium described in item 16 of the scope of patent application, the machine-readable instructions may be executed by the processor to complete the following operations: face recognition through face recognition according to the face state of the target to be tracked The corresponding facial identity information is obtained by algorithm recognition, and the target feature corresponding to the facial identity information is obtained by extracting the algorithm according to image features, and an association relationship is established between the target feature and the facial identity information; The frame recognizes the face area according to the face detection algorithm and obtains the current target to be tracked corresponding to the current video frame. The steps include: judging whether the face area is recognized according to the face detection algorithm in the current video frame, and if no face is recognized Region, the current image feature corresponding to the current video frame is obtained according to the image feature extraction algorithm; based on the association relationship, the current image feature is compared with the target feature to obtain matching target face identity information; according to The target face identity information obtains the current target to be tracked corresponding to the current video frame.

According to the non-volatile computer-readable storage medium described in item 16 of the scope of the patent application, the first face feature is extracted from the face feature based on a deep neural network for the first target to be tracked, and The step of storing the first face feature into a feature library corresponding to the first target to be tracked includes: obtaining first face identity information corresponding to the first target to be tracked; and establishing a first face identity information corresponding to the first face to be tracked. A first face feature set, adding the first face feature to the first face feature set, and storing the first face feature set to the feature database; the tracking process is based on the extracted Updated face features The steps of updating the feature database include: obtaining current face identity information corresponding to the current target to be tracked; obtaining the first face feature set corresponding to the current face identity information from the feature database; calculating The amount of difference between the first face feature and the second face feature in the first face feature set, and if the difference exceeds a preset threshold, adding the Describe The two facial features.

According to the non-volatile computer-readable storage medium described in item 16 of the scope of the patent application, the step of identifying a face area based on a face detection algorithm in a current video frame and obtaining a current target to be tracked corresponding to the current video frame includes : Recognize the face area in the current video frame based on the normalized feature differences of the primitives and the human half-length recognition algorithm to obtain the current target to be tracked corresponding to the current video frame.

According to the non-volatile computer-readable storage medium described in item 16 of the scope of the patent application, the step of identifying a face area based on a face detection algorithm in a current video frame and obtaining a current target to be tracked corresponding to the current video frame includes : Recognizing a face region based on normalized feature differences of pixels and obtaining a first recommended region in the current video frame; calculating a second recommended region corresponding to the first target to be tracked in the current video frame according to an optical flow analysis algorithm Obtaining the current target to be tracked according to the first recommendation area and the second recommendation area.

According to the non-transitory computer-readable storage medium described in item 20 of the scope of patent application, the step of obtaining the current target to be tracked according to the first recommended area and the second recommended area includes: Correlation is used to perform motion prediction to obtain an expected motion range, and the first recommendation region and the second recommendation region are filtered according to the expected motion range to obtain the current target to be tracked.

According to the non-transitory computer-readable storage medium according to any one of claims 16 to 21, the first human face is obtained by extracting facial features based on a deep neural network on the first target to be tracked. And storing the first face feature into a feature library corresponding to the first target to be tracked includes: performing facial feature extraction on the first target to be tracked to obtain the first feature through a deep neural network A vector; the second facial feature is obtained by extracting facial features based on a deep neural network for the current target to be tracked, and comparing the current target to be tracked with the second facial feature and the feature library The first target to be tracked is subjected to feature matching to track the first target to be tracked from the first video frame. The step includes: performing facial feature extraction on the current target to be tracked through a deep neural network to obtain a second Feature vector; calculating the Euclidean distance between the first feature vector and the second feature vector, and if the Euclidean distance is less than a preset threshold, determining the first target to be tracked and the current target to be tracked The target feature matches successfully.