JP2022112831A

JP2022112831A - Face tracking apparatus and program

Info

Publication number: JP2022112831A
Application number: JP2021008803A
Authority: JP
Inventors: 貴裕望月; Takahiro Mochizuki; 吉彦河合; Yoshihiko Kawai; 昌秀苗村; Masahide Naemura
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2022-08-03

Abstract

To provide a face tracking apparatus which improves accuracy by merging a tracking processing result with a face recognition processing result.
SOLUTION: A video analysis processing unit outputs information on motion prediction result of a face region by executing processing of motion prediction of the face region included in a video. A face recognition processing unit extracts a feature quantity of the face region included in the video, and executes face recognition processing of the face region on the basis of the feature quantity, to output first position information which is position information of the face region and a first pair of person identification information, which is information for identifying a person as a result of face recognition processing, and a recognition score corresponding to the person identification information. A tracking processing unit determines a track (tracking information) in a time direction of the face region. The track has second position information which is position information of the face region, and a second pair of person identification information of the face region and recognition score.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a face tracking device and a program.

Face recognition technology, which identifies a person by recognizing a face in an image, has become highly accurate owing to recent advances in AI (artificial intelligence) technology. In the security field, face recognition technology has reached the stage of practical use.

Patent Document 1 describes a face recognition technique. According to Patent Document 1, face recognition basically works by collecting face images for learning, calculating from those images facial feature amounts that distinguish one person from another, and recognizing faces based on the similarity of the facial feature amounts.

Efforts to efficiently attach metadata such as person identification information to video, by applying the image-based face recognition technology described above, are spreading mainly among broadcasters. When image processing is extended to video processing, in which frames are temporally connected, tracking of object regions is a useful process.

Patent Document 2 describes a procedure for using the result of tracking processing for face recognition.

Patent Document 3 employs tracking processing that uses template matching against images prepared in advance.

Various proposals have also been made for general object tracking techniques in video that do not restrict the target to human faces. Representative examples of general tracking processing are described in Patent Document 4 and Patent Document 5.

Patent Document 4 describes processing that tracks an object by establishing the correspondence between object regions detected in different frames according to the similarity of their feature amounts.

Patent Document 5 describes a statistical processing method called a particle filter. The tracking method described in Patent Document 5 attempts to improve tracking reliability by using the particle filter to weight the correspondences between regions.

In recent years, deep learning has been introduced into the field of tracking processing as well, and its use has greatly improved tracking accuracy.

Non-Patent Document 1 describes many tracking methods that use deep learning. Non-Patent Document 2, which is cited in Non-Patent Document 1, also describes tracking processing that uses deep learning.

Patent Document 1: JP 2017-033372 A
Patent Document 2: JP 2014-191479 A
Patent Document 3: JP 2014-021896 A
Patent Document 4: International Publication WO 2018/163243 A1
Patent Document 5: JP 2019-153112 A

Non-Patent Document 1: Gioele Ciaparrone, Francisco Luque Sanchez, Siham Tabik, Luigi Troiano, Roberto Tagliaferri, Francisco Herrera, Deep Learning in video multi-object tracking: A survey, arXiv:1907.12740, 2019, [retrieved January 1, 2021], Internet <URL: https://arxiv.org/pdf/1907.12740.pdf>
Non-Patent Document 2: Nicolai Wojke, Alex Bewley, Dietrich Paulus, Simple Online and Realtime Tracking with a Deep Association Metric, arXiv:1703.07402, 2017, [retrieved January 1, 2021], Internet <URL: https://arxiv.org/pdf/1703.07402.pdf>

However, each of these background techniques has problems.

The tracking processing described in Patent Documents 2 and 3 is merely ancillary to the face recognition processing, and the recognition results and the tracking are not semantically combined. The techniques of Patent Documents 2 and 3 therefore have the problem that a person in a video cannot easily be identified from the output of the face recognition processing.

The tracking techniques described in Non-Patent Documents 1 and 2 only obtain track information for the same object region. Non-Patent Documents 1 and 2 do not describe processing that fuses high-level semantic information between recognition results, such as face recognition results, and tracking results.

The conventional techniques also have the problem that they cannot cope with blurred regions in video or with scene changes. The present invention aims to achieve consistent face recognition throughout the section of a video in which a person appears.

The present invention has been made in view of the above circumstances, and aims to provide a face tracking device and a program that can give track information highly semantic information, such as a person's name, by merging the tracking processing result with the face recognition processing result.

[1] To solve the above problems, a face tracking device according to one aspect of the present invention includes: a video analysis processing unit that performs motion prediction processing on a face region included in a video and thereby outputs information on the motion prediction result for the face region; a face recognition processing unit that extracts a feature amount of the face region included in the video, performs face recognition processing on the face region based on the feature amount, and thereby outputs first position information, which is position information of the face region, and a first set, which is a set of person identification information identifying a person as a result of the face recognition processing and a recognition score corresponding to the person identification information; and a tracking processing unit that obtains a track of the face region in the time direction. The track has second position information, which is position information of the face region, and a second set, which is a set of person identification information and a recognition score for the face region. The tracking processing unit obtains the second position information after the face region has moved, based on the motion prediction result information passed from the video analysis processing unit, and updates the track based on the first set passed from the face recognition processing unit and the second set held by the track, and also based on the relationship between the first position information and the second position information, thereby giving the track new second position information and a new second set.

[2] In one aspect of the present invention, in the face tracking device described above, the face tracking device performs the processing by the video analysis processing unit more frequently than the processing by the face recognition processing unit, the video analysis processing unit accumulates the motion prediction result information obtained from a plurality of runs of the motion prediction processing, and the tracking processing unit performs its processing based on the accumulated motion prediction result information passed from the video analysis processing unit.

[3] In one aspect of the present invention, in the face tracking device described above, the video analysis processing unit further detects scene changes in the video, and the tracking processing unit resets the track to its initial state when a scene change is detected.

[4] In one aspect of the present invention, in the face tracking device described above, the video analysis processing unit further determines whether the face region is a blurred region, and, for a face region that is a blurred region, the tracking processing unit sets the recognition score to the lowest value (worst score value) when updating the track of that face region.

[5] In one aspect of the present invention, in the face tracking device described above, the face recognition processing unit is given access in advance to information that associates the feature amount of a face region that is a blurred region with special person identification information tied to that blurred-region feature amount, and the video analysis processing unit determines that a face region is a blurred region when the face recognition processing unit, as a result of performing face recognition processing on that face region, identifies the special person identification information associated with the blurred-region feature amount.

[6] In one aspect of the present invention, in the face tracking device described above, the tracking processing unit updates the track based on the degree of overlap between the first position information and the second position information.

[7] In one aspect of the present invention, the face tracking device described above further includes an output shaping processing unit that shapes and outputs data of a set of tracks, in a form in which information on sets of the person identification information and the recognition score is attached to the track information.

[8] In one aspect of the present invention, the face tracking device of [7] above further includes a search processing unit that acquires a search key that is any one of the person identification information, a person name character string associated with the person identification information, and a designation of a thumbnail image associated with the person identification information, and that searches the data of the set of tracks output by the output shaping processing unit based on the acquired search key, thereby identifying the tracks associated with specific person identification information.

[9] One aspect of the present invention is a program that causes a computer to function as a face tracking device including: a video analysis processing unit that performs motion prediction processing on a face region included in a video and thereby outputs information on the motion prediction result for the face region; a face recognition processing unit that extracts a feature amount of the face region included in the video, performs face recognition processing on the face region based on the feature amount, and thereby outputs first position information, which is position information of the face region, and a first set, which is a set of person identification information identifying a person as a result of the face recognition processing and a recognition score corresponding to the person identification information; and a tracking processing unit that obtains a track of the face region in the time direction. The track has second position information, which is position information of the face region, and a second set, which is a set of person identification information and a recognition score for the face region. The tracking processing unit obtains the second position information after the face region has moved, based on the motion prediction result information passed from the video analysis processing unit, and updates the track based on the first set passed from the face recognition processing unit and the second set held by the track, and also based on the relationship between the first position information and the second position information, thereby giving the track new second position information and a new second set.

According to the present invention, a track can be identified and face recognition can be performed in a new frame based on both the feature amount information of the face region as an image and the information of the track (the tracking of the face region across multiple frames), which also takes position, motion, and the like into account. As a result, improvements in both tracking accuracy and face recognition accuracy can be expected.

FIG. 1 is a block diagram showing the schematic functional configuration of a face tracking device according to one embodiment of the present invention.
FIG. 2 is a schematic diagram showing the temporal relationship between video analysis processing and tracking processing in the face tracking device according to the embodiment.
FIG. 3 is a schematic diagram showing a method of constructing a face database for blur determination processing integrated with face recognition processing in the embodiment.
FIG. 4 is a flowchart showing the procedure of tracking processing by the tracking processing unit in the embodiment.
FIG. 5 is a schematic diagram showing an example of the time transition of the recognition score for a certain track (a track of a face region) in the embodiment.
FIG. 6 is a schematic diagram showing an example of the data configuration obtained as a result of tracking processing by the face tracking device of the embodiment.
FIG. 7 is a schematic diagram showing an example of an information path for retrieving scenes of video content in which a person appears, starting from the person's name or a thumbnail image, according to the embodiment.
FIG. 8 is a schematic diagram showing an example of a GUI (graphical user interface) for the video scene search of FIG. 7 in the embodiment.
FIG. 9 is a block diagram showing an example of the internal configuration of the face tracking device according to the embodiment.

Next, an embodiment of the present invention will be described with reference to the drawings. The face tracking device of this embodiment realizes the following technical elements: recognition of the faces of persons who appear continuously over time; a robust face region tracking method that copes with occlusion and the temporary disappearance of a person; and a robust method that copes with changes in the video such as scene changes and motion blur. This makes it possible to give track information highly semantic information, such as a person's name, based on the face recognition result. The face tracking device of this embodiment also outputs recognition result data that facilitates person search over video, which makes it easier to develop application software that uses face recognition results. To that end, the face tracking device of this embodiment outputs its processing results in a data structure that integrates tracking processing and face recognition processing.

This embodiment tracks faces based on temporally consecutive images (in other words, frame images that precede and follow one another in time), taking their temporal relationship into account. It also integrates face tracking processing with face recognition processing. The face tracking device of this embodiment thereby uses information on face tracking (information on the position of a face region and on the motion prediction of the face region) to improve the accuracy of face recognition processing. Conversely, it uses information from face recognition processing (the similarity of the feature amounts of face regions, that is, the distance between feature amounts) to improve the accuracy of face tracking.

FIG. 1 is a block diagram showing the schematic functional configuration of the face tracking device according to this embodiment. As illustrated, the face tracking device 1 includes a video acquisition unit 10, a video analysis processing unit 20, a face database 30, a face recognition processing unit 40, a tracking processing unit 50, and an output shaping processing unit 60. Each of these functional units can be realized by, for example, a computer and a program. Each functional unit has storage means as required. The storage means is, for example, a program variable or memory allocated when the program is executed. Non-volatile storage means such as a magnetic hard disk drive or a solid state drive (SSD) may also be used as needed. At least part of the functions of each functional unit may be realized as a dedicated electronic circuit instead of a program.

The face tracking device 1 takes a video as input, recognizes the faces contained in the video, and outputs data that integrates the face recognition results with the video information. The data output by the face tracking device 1 has a predetermined data structure. The functions of the units that make up the face tracking device 1 are described below.

The video acquisition unit 10 acquires the video to be processed (the face tracking target) from outside. Here, "outside" may be another device or a recording medium on which data is recorded. The video is a sequence of frame images that follow one another at a predetermined frame rate (for example, 30 or 60 frames per second).

The video analysis processing unit 20 analyzes the video acquired by the video acquisition unit 10. Specifically, the video analysis processing unit 20 performs motion prediction processing (extraction of motion parameter values) for objects, persons, and the like between the frame images that make up the video, performs scene change analysis processing, and performs blur detection for each region within a frame image. The video analysis processing unit 20 supplies the results of the video analysis processing to the tracking processing unit 50.

In other words, the video analysis processing unit 20 performs motion prediction processing on a face region included in the video and thereby outputs information on the motion prediction result for the face region (information on the motion prediction model).

The face database 30 is a database that stores person IDs, which are information for identifying persons, in association with feature amounts of face images. The database itself is implemented using existing technology, and can use, for example, a magnetic disk device or a semiconductor storage device as its storage means. The feature amounts registered in the face database 30 are information obtained by machine learning processing or the like from a large number of collected face images. In other words, the face database 30 stores the information of a face recognition model.

In addition to the person ID, the face database 30 may also hold attribute information of the person (information such as a genre, for example "politician" or "actor").

The face recognition processing unit 40 detects face regions in the video acquired by the video acquisition unit 10, and outputs a face recognition result by matching the feature amount calculated from each face region against the feature amounts stored in the face database 30. As the result of face recognition processing, the face recognition processing unit 40 outputs a person ID, which is information identifying a person. Alternatively, the face recognition processing unit 40 may output a score for each person ID for a specific face region in the video. This score represents the likelihood that the face contained in that face region is the face of the person identified by that person ID. The face recognition processing itself can be performed using existing technology.

In other words, the face recognition processing unit 40 extracts the feature amount of a face region included in the video, performs face recognition processing on the face region based on that feature amount, and thereby outputs the position information of the face region (referred to as first position information) and a set (referred to as a first set) consisting of person identification information, which identifies a person as a result of the face recognition processing, and a recognition score corresponding to that person identification information.
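The matching step described above can be illustrated with a minimal sketch. The text only states that features are matched against the face database 30 and that a likelihood-like score is produced per person ID; the use of cosine similarity as the score, the dictionary layout of the database, and the names `cosine_similarity`, `recognize`, and `face_db` are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed score measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(face_feature, face_db):
    """Return (person_id, score) pairs, best match first.

    face_db is a hypothetical stand-in for the face database 30:
    a mapping from person ID to a registered feature vector.
    """
    pairs = [(pid, cosine_similarity(face_feature, feat))
             for pid, feat in face_db.items()]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs


# Toy database with two registered persons and 3-dimensional features.
face_db = {"person_A": [1.0, 0.0, 0.0], "person_B": [0.0, 1.0, 0.0]}
ranked = recognize([0.9, 0.1, 0.0], face_db)
best_id, best_score = ranked[0]
```

Returning the full ranked list rather than only the top entry mirrors the option, mentioned above, of outputting a score for each person ID.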

The tracking processing unit 50 uses the video analysis results from the video analysis processing unit 20 to track the face regions in the frame images output by the face recognition processing unit 40. The tracking processing unit 50 thereby obtains temporally continuous face region tracks. Here, "temporally continuous" means continuous between adjacent frame images. This improves the accuracy of face recognition.

That is, the tracking processing unit 50 obtains a track of the face region in the time direction. Here, a track is tracking information (trajectory information) data. A track has position information of the face region (referred to as second position information) and a set of person identification information and a recognition score for the face region (referred to as a second set). The tracking processing unit 50 obtains the second position information after the face region has moved, based on the motion prediction result information for the face region passed from the video analysis processing unit 20. It then updates the track based on the first set passed from the face recognition processing unit 40 and the second set that the track already holds, and also based on the relationship (degree of overlap) between the first position information and the second position information. The tracking processing unit 50 thereby gives the track new second position information and a new second set (a set of person identification information and a recognition score).
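The track update described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: position information is an axis-aligned box `(x1, y1, x2, y2)`, the motion prediction is a simple translation `(dx, dy)`, the degree of overlap is measured by intersection over union (IoU) with a hypothetical threshold of 0.5, and the second set is replaced only when the new recognition score is higher. The names `Track`, `iou`, and `update_track` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format


@dataclass
class Track:
    box: Box          # second position information
    person_id: str    # second set: person identification information
    score: float      # second set: recognition score


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (assumed overlap measure)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def update_track(track: Track, motion: Tuple[float, float],
                 det_box: Box, det_pair: Tuple[str, float],
                 iou_threshold: float = 0.5) -> bool:
    """Update one track from one recognized face region.

    Returns True when the detection was merged into the track.
    """
    dx, dy = motion
    x1, y1, x2, y2 = track.box
    predicted = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)  # box after predicted motion
    if iou(predicted, det_box) < iou_threshold:
        track.box = predicted          # no match: keep the predicted position
        return False
    track.box = det_box                # matched: adopt the detected position
    det_id, det_score = det_pair       # first set from face recognition
    if det_score > track.score:        # assumed rule: keep the better-scoring pair
        track.person_id, track.score = det_id, det_score
    return True


track = Track(box=(10, 10, 50, 50), person_id="person_A", score=0.6)
matched = update_track(track, (5, 0), (16, 10, 56, 50), ("person_A", 0.8))
```

In this sketch a detection that does not overlap the predicted box leaves the identity of the track untouched, so a single spurious recognition elsewhere in the frame cannot overwrite it.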

The output shaping processing unit 60 outputs the tracking result data shaped into a predetermined data structure. The data structure output by the output shaping processing unit 60 is easy to use from video interaction systems such as video search processing, and contains both face recognition result information and information about the video. This data structure allows functions that use the output data (application software and the like) to effectively implement operations such as searching with a person in the video as the key.

That is, the output shaping processing unit 60 shapes and outputs the tracking result data passed from the tracking processing unit 50. As one example, the output shaping processing unit 60 shapes and outputs data of a set of tracks, in a form in which information on sets of person identification information (person ID) and recognition score is attached to the track information. A track may have a plurality of sets of person identification information and recognition score.
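As a rough illustration of such shaped output, the sketch below serializes a set of tracks, each carrying one or more (person ID, recognition score) pairs, as JSON. The actual data structure is not specified in this passage; the JSON format and the field names `track_id`, `start_frame`, `end_frame`, and `candidates` are assumptions chosen for illustration.

```python
import json


def shape_tracks(tracks):
    """Shape a list of track dicts into a JSON string (hypothetical layout)."""
    shaped = []
    for track in tracks:
        # List the (person ID, score) pairs with the best score first.
        pairs = sorted(track["pairs"], key=lambda p: p[1], reverse=True)
        shaped.append({
            "track_id": track["track_id"],
            "start_frame": track["start_frame"],
            "end_frame": track["end_frame"],
            "candidates": [{"person_id": pid, "score": score}
                           for pid, score in pairs],
        })
    return json.dumps({"tracks": shaped}, indent=2)


tracks = [{
    "track_id": 0,
    "start_frame": 120,
    "end_frame": 310,
    "pairs": [("person_B", 0.31), ("person_A", 0.92)],
}]
shaped_json = shape_tracks(tracks)
```

Keeping the frame range next to the identity candidates is what lets a downstream search go from a person ID directly to the section of video in which that person appears.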

Through the processing of the video analysis processing unit 20, the tracking processing unit 50, and the output shaping processing unit 60, the face tracking device 1 produces effects that could not be achieved with the conventional techniques. Next, the processing of the video analysis processing unit 20, the tracking processing unit 50, and the output shaping processing unit 60 is described in more detail.

The video analysis processing unit 20 detects scene changes in the video, predicts global motion between frame images and local motion for each region, and determines whether a region is blurred. The results of this video analysis processing are supplied, together with the results from the face recognition processing unit 40, to the tracking processing unit 50 and the output shaping processing unit 60, and are used to improve face recognition accuracy.

映像解析処理部２０が行うシーンチェンジの検出は、映像内に映っている内容が大きく変わる切れ目を検出する処理である。シーンチェンジが生じるときには、追跡処理部５０が行う映像内の顔の追跡処理をリセットすることが必要となる。シーンチェンジの検出のために、映像解析処理部２０は、前後するフレーム画像間の相関度合いを計算する。シーンチェンジが生じていない状況においては、前後するフレーム画像間の相関度合いは所定の閾値以上のレベルを維持する。シーンチェンジが生じるときには、前後するフレーム画像間の相関度合いが上記閾値を大きく下回る。映像解析処理部２０は、上記の相関度合いがこの閾値を下回っているか否かに基づいて、シーンチェンジが生じたか否かを判定する。このシーンチェンジの検出自体は、既存技術を利用して行うことができる。時刻（ｔ－１）のフレーム画像と時刻ｔのフレーム画像との間（ここでのｔは整数値）でシーンチェンジが生じたか否かを表す情報を、映像解析処理部２０は出力する。映像解析処理部２０は、例えば変数is_sc_changeの値を出力することにより、シーンチェンジの検出結果を追跡処理部５０に伝える。is_sc_change＝１は「シーンチェンジ有り」を表す。is_sc_change＝０は「シーンチェンジ無し」を表す。 The scene change detection performed by the video analysis processing unit 20 is a process of detecting a break where the content shown in the video changes significantly. When a scene change occurs, it is necessary to reset the face tracking processing in the video performed by the tracking processing unit 50. To detect a scene change, the video analysis processing unit 20 calculates the degree of correlation between successive frame images. In a situation where there is no scene change, the degree of correlation between successive frame images maintains a level equal to or higher than a predetermined threshold. When a scene change occurs, the degree of correlation between successive frame images falls significantly below the threshold. The video analysis processing unit 20 determines whether a scene change has occurred based on whether the degree of correlation is below the threshold. This scene change detection itself can be performed using existing technology. The video analysis processing unit 20 outputs information indicating whether or not a scene change occurred between the frame image at time (t−1) and the frame image at time t (where t is an integer value). The video analysis processing unit 20 notifies the tracking processing unit 50 of the scene change detection result by, for example, outputting the value of the variable is_sc_change. is_sc_change=1 represents "there is a scene change". is_sc_change=0 represents "no scene change".
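
The threshold test described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the text only says that a "degree of correlation" is compared with a threshold, so the use of the Pearson correlation over grayscale pixel values, the function names, and the threshold value 0.5 are all assumptions made for the example.

```python
from math import sqrt


def frame_correlation(prev, curr):
    """Pearson correlation between two equally sized grayscale frames.

    Frames are flat lists of pixel intensities (an assumed representation).
    """
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    var_p = sum((p - mean_p) ** 2 for p in prev)
    var_c = sum((c - mean_c) ** 2 for c in curr)
    if var_p == 0 or var_c == 0:
        # A flat frame has no defined correlation; identical frames count as fully correlated.
        return 1.0 if prev == curr else 0.0
    return cov / sqrt(var_p * var_c)


def detect_scene_change(prev, curr, threshold=0.5):
    """Return is_sc_change: 1 if the correlation is below the threshold, else 0."""
    return 1 if frame_correlation(prev, curr) < threshold else 0
```

Two nearly identical frames yield is_sc_change=0, while frames whose intensities are unrelated or inverted yield is_sc_change=1.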

映像解析処理部２０が行う動き予測の処理は、前後のフレーム画像間で画像内に映っている物や人の動きの程度を予測するものである。映像解析処理部２０は、動きの程度の予測値を、追跡処理部５０に渡す。これにより、追跡処理部５０は、映像解析処理部２０から渡された予測値にしたがった動き補償処理を行い、追跡処理の精度を向上させることができる。 The motion prediction processing performed by the video analysis processing unit 20 predicts the degree of motion of an object or person appearing in the image between the preceding and succeeding frame images. The video analysis processing unit 20 passes the predicted value of the degree of motion to the tracking processing unit 50 . As a result, the tracking processing unit 50 can perform motion compensation processing according to the prediction value passed from the video analysis processing unit 20, thereby improving the accuracy of tracking processing.

なお、映像解析処理部２０は、動き予測の処理として、画面全体の動きであるグローバル動き（撮影用カメラのパン等）と、検出された顔領域に特化した動きであるローカル動きとの、２種類の動きを予測する。映像解析処理部２０は、グローバル動きの予測として、フレーム画像間の相関度に基づいて、画面全体のグローバル動きをモデル化したパラメーターを計算する。グローバル動きのモデルの代表的な例として、アフィンモデルがある。アフィンモデル自体は、既存技術によって処理可能なモデルである。グローバル動きについてのパラメーターは、globalMというマトリックスで表現される。このマトリックスglobalMを用いた動き補償は、下の式（１）を用いて計算可能である。 Note that, as motion prediction processing, the video analysis processing unit 20 predicts two types of motion: global motion, which is the motion of the entire screen (panning of the shooting camera, etc.), and local motion, which is the motion specific to a detected face region. As global motion prediction, the video analysis processing unit 20 calculates parameters that model the global motion of the entire screen based on the degree of correlation between frame images. An affine model is a typical example of a global motion model. The affine model itself is a model that can be processed by existing technology. Parameters for global motion are represented by the matrix globalM. Motion compensation using this matrix globalM can be calculated using equation (1) below.

ｐ（ｔ）＝globalM・ｐ（ｔ－１）・・・（１） p(t)=globalM・p(t−1) (1)

この式（１）において、ｐ（ｔ－１）は、時刻（ｔ－１）におけるフレーム画像上の位置を表す２次元座標（フレーム画像内の、水平方向座標および垂直方向座標）のベクトルである。また、ｐ（ｔ）は、同様に、時刻ｔにおけるフレーム画像上の位置を表す２次元座標のベクトルである。マトリックスglobalMは、時刻（ｔ－１）から時刻ｔまでの間のグローバル動きを表す２行２列のマトリックスである。 In this equation (1), p(t-1) is a vector of two-dimensional coordinates (horizontal and vertical coordinates within the frame image) representing the position on the frame image at time (t-1). Similarly, p(t) is a vector of two-dimensional coordinates representing the position on the frame image at time t. The matrix globalM is a 2-row, 2-column matrix representing the global motion from time (t−1) to time t.
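
Equation (1) amounts to a 2×2 matrix multiplied by a 2-D point, which can be sketched as below. The tuple-of-rows representation of the matrix and the function name are assumptions for illustration. Note that a 2×2 matrix on its own expresses rotation, scaling, and shear; the text does not state how a translation component would be carried, so none is modeled here.

```python
def apply_global_motion(global_m, p_prev):
    """Compute p(t) = globalM * p(t-1) for a 2x2 matrix and a 2-D point.

    global_m is ((a, b), (c, d)); p_prev is (x, y) with x horizontal and y vertical.
    """
    (a, b), (c, d) = global_m
    x, y = p_prev
    return (a * x + b * y, c * x + d * y)
```

For example, the identity matrix leaves a point unchanged, and a uniform scale of 2 maps (100, 50) to (200, 100).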

一方、ローカル動きの予測として、映像解析処理部２０は、対象の顔領域内での顔の動きを予測する。ローカル動きの予測のために、映像解析処理部２０は、カルマンフィルターの枠組みを用いる。ローカル動きの予測の処理手順は、例えば前記の非特許文献２にも開示されている。ローカル動きの予測のモデルは領域ごとに異なる。領域ｒのローカル動きのモデルをkf(r)と表現する。カルマンフィルターの予測関数kf(r).predを用いた、領域ｒの時刻（ｔ－１）から時刻ｔまでの動き補償は、下の式（２）を用いて計算可能である。 On the other hand, as prediction of local motion, the video analysis processing unit 20 predicts the motion of the face within the target face region. For local motion prediction, the video analysis processing unit 20 uses a Kalman filter framework. The processing procedure for local motion prediction is also disclosed in Non-Patent Document 2 mentioned above, for example. The model for local motion prediction differs from region to region. The local motion model of region r is denoted kf(r). Motion compensation for region r from time (t-1) to time t using the Kalman filter prediction function kf(r).pred can be calculated using equation (2) below.

ｒ（ｔ）＝kf(r).pred（ｒ（ｔ－１））・・・（２） r(t)=kf(r).pred(r(t-1)) (2)

本実施形態では、映像解析処理部２０は、上記のグローバル動き予測とローカル動き予測とを組み合わせて、従来手法よりも精密な動き予測を行う。グローバル動き予測とローカル動き予測とを組み合わせた時刻（ｔ－１）から時刻（ｔ）までの動き予測は、下の式（３）で表わされる。 In this embodiment, the video analysis processing unit 20 combines the above global motion prediction and local motion prediction to perform more precise motion prediction than the conventional method. Motion estimation from time (t−1) to time (t), which is a combination of global motion estimation and local motion estimation, is expressed by Equation (3) below.

ｒ（ｔ）＝integrateM（ｒ（ｔ－１））＝kf(r).pred（globalM・ｒ（ｔ－１））・・・（３） r(t)=integrateM(r(t-1))=kf(r).pred(globalM・r(t-1)) (3)

式（３）において、integrateMは、グローバル動き予測とローカル動き予測とを統合した動き予測モデルを表す。 In Equation (3), integrateM represents a motion prediction model that integrates global motion prediction and local motion prediction.
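
The composition in equation (3), global compensation first and local prediction second, can be sketched as follows. The class ConstantVelocityPredictor is a hypothetical stand-in for kf(r) that merely shifts a box by a fixed velocity; it is not a Kalman filter, and is used only so the example is self-contained. The (x1, y1, x2, y2) corner format for a region is likewise an assumption.

```python
class ConstantVelocityPredictor:
    """Hypothetical stand-in for kf(r): shifts a box by a fixed per-step velocity."""

    def __init__(self, vx, vy):
        self.vx = vx
        self.vy = vy

    def pred(self, bbox):
        x1, y1, x2, y2 = bbox
        return (x1 + self.vx, y1 + self.vy, x2 + self.vx, y2 + self.vy)


def apply_global(global_m, bbox):
    """Apply the 2x2 matrix globalM to both corners of a box (x1, y1, x2, y2)."""
    (a, b), (c, d) = global_m
    x1, y1, x2, y2 = bbox
    return (a * x1 + b * y1, c * x1 + d * y1, a * x2 + b * y2, c * x2 + d * y2)


def integrate_m(kf_r, global_m, bbox_prev):
    """r(t) = kf(r).pred(globalM * r(t-1)): global compensation, then local prediction."""
    return kf_r.pred(apply_global(global_m, bbox_prev))
```

With an identity globalM the result is the local prediction alone; with a non-identity globalM the local model operates on the globally compensated region.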

顔追跡装置１の全体的な処理負荷を適切にするため、比較的処理時間のかかる顔認識処理（顔認識処理部４０の処理）をローレートで行う場合がある。言い換えれば、顔追跡装置１は、動き予測の処理を相対的に高い頻度で行い、顔認識処理を相対的に低い頻度で行うことができる。仮に動き予測の処理の頻度を顔認識処理の頻度に合わせてローレートで行った場合には、動き予測の精度が低くなる可能性がある。しかしながら、本実施形態の顔追跡装置１は、上記の通り、顔認識処理を相対的ローレートで行いながら、動き予測処理を相対的ハイレートで行うことができる。これにより、顔認識処理のための装置への負荷を軽減しながら、動き予測の精度を高く維持することができる。言い換えれば、本実施形態の顔追跡装置１は、顔認識処理のレートと動き予測処理のレートとを異ならせることにより、処理コストを抑制しながら、動き予測の精度を維持することができる。 In order to make the overall processing load of the face tracking device 1 appropriate, face recognition processing (processing of the face recognition processing unit 40) that takes a relatively long processing time may be performed at a low rate. In other words, the face tracking device 1 can perform motion prediction processing with relatively high frequency and perform face recognition processing with relatively low frequency. If the frequency of motion prediction processing is performed at a low rate in accordance with the frequency of face recognition processing, there is a possibility that the precision of motion prediction will be low. However, as described above, the face tracking device 1 of the present embodiment can perform motion prediction processing at a relatively high rate while performing face recognition processing at a relatively low rate. As a result, it is possible to maintain high motion prediction accuracy while reducing the load on the apparatus for face recognition processing. In other words, the face tracking device 1 of the present embodiment can maintain motion prediction accuracy while suppressing processing costs by making the rate of face recognition processing different from the rate of motion prediction processing.

言い換えれば、顔追跡装置１は、映像解析処理部２０による映像解析処理を、顔認識処理部４０による顔認識処理よりも高頻度で行う。そして、映像解析処理部２０は、複数回の動き予測の処理の結果である動き予測結果の情報（動き予測モデルが持つ動き量）を蓄積する。そして、追跡処理部５０は、映像解析処理部２０から渡される蓄積された動き予測結果の情報に基づいて処理を行う。 In other words, the face tracking device 1 performs the image analysis processing by the image analysis processing unit 20 more frequently than the face recognition processing by the face recognition processing unit 40 . Then, the video analysis processing unit 20 accumulates motion prediction result information (amount of motion possessed by the motion prediction model) that is the result of motion prediction processing performed a plurality of times. Then, the tracking processing unit 50 performs processing based on the information on the accumulated motion prediction result passed from the video analysis processing unit 20 .

なお、追跡処理（追跡処理部５０の処理）やデータ出力のための処理（出力整形処理部６０の処理）のレートは、顔認識処理のレートに合わせられる。 Note that the rate of tracking processing (processing by the tracking processing unit 50) and processing for data output (processing by the output shaping processing unit 60) is matched with the rate of face recognition processing.

図２は、顔追跡装置１における、映像解析処理と追跡処理との時間的関係を示す概略図である。図示するように、顔追跡装置１が処理対象とする映像は、フレーム画像のシーケンスである。映像は、例えば、３０ＦＰＳあるいは６０ＦＰＳなどといったレートでのフレーム画像のシーケンスである。なお、「ＦＰＳ」は、「frames per second」（秒あたりフレーム枚数）の略である。映像解析処理部２０は、これらのフレーム画像の各々について、映像解析処理を行い、映像解析結果を出力する。一方、追跡処理部５０は、適宜定められる所定数の連続するフレーム画像を対象として、顔の追跡処理を行う。なお、追跡処理部５０が追跡処理を行う際には、映像解析処理部２０が出力する映像解析結果を利用する。 FIG. 2 is a schematic diagram showing the temporal relationship between video analysis processing and tracking processing in the face tracking device 1. As shown, the video to be processed by the face tracking device 1 is a sequence of frame images. A video is, for example, a sequence of frame images at a rate such as 30 FPS or 60 FPS. Note that "FPS" is an abbreviation for "frames per second". The video analysis processing unit 20 performs video analysis processing on each of these frame images, and outputs video analysis results. On the other hand, the tracking processing unit 50 performs face tracking processing on a predetermined number of consecutive frame images that is determined as appropriate. When the tracking processing unit 50 performs tracking processing, it uses the video analysis results output by the video analysis processing unit 20.

図２における映像解析処理は、前述の動き予測処理を含むものである。また、追跡処理は、顔認識処理と同じ頻度で、顔認識処理の結果を利用して実行されるものである。図２に示す時間的関係で処理を行う場合、１回の顔認識処理および追跡処理に、複数回の動き予測処理が対応する。このとき、顔追跡装置１は、映像解析処理部２０による動き予測処理で求められる動きパラメーターを蓄積しておくようにする。 The video analysis processing in FIG. 2 includes the aforementioned motion prediction processing. Further, the tracking process is executed using the results of the face recognition process with the same frequency as the face recognition process. When processing is performed in the temporal relationship shown in FIG. 2, multiple motion prediction processes correspond to one face recognition process and tracking process. At this time, the face tracking device 1 accumulates motion parameters obtained by motion prediction processing by the video analysis processing unit 20 .

つまり、映像解析処理部２０は、時刻（ｔ－１）におけるフレーム画像と時刻ｔにおけるフレーム画像とに基づいて、時刻ｔにおける映像解析結果を求め、出力する。以下、同様に、映像解析処理部２０は、時間的に隣接する２つのフレーム画像に基づいて、映像解析結果を求める。つまり、映像解析処理部２０は、第１の映像解析処理の結果として動き予測パラメーターＭ１（グローバル動きの予測パラメーター）を出力する。以下、同様に、映像解析処理部２０は、第ｋ（ｋは正整数）の映像解析処理の結果として動き予測パラメーターＭｋ（グローバル動きの予測パラメーター）を出力する。映像解析処理部２０は、この各回の動き予測パラメーターＭｋ（ｋ＝１，２，・・・）を蓄積して、追跡処理用のグローバル動き予測パラメーターGlobalMを計算する。また、映像解析処理部２０は、それぞれの顔領域ｒについてのローカル動きの予測モデルも、同様に蓄積し、追跡処理用のローカル動き予測パラメーターを計算する。映像解析処理部２０は、これらの蓄積された動き予測パラメーターを、追跡処理のタイミングで、追跡処理部５０に渡す。 That is, the video analysis processing unit 20 obtains and outputs the video analysis result at time t based on the frame image at time (t−1) and the frame image at time t. Thereafter, similarly, the video analysis processing unit 20 obtains video analysis results based on two temporally adjacent frame images. That is, the video analysis processing unit 20 outputs the motion prediction parameter M1 (global motion prediction parameter) as a result of the first video analysis processing. Thereafter, similarly, the video analysis processing unit 20 outputs a motion prediction parameter Mk (global motion prediction parameter) as a result of the k-th (k is a positive integer) video analysis processing. The video analysis processing unit 20 accumulates the motion prediction parameters Mk (k=1, 2, . . . ) for each time and calculates a global motion prediction parameter GlobalM for tracking processing. The video analysis processing unit 20 also accumulates a local motion prediction model for each face region r, and calculates local motion prediction parameters for tracking processing. The video analysis processing unit 20 passes these accumulated motion prediction parameters to the tracking processing unit 50 at the timing of tracking processing.
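
One way to picture the accumulation of the per-frame parameters Mk is shown below. The text does not say how the accumulated parameter is computed from the Mk; the sketch assumes that successive 2×2 motions are composed by matrix multiplication, with the newest motion applied last. The function names are illustrative.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as tuples of rows."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def accumulate_global_motion(per_frame_ms):
    """Compose M1, M2, ..., Mk into one matrix (assumed rule: Mk * ... * M2 * M1)."""
    acc = ((1, 0), (0, 1))  # identity: no motion accumulated yet
    for m_k in per_frame_ms:
        acc = matmul2(m_k, acc)  # the newest motion is applied after the earlier ones
    return acc
```

Under this assumption, applying the accumulated matrix once at the tracking timing gives the same result as applying each Mk frame by frame.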

また、映像解析処理部２０は、シーンチェンジ検出の処理も、同様に、ハイレートの頻度で行うことができる。この場合、第１の映像解析処理の結果として、is_sc_change_1（値は、前述の通り、０（シーンチェンジ無し）または１（シーンチェンジ有り））を求める。以下同様に、映像解析処理部２０は、第ｋ（ｋは正整数）の映像解析処理の結果としてis_sc_change_kを求める。そして、映像解析処理部２０は、各回の映像解析処理の結果としてのシーンチェンジの有無の判定結果is_sc_change_k（ｋ＝１，２，・・・）に基づいて、追跡処理用のis_sc_changeを求める。なお、ｋ＝１，２，・・・の中のいずれかに１つにおいてis_sc_change_k＝１ならば、映像解析処理部２０は、is_sc_change＝１とする。また、ｋ＝１，２，・・・の中のすべてにおいてis_sc_change_k＝０ならば、映像解析処理部２０は、is_sc_change＝０とする。映像解析処理部２０は、蓄積されたシーンチェンジ判定結果であるis_sc_changeの値を、追跡処理のタイミングで、追跡処理部５０に渡す。なお、is_sc_change＝１の場合には、追跡処理部５０は、動き予測パラメーターをリセットする。一旦シーンチェンジが検出されて動き予測パラメーターがリセットされた場合には、映像解析処理部２０は、次の顔認識処理（および追跡処理）のタイミングまで、すべての動き予測処理を中止するようにしてもよい。 Similarly, the video analysis processing unit 20 can also perform scene change detection processing at a high rate. In this case, is_sc_change_1 (whose value is, as described above, 0 (no scene change) or 1 (scene change present)) is obtained as a result of the first video analysis processing. Similarly, the video analysis processing unit 20 obtains is_sc_change_k as the result of the k-th (k is a positive integer) video analysis processing. Then, the video analysis processing unit 20 obtains is_sc_change for tracking processing based on the scene change determination results is_sc_change_k (k=1, 2, . . . ) of the individual video analysis processes. Note that if is_sc_change_k=1 for any one of k=1, 2, . . . , the video analysis processing unit 20 sets is_sc_change=1. Also, if is_sc_change_k=0 for all k=1, 2, . . . , the video analysis processing unit 20 sets is_sc_change=0. The video analysis processing unit 20 passes the value of is_sc_change, which is the accumulated scene change determination result, to the tracking processing unit 50 at the timing of tracking processing. Note that when is_sc_change=1, the tracking processing unit 50 resets the motion prediction parameters. Once a scene change is detected and the motion prediction parameters are reset, the video analysis processing unit 20 may suspend all motion prediction processing until the timing of the next face recognition processing (and tracking processing).
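
The aggregation rule for the per-frame flags is a logical OR, which can be sketched as follows. The function name and the list representation of the flags is_sc_change_k are assumptions for illustration.

```python
def aggregate_scene_change(flags):
    """Combine per-frame flags is_sc_change_k (each 0 or 1) into is_sc_change.

    The result is 1 if any flag is 1, and 0 if every flag is 0.
    """
    return 1 if any(flag == 1 for flag in flags) else 0
```

A single detected scene change among the high-rate analyses is therefore enough to make the tracking processing reset.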

映像解析処理部２０が行うボケ判定は、処理対象の映像内の顔領域ごとに、その領域がボケの領域であるか否かを判定するものである。なお、映像解析処理部２０は、顔領域がボケている場合にも、顔領域の検出を行う。ボケ領域の映像に基づいて顔認識処理を行うと、認識精度が悪くなる。つまり、ボケ顔画像を対象として顔認識処理を行っても、認識精度が低く、正しくない人物を認識してしまう可能性が高い。これは、顔追跡装置１全体の顔認識精度の低下を招いてしまう。そのような精度劣化要因を避けるため、本実施形態の顔追跡装置１は、領域ごとのボケ判定（ボケている領域であるか否かの判定）を行い、ボケ領域である場合には顔認識処理を行わない。ただし、顔追跡装置１は、顔認識処理を行わない場合にも、顔領域の検出結果の情報については、追跡処理部５０への引き渡しを行う。 The blur determination performed by the video analysis processing unit 20 determines, for each face area in the video to be processed, whether or not that area is a blurred area. Note that the video analysis processing unit 20 detects the face area even when the face area is blurred. If face recognition processing is performed based on an image of a blurred area, the recognition accuracy is degraded. That is, even if face recognition processing is performed on a blurred face image, recognition accuracy is low, and there is a high possibility that an incorrect person will be recognized. This leads to deterioration of the face recognition accuracy of the face tracking device 1 as a whole. In order to avoid such a cause of accuracy deterioration, the face tracking device 1 of the present embodiment performs blur determination (determination of whether or not the area is blurred) for each area, and does not perform face recognition processing on an area that is blurred. However, even when face recognition processing is not performed, the face tracking device 1 passes the information of the face area detection result to the tracking processing unit 50.

映像解析処理部２０は、顔領域ｒがボケ領域であるか否かを判定し、判定結果を変数is_blur(r)の値として設定する。is_blur(r)＝１は、領域ｒがボケ画像であると判定されたことを表す。is_blur(r)＝０は、領域ｒがボケ画像ではないと判定されたことを表す。映像解析処理部２０は、それぞれの顔領域についてボケ領域であるか否かの判定を行う。 The video analysis processing unit 20 determines whether or not the face area r is a blurred area, and sets the determination result as the value of the variable is_blur(r). is_blur(r)=1 indicates that the region r is determined to be a blurred image. is_blur(r)=0 indicates that the region r is determined not to be a blurred image. The image analysis processing unit 20 determines whether or not each face area is a blurred area.

映像解析処理部２０は、一例として、顔領域ｒの画像が含むエッジ成分を解析することによってボケの有無を判定する。この方法自体は、従来技術を用いて実現可能である。具体的には、映像解析処理部２０は、顔領域ｒ内のエッジ成分を検出し、そのエッジ成分のシャープさの度合いが所定レベル以下である（その部分の画像が高い周波数成分を持たない）場合に「ボケ有り」（is_blur(r)＝１）と判定する。また、映像解析処理部２０は、顔領域ｒ内のエッジ成分のシャープさの度合いが所定レベル以上である（その部分の画像が高い周波数成分を持つ）場合に「ボケ無し」（is_blur(r)＝０）と判定する。 As an example, the video analysis processing unit 20 determines the presence or absence of blur by analyzing edge components included in the image of the face region r. The method itself can be implemented using conventional techniques. Specifically, the video analysis processing unit 20 detects edge components in the face region r, and determines "blur present" (is_blur(r)=1) when the degree of sharpness of the edge components is equal to or lower than a predetermined level (the image of that portion does not have high frequency components). In addition, the video analysis processing unit 20 determines "no blur" (is_blur(r)=0) when the degree of sharpness of the edge components in the face region r is equal to or higher than the predetermined level (the image of that portion has high frequency components).
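
An edge-based blur test of this kind can be sketched as below. The text does not name a specific sharpness measure; this example assumes the variance of a 4-neighbour Laplacian response, a commonly used focus measure, and the threshold value 100.0 is a hypothetical placeholder. The face region is assumed to be a 2-D list of grayscale intensities.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2-D list."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0  # region too small to have interior pixels
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)


def is_blur(gray, sharpness_threshold=100.0):
    """Return 1 ("blur present") if sharpness is at or below the threshold, else 0."""
    return 1 if laplacian_variance(gray) <= sharpness_threshold else 0
```

A uniformly flat region has no edge response and is judged blurred, while a high-contrast checkerboard pattern produces a large variance and is judged sharp.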

映像解析処理部２０は、別の方法でボケ画像の判定を行うようにしてもよい。この手法では、映像解析処理部２０は、顔認識処理部４０が顔認識処理を行う際に同時にボケ画像の有無の判定を行う。即ち、この手法を用いる場合、顔データベース３０は、多数のボケ顔画像から抽出した画像の特徴量を、ボケ画像の人物ＩＤに関連付けて、予め保持しておく。これにより、顔の認識処理の対象の顔領域ｒの特徴量を、顔データベース３０が持つ特徴量と照合したときに、当該顔領域ｒがボケ領域であることを表す。つまり、映像解析処理部２０によるボケ画像の有無の判定と、顔認識処理部４０による顔認識の処理とを一体の処理として行うことができる。つまり、顔追跡装置１全体の計算量を削減することが可能となる。 The video analysis processing unit 20 may determine a blurred image by another method. In this method, the video analysis processing unit 20 determines whether or not there is a blurred image at the same time as the face recognition processing unit 40 performs face recognition processing. That is, when using this method, the face database 30 stores in advance the feature amounts extracted from a large number of blurred face images in association with a person ID representing a blurred image. As a result, when the feature amount of the face area r to be subjected to face recognition processing is collated with the feature amounts held in the face database 30, a match with the person ID of the blurred image indicates that the face area r is a blurred area. That is, the determination of the presence or absence of a blurred image by the video analysis processing unit 20 and the face recognition processing by the face recognition processing unit 40 can be performed as an integrated process. This makes it possible to reduce the amount of calculation of the face tracking device 1 as a whole.

図３は、顔認識処理と一体化したボケ判定処理のための、顔データベースの構築方法を模式的に示す概略図である。図示するように、本手法では、多数のボケ顔の画像を準備する。そして、顔追跡装置１は、これらのボケ顔の画像の特徴量を抽出する。画像の特徴量を抽出する処理は、顔認識処理部４０が持つ機能を用いて行うことができる。そして、顔追跡装置１は、得られたボケ顔特徴量のデータを、ボケ顔を表す人物ＩＤに関連付けて、顔データベース３０に登録する。なお、不特定多数の人のボケ顔に対して、ボケ顔を表す人物ＩＤを１つ与えれば十分である。 FIG. 3 is a schematic diagram schematically showing a method of constructing a face database for blur determination processing integrated with face recognition processing. As illustrated, in this method, a large number of blurry face images are prepared. Then, the face tracking device 1 extracts feature amounts of these blurry face images. The process of extracting the feature amount of the image can be performed using the function of the face recognition processing section 40 . Then, the face tracking device 1 registers the obtained blurred face feature amount data in the face database 30 in association with the person ID representing the blurred face. Note that it is sufficient to give one person ID representing a blurred face to an unspecified number of people with blurred faces.

つまり、映像解析処理部２０は、顔領域がボケ領域であるか否かの判定を行う。追跡処理部５０は、その判定結果に基づき、ボケ領域である顔領域に関しては、顔領域のトラックを更新する際には認識スコアの値を最低値（最悪スコア値）とする。ボケ判定の一つの手法は、次の通りである。顔認識処理部４０は、顔領域がボケ領域である場合の特徴量と、当該ボケ領域の特徴量に関連付けられた特殊な人物ＩＤ識別情報と、を関連付けた情報（顔データベース３０等）にアクセス可能と予めしておく。映像解析処理部２０は、顔認識処理部４０が顔領域についての顔認識処理を行った結果としてボケ領域の特徴量に関連付けられた特殊な人物ＩＤを同定した場合に、当該顔領域がボケ領域であると判定する、ようにできる。 That is, the video analysis processing unit 20 determines whether or not the face area is a blurred area. Based on the determination result, for a face area that is a blurred area, the tracking processing unit 50 sets the recognition score to the lowest value (worst score value) when updating the track of the face area. One method of blur determination is as follows. The face recognition processing unit 40 is given access in advance to information (such as the face database 30) that associates the feature amount of a face area that is a blurred area with special person identification information associated with that feature amount. When the face recognition processing unit 40 identifies, as a result of performing face recognition processing on a face area, the special person ID associated with the feature amount of a blurred area, the video analysis processing unit 20 can determine that the face area is a blurred area.

追跡処理部５０による追跡処理の詳細は、下記の通りである。追跡処理部５０は、顔認識処理部４０が出力する結果と、映像解析処理部２０が出力する結果とを用いて、時間軸上での対応関係から、同一人物の顔領域のトラックを求める処理を行う。追跡処理部５０のこの処理によって、同一人物の顔領域のトラックが映像中で同定できれば、たとえある時刻のフレーム画像において顔認識処理結果の信頼性が低くても、同一トラックに属する他の時刻のフレーム画像で信頼性の高い顔認識処理結果を得られれば、置き換えにより、結果として精度の高い顔認識を実現することができる。 The details of the tracking processing by the tracking processing unit 50 are as follows. The tracking processing unit 50 uses the result output by the face recognition processing unit 40 and the result output by the video analysis processing unit 20 to obtain, from the correspondence on the time axis, the track of the face region of the same person. If this processing allows the track of the face region of the same person to be identified in the video, then even if the face recognition result in the frame image at a certain time has low reliability, it can be replaced by a highly reliable face recognition result obtained from a frame image at another time belonging to the same track, so that highly accurate face recognition is achieved as a result.

図４は、追跡処理部５０による追跡処理の手順を示すフローチャートである。以下、このフローチャートに沿って説明する。なお、このフローチャートは、ある１時点での追跡処理の手順を示すものである。つまり、追跡処理部５０は、追跡処理を行うタイミングごとに、図４に示す処理を実行する。 FIG. 4 is a flowchart showing the procedure of tracking processing by the tracking processing unit 50. Description will be made below along this flowchart. It should be noted that this flowchart shows the procedure of the tracking processing at a single point in time. That is, the tracking processing unit 50 executes the processing shown in FIG. 4 each time the tracking processing is performed.

時刻ｔにおける追跡処理部５０の処理の前提となるデータは次の通りである。追跡処理部５０は、顔認識処理部４０と映像解析処理部２０からの出力信号（データ）を取得する。顔認識処理部４０から追跡処理部５０へは、検出した顔領域（フレーム画像内の位置情報）や、顔認識処理の結果である認識スコアや、認識結果である人物ＩＤや、上記顔領域の特徴量が渡される。映像解析処理部２０から追跡処理部５０へは、シーンチェンジ検出信号is_sc_changeや、顔領域の統合動き予測パラメーターintegrateMが渡される。 Data on which the processing of the tracking processing unit 50 at time t is based is as follows. The tracking processing unit 50 acquires output signals (data) from the face recognition processing unit 40 and the video analysis processing unit 20. From the face recognition processing unit 40 to the tracking processing unit 50, the detected face area (position information in the frame image), the recognition score that is the result of the face recognition processing, the person ID that is the recognition result, and the feature amount of the face area are passed. A scene change detection signal is_sc_change and the integrated motion prediction parameter integrateM of the face area are passed from the video analysis processing unit 20 to the tracking processing unit 50.

また、追跡処理部５０は、前回の追跡処理時、即ち時刻（ｔ－１）において追跡できているトラックの集合を保持している。時刻（ｔ－１）におけるトラックの集合は、下の式（４）の通りである。なお、式（４）においてn_Trは、当該トラックの集合に含まれるトラックの数である。 Further, the tracking processing unit 50 holds a set of tracks that were tracked at the time of the previous tracking processing, that is, at time (t-1). The set of tracks at time (t-1) is given by equation (4) below. Note that n_Tr in equation (4) is the number of tracks included in the set of tracks.

Tr(t-1) ＝ [tr[0](t-1), tr[1](t-1), …, tr[n_Tr-1](t-1)] ・・・（４） Tr(t-1) = [tr[0](t-1), tr[1](t-1), …, tr[n_Tr-1](t-1)] (4)

なお、各トラックtr[k](t-1)（０≦ｋ≦n_Tr－１）は、データの構造体であり、その構造体の要素は下記のbbox、feat、recog_ids、recog_scoresを含む。
bboxは、時刻（ｔ－１）におけるトラックに対応する顔領域の矩形領域情報（フレーム画像内の位置情報）である。
featは、時刻（ｔ－１）のトラックにおける顔領域の特徴量である。
recog_idsは、時刻（ｔ－１）の当該トラックに対応する顔認識結果の人物ＩＤの候補である。人物ＩＤの候補が複数であってもよい。複数の人物ＩＤの候補は、例えば、スコアの高い順に並べられている。
recog_scoresは、時刻（ｔ－１）の当該トラックにおける上記人物ＩＤ候補に対応するスコアの値である。上記の人物ＩＤが並べられている順と同じ順（例えばスコアの高い順）にソートされている。 Each track tr[k](t-1) (0≤k≤n_Tr-1) is a data structure, and the elements of the structure include the following bbox, feat, recog_ids, and recog_scores.
bbox is rectangular area information (position information within the frame image) of the face area corresponding to the track at time (t−1).
feat is the feature amount of the face area in the track at time (t-1).
recog_ids is a person ID candidate of the face recognition result corresponding to the track at time (t−1). There may be a plurality of person ID candidates. The plurality of person ID candidates are arranged in descending order of score, for example.
recog_scores is a score value corresponding to the person ID candidate in the track at time (t-1). They are sorted in the same order as the person IDs listed above (for example, in descending order of score).
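
The track structure described above can be sketched as a small data class. The field names bbox, feat, recog_ids, and recog_scores come from the text; the concrete types (a 4-tuple box, a list of floats for the feature, string person IDs) and the helper make_track are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Track:
    bbox: Tuple[float, float, float, float]  # rectangle of the face area in the frame
    feat: List[float]                        # feature amount of the face area
    recog_ids: List[str] = field(default_factory=list)       # person ID candidates
    recog_scores: List[float] = field(default_factory=list)  # scores, same order as recog_ids


def make_track(bbox, feat, candidates):
    """Build a Track from (person_id, score) pairs, sorted by descending score."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return Track(
        bbox=bbox,
        feat=feat,
        recog_ids=[pid for pid, _ in ranked],
        recog_scores=[score for _, score in ranked],
    )
```

A track set Tr(t-1) is then simply a list of such objects, and tr[k].bbox reads the rectangle of the k-th track.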

構造体の要素にアクセスする際の記法は、次の通りである。例えば、時刻（ｔ－１）におけるトラック集合全体についてのbbox（矩形領域情報）は、Tr(t-1).bboxと表わされる。例えば、時刻（ｔ－１）における個別トラックについてのbboxは、tr[k](t-1).bboxと表わされる。但し、ｋは、トラックの集合内における個別トラックの指標値であり、０≦ｋ≦n_Tr－１である。 The notation for accessing the elements of the structure is as follows. For example, bbox (rectangular area information) for the entire set of tracks at time (t-1) is expressed as Tr(t-1).bbox. For example, the bbox for an individual track at time (t-1) is represented as tr[k](t-1).bbox. where k is the index value of an individual track within the set of tracks, and 0≤k≤n_Tr-1.

なお、追跡処理部５０の初回の処理の前（即ち、時刻０のとき）には、トラック集合の初期化処理が行われる。つまり、Tr(0)＝[]（空集合）とする初期化が行われる。なお、シーンチェンジによってトラックをリセットする場合にも、トラック集合は空集合に再設定される。 Before the initial processing of the tracking processing unit 50 (that is, at time 0), initialization processing of the track set is performed. That is, initialization is performed with Tr(0)=[] (empty set). Note that the track set is reset to an empty set even when the track is reset due to a scene change.

また、追跡処理部５０が顔認識処理部４０から受け取る時刻ｔの顔検出領域および顔認識結果の情報は、下の式（５）の通りである。なお、式（５）において、n_Detは、時刻ｔの顔認識処理における顔検出領域の数（即ち、顔認識の数）である。 The face detection area and face recognition result information at time t received by the tracking processing unit 50 from the face recognition processing unit 40 are given by the following equation (5). In Expression (5), n_Det is the number of face detection regions (that is, the number of face recognitions) in face recognition processing at time t.

Det(t) ＝ [det[0](t), det[1](t), …, det[n_Det-1](t)] ・・・（５） Det(t) = [det[0](t), det[1](t), …, det[n_Det-1](t)] (5)

トラックの場合と同様に、各顔検出領域det[k](t)（０≦ｋ≦n_Det－１）は、データの構造体であり、その構造体の要素は下記のbbox、feat、recog_ids、recog_scoresを含む。
bboxは、時刻ｔにおける顔検出領域の矩形領域情報（フレーム画像内の位置情報）である。
featは、時刻ｔにおける上記顔検出領域の特徴量である。
recog_idsは、時刻ｔの当該顔検出領域に対応する顔認識結果の人物ＩＤの候補である。人物ＩＤの候補が複数であってもよい。複数の人物ＩＤの候補は、例えば、スコアの高い順に並べられている。
recog_scoresは、時刻ｔの当該顔検出領域における上記人物ＩＤ候補に対応するスコアの値である。上記の人物ＩＤが並べられている順と同じ順（例えばスコアの高い順）にソートされている。 As in the case of tracks, each face detection area det[k](t) (0≦k≦n_Det−1) is a data structure, and the elements of the structure include the following bbox, feat, recog_ids, and recog_scores.
bbox is rectangular area information (position information within the frame image) of the face detection area at time t.
feat is the feature amount of the face detection area at time t.
recog_ids is a person ID candidate of the face recognition result corresponding to the face detection area at time t. There may be a plurality of person ID candidates. The plurality of person ID candidates are arranged in descending order of score, for example.
recog_scores is a score value corresponding to the person ID candidate in the face detection area at time t. They are sorted in the same order as the person IDs listed above (for example, in descending order of score).

構造体の要素にアクセスする際の記法は、次の通りである。例えば、時刻ｔおける顔検出領域の集合全体についてのbbox（矩形領域情報）は、Det(t).bboxと表わされる。例えば、時刻ｔにおける個別の顔検出領域についてのbboxは、det[k](t).bboxと表わされる。但し、ｋは、顔検出領域の集合内における個別の顔検出領域の指標値であり、０≦ｋ≦n_Det－１である。 The notation for accessing the elements of the structure is as follows. For example, bbox (rectangular area information) for the entire set of face detection areas at time t is expressed as Det(t).bbox. For example, the bbox for an individual face detection area at time t is expressed as det[k](t).bbox. Here, k is the index value of an individual face detection area within the set of face detection areas, and 0≦k≦n_Det−1.

追跡処理部５０が扱うデータは、さらに、矩形領域間の重なり度合いを表すマトリックスと、特徴量間の距離を表すマトリックスとを含む。 The data handled by the tracking processing unit 50 further includes a matrix representing the degree of overlap between rectangular regions and a matrix representing the distance between feature quantities.

矩形領域間の重なり度合いを表すマトリックスは、ＩｏＵ（bboxA,bboxB）と表わされる。ＩｏＵは、「Intersection over Union」（領域同士の和集合に対する、当該領域同士の積集合の比率）を表す。追跡処理部５０は、トラックが持つ矩形領域（上記のbboxA）と、顔検出領域が持つ矩形領域（上記のbboxB）との重なり度合いを求め、追跡処理に利用する。このマトリックスは、n_bboxA×n_bboxBのサイズを持つ２次元のマトリックスである。ただし、n_bboxAはbboxAの領域の数であり、n_bboxBはbboxBの領域の数である。このマトリックスの要素の値が１のとき、対応する領域が完全に一致する。要素の値が正の値（０以上且つ１以下）の場合には、値が小さいほど、対応する領域間の重なり度合いが小さくなっていく。要素の値が負の値である場合には、対応する領域は重なる部分を全く持たず、その値の絶対値が大きいほど、両領域間の距離は離れている。 A matrix representing the degree of overlap between rectangular areas is expressed as IoU (bboxA, bboxB). IoU represents "Intersection over Union" (the ratio of the intersection of the regions to the union of the regions). The tracking processing unit 50 obtains the degree of overlap between the rectangular area of the track (bboxA above) and the rectangular area of the face detection area (bboxB above), and uses it for tracking processing. This matrix is a two-dimensional matrix with size n_bboxA×n_bboxB. However, n_bboxA is the number of regions in bboxA, and n_bboxB is the number of regions in bboxB. When the value of an element of this matrix is 1, the corresponding regions are perfectly matched. When the element value is a positive value (0 or more and 1 or less), the smaller the value, the smaller the degree of overlap between corresponding regions. If the value of the element is a negative value, the corresponding regions do not overlap at all, and the greater the absolute value of the value, the greater the distance between the two regions.
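
The overlap matrix can be sketched with the standard Intersection over Union, as below. This example assumes boxes in (x1, y1, x2, y2) corner format and returns 0.0 for boxes that do not overlap; the signed extension described in the text, in which negative values encode the distance between non-overlapping regions, is not specified in enough detail to reproduce and is deliberately left out.

```python
def iou(box_a, box_b):
    """Standard IoU of two boxes (x1, y1, x2, y2); 0.0 when they do not overlap."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = min(ax2, bx2) - max(ax1, bx1)
    inter_h = min(ay2, by2) - max(ay1, by1)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def iou_matrix(bboxes_a, bboxes_b):
    """n_bboxA x n_bboxB matrix; rows index track boxes, columns index detection boxes."""
    return [[iou(a, b) for b in bboxes_b] for a in bboxes_a]
```

Calling iou_matrix with the track rectangles and the detection rectangles gives one row per track and one column per detection.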

特徴量間の距離を表すマトリックスは、Ｄｉｓ（FeatA,FeatB）と表わされる。追跡処理部５０は、トラックが持つ矩形領域の特徴量（上記のFeatA）と、顔検出領域が持つ矩形領域の特徴量（上記のFeatB）との距離を求め、追跡処理に利用する。このマトリックスは、n_FeatA×n_FeatBのサイズを持つ２次元のマトリックスである。ただし、n_FeatAはFeatAの特徴量の数（トラックが持つ矩形領域の数と同一）であり、n_FeatBはFeatBの特徴量の数（顔検出領域が持つ矩形領域の数と同一）である。距離の値は、０または正の値である。２つの特徴量が同一である場合には、当該２つの特徴量間の距離は０である。２つの特徴量が似ていない度合いが高いほど、当該特徴量間の距離は大きくなる。 A matrix representing the distance between features is represented as Dis (FeatA, FeatB). The tracking processing unit 50 obtains the distance between the feature amount of the rectangular area of the track (FeatA described above) and the feature amount of the rectangular area of the face detection area (FeatB described above), and uses it for tracking processing. This matrix is a two-dimensional matrix with size n_FeatA×n_FeatB. However, n_FeatA is the number of feature amounts of FeatA (same as the number of rectangular areas held by the track), and n_FeatB is the number of feature amounts of FeatB (same as the number of rectangular areas held by the face detection area). The distance value is 0 or a positive value. If two features are the same, the distance between the two features is zero. The higher the degree of dissimilarity between two features, the greater the distance between the features.
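
A distance matrix with these properties (zero for identical features, larger for less similar features) can be sketched as follows. The text does not name the distance metric; the Euclidean distance used here is an assumption, as are the function name and the list-of-floats representation of a feature.

```python
from math import sqrt


def dis_matrix(feats_a, feats_b):
    """n_FeatA x n_FeatB matrix of Euclidean distances between feature vectors."""
    return [
        [sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb))) for fb in feats_b]
        for fa in feats_a
    ]
```

Rows correspond to track features and columns to detection features, matching the layout of the overlap matrix.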

The tracking processing unit 50 performs the tracking processing based on the above data. In other words, the tracking processing performed by the tracking processing unit 50 reduces to the problem of associating the track set Tr(t-1) at time (t-1) with the set of face detection areas Det(t) at time t. The tracking processing unit 50 performs the processing shown in FIG. 4 once for each time step (for example, from time (t-1) to time t), and by repeating this, successively updates the data of the track set Tr(t).

The processing procedure of the tracking processing unit 50 at time t is as follows. First, in step S11, the tracking processing unit 50 determines, based on information passed from the video analysis processing unit 20, whether there has been a scene change since the previous tracking processing. The presence or absence of a scene change is indicated by the value of the variable is_sc_change. When there has been a scene change, that is, when is_sc_change is 1 (step S11: YES), the process jumps to step S21. When there has been no scene change, that is, when is_sc_change is 0 (step S11: NO), the track set at time (t-1) is taken as the current track set (existTr = Tr(t-1)), and the process proceeds to step S12.

次に、ステップＳ１２において、追跡処理部５０は、動き補正処理を行う。つまり、追跡処理部５０は、現トラック集合existTr.bboxが持つ各トラックに対して、映像解析処理部２０から渡された統合動き予測モデルintegrateMによる動き補正の処理を行う。 Next, in step S12, the tracking processing unit 50 performs motion correction processing. That is, the tracking processing unit 50 performs motion correction processing using the integrated motion prediction model integrateM passed from the video analysis processing unit 20 for each track of the current track set existTr.bbox.

Next, in step S13, the tracking processing unit 50 calculates the degree of overlap of the areas. Specifically, the tracking processing unit 50 expands each rectangular area obtained as a result of the motion correction processing in step S12 by a factor of sc. This expansion covers the errors that the motion correction processing may contain. The expansion factor sc is an appropriately chosen value of 1.0 or more; for example, 1.5 ≤ sc ≤ 2.5. In this embodiment, sc = 2.0 as an example. The expansion leaves the center position of the rectangular area unchanged and multiplies its size by sc. The tracking processing unit 50 then calculates the degree of overlap between the expanded rectangular areas and Det(t).bbox passed from the face recognition processing unit 40. The tracking processing unit 50 then applies a predetermined threshold to each element of the degree-of-overlap matrix (the IoU described above) to create the thresholded matrix GateMtx. That is, in GateMtx, areas whose degree of overlap is below the predetermined threshold are treated as areas that do not overlap each other at all.
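The expansion and thresholding of step S13 can be sketched as follows, for illustration only. Several points are assumptions not stated in the description: boxes are (x1, y1, x2, y2) tuples, "size" is read as width and height (each multiplied by sc), GateMtx is represented as a Boolean matrix, and the threshold value 0.05 is hypothetical.

```python
def iou(box_a, box_b):
    """Conventional IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def expand_box(box, sc=2.0):
    """Scale width and height by sc while keeping the center fixed."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * sc / 2, (y2 - y1) * sc / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def gate_matrix(track_boxes, det_boxes, sc=2.0, iou_threshold=0.05):
    """True where the expanded track box overlaps the detection enough to be a candidate."""
    expanded = [expand_box(b, sc) for b in track_boxes]
    return [[iou(tb, db) >= iou_threshold for db in det_boxes] for tb in expanded]
```

With sc = 1.0 a detection that merely touches the corner of a track box is gated out, whereas with sc = 2.0 it remains a candidate; this is the tolerance for motion correction error that the expansion is meant to provide.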

つまり、追跡処理部５０は、前記第１位置情報と前記第２位置情報との重なり度合いに基づいて、前記トラックを更新する処理を行う。 That is, the tracking processing unit 50 performs processing for updating the track based on the degree of overlap between the first position information and the second position information.

Next, in step S14, the tracking processing unit 50 compares the feature amounts of the face areas held by the current tracks with the feature amounts of the face areas at time t passed from the face recognition processing unit 40. That is, the tracking processing unit 50 obtains the matrix DisMtx representing the distances between the feature amounts existTr.feat of the current tracks and the feature amounts Det(t).feat passed from the face recognition processing unit 40.

Next, in step S15, the tracking processing unit 50 performs area filtering. That is, the tracking processing unit 50 uses GateMtx obtained in step S13 to filter out the invalid components of the matrix DisMtx obtained in step S14. In other words, when, as a result of the threshold processing in step S13, the face area of a track and the face area of a face detection area are treated as not overlapping each other at all, the tracking processing unit 50 does not associate that track with that face detection area.
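One simple way to realize this filtering, shown here only as a sketch, is to overwrite the gated-out elements of DisMtx with a sentinel that can never be selected. The Boolean form of GateMtx, the sentinel value, and the function name apply_gate are assumptions made for this illustration.

```python
INVALID = float("inf")  # sentinel marking a pair that must never be associated


def apply_gate(dis_mtx, gate_mtx):
    """Keep a distance only where the gate allows the (track, detection) pair."""
    return [
        [d if allowed else INVALID for d, allowed in zip(dis_row, gate_row)]
        for dis_row, gate_row in zip(dis_mtx, gate_mtx)
    ]
```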

Next, in step S16, the tracking processing unit 50 determines, based on DisMtx obtained as a result of step S15, the correspondence between the face area tracks up to time (t-1) and the face detection areas at time t. In other words, the tracking processing unit 50 associates the tracks at time (t-1) with the face detection areas at time t based on the degree of overlap between areas (after threshold processing) and the distances between the feature amounts of the areas. In this step, the tracking processing unit 50 obtains the correspondence using, for example, the Hungarian algorithm (Hungarian method). The Hungarian algorithm itself is existing technology. In other words, the tracking processing unit 50 attempts to associate face area track numbers (tr_no) with face detection area numbers (det_no). The tracking processing unit 50 branches the processing of the face areas according to the obtained correspondence, as follows. When a face area track and a face detection area correspond to each other (matched_pairs), the process proceeds to step S17. When no face area track corresponds to a face detection area (unmatched_dets), the process proceeds to step S18. When no face detection area corresponds to a face area track (unmatched_tracks), the process proceeds to step S19. Here, matched_pairs is a set of combinations of a face area track number and a face detection area number (matched_pairs ⊆ {(tr_no, det_no)}). unmatched_dets is a set of face detection area numbers (unmatched_dets ⊆ {det_no}), and unmatched_tracks is a set of face area track numbers (unmatched_tracks ⊆ {tr_no}).
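The three-way split of step S16 can be illustrated with the following sketch. It is not the Hungarian algorithm: to stay dependency-free it uses an exhaustive search, which gives the same optimal assignment for small matrices but scales poorly. The tie-breaking rule (prefer more matched pairs, then smaller total distance), the use of infinity for filtered-out pairs, and the function name associate are assumptions made here.

```python
INVALID = float("inf")  # value assumed to mark pairs removed by the area filter


def associate(dis_mtx, n_dets):
    """Return (matched_pairs, unmatched_dets, unmatched_tracks) for a filtered DisMtx."""
    n_tracks = len(dis_mtx)
    best = {"key": None, "pairs": []}

    def search(tr_no, used, pairs, cost):
        if tr_no == n_tracks:
            key = (-len(pairs), cost)  # more matches first, then lower total distance
            if best["key"] is None or key < best["key"]:
                best["key"], best["pairs"] = key, list(pairs)
            return
        search(tr_no + 1, used, pairs, cost)  # option: leave this track unmatched
        for det_no in range(n_dets):
            d = dis_mtx[tr_no][det_no]
            if det_no not in used and d != INVALID:
                pairs.append((tr_no, det_no))
                search(tr_no + 1, used | {det_no}, pairs, cost + d)
                pairs.pop()

    search(0, frozenset(), [], 0.0)
    matched_pairs = best["pairs"]
    matched_tracks = {t for t, _ in matched_pairs}
    matched_dets = {d for _, d in matched_pairs}
    unmatched_tracks = [t for t in range(n_tracks) if t not in matched_tracks]
    unmatched_dets = [d for d in range(n_dets) if d not in matched_dets]
    return matched_pairs, unmatched_dets, unmatched_tracks
```

In practice a polynomial-time solver for the linear assignment problem would replace the exhaustive search; the outputs would then feed steps S17, S18, and S19 respectively.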

なお、追跡処理部５０は、ステップＳ１６における条件分岐を、領域ごと（matched_pairごと、unmatched_detごと、そしてunmatched_tracksごと）に行う。そして、追跡処理部５０は、ステップＳ１６で分類された領域ごとに、次のステップＳ１７、Ｓ１８、Ｓ１９のそれぞれの処理を行う。 Note that the tracking processing unit 50 performs conditional branching in step S16 for each region (each matched_pair, each unmatched_det, and each unmatched_tracks). Then, the tracking processing unit 50 performs the processing of steps S17, S18, and S19 for each region classified in step S16.

次に、ステップＳ１７において、追跡処理部５０は、該当する顔検出領域（Detection）の情報を、対応する既存の顔領域トラック（Track）に追加する。 Next, in step S17, the tracking processing unit 50 adds the information of the corresponding face detection area (Detection) to the corresponding existing face area track (Track).

次に、ステップＳ１８において、追跡処理部５０は、該当する顔検出領域（unmatched detection）に基づいて新しい顔領域トラック（Track）を登録する。つまり、追跡処理部５０は、その顔検出領域を、新たなトラックとして、既存トラック集合に追加する。 Next, in step S18, the tracking processing unit 50 registers a new face region track (Track) based on the corresponding face detection region (unmatched detection). That is, the tracking processing unit 50 adds the face detection area to the existing track set as a new track.

Next, in step S19, the tracking processing unit 50 registers the corresponding face area track (unmatched track) as a lost track or a deleted track. Specifically, the tracking processing unit 50 may put an existing track into the lost state when no face detection area corresponds to it any longer and, if the lost state continues for a predetermined period, terminate the track (deleted) and exclude it from subsequent tracking processing.
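The lost and deleted states can be sketched as a small state machine, shown here for illustration. The class TrackState, the step-count form of the "predetermined period", the default of 3 steps, and the rule that a re-matched track returns to the active state are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass

MAX_LOST_STEPS = 3  # hypothetical value for the "predetermined period"


@dataclass
class TrackState:
    tr_no: int
    status: str = "active"  # one of "active", "lost", "deleted"
    lost_steps: int = 0


def mark_unmatched(track, max_lost_steps=MAX_LOST_STEPS):
    """Step S19: the track found no detection in this time step."""
    if track.status == "deleted":
        return track
    track.status = "lost"
    track.lost_steps += 1
    if track.lost_steps >= max_lost_steps:
        track.status = "deleted"  # excluded from subsequent tracking
    return track


def mark_matched(track):
    """Assumed counterpart: a track that is matched again leaves the lost state."""
    if track.status != "deleted":
        track.status = "active"
        track.lost_steps = 0
    return track
```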

すべての領域について上記ステップＳ１７、Ｓ１８、Ｓ１９の処理が完了すると、次にステップＳ２０において、追跡処理部５０は、Tracksを更新する。つまり、追跡処理部５０は、ステップＳ１６で判定した対応関係に基づいて、トラックを更新し、時刻ｔにおけるトラック集合Tr(t)を生成する。 When the processes of steps S17, S18, and S19 are completed for all areas, the tracking processing unit 50 updates Tracks in step S20. That is, the tracking processing unit 50 updates the tracks based on the correspondence determined in step S16, and generates the track set Tr(t) at time t.

Note that, when a track is updated, if the area tr[i](t) is a blurred face area (is_blur = 1), the tracking processing unit 50 forcibly sets the recognition score of that area to the lowest value (normally 0.0). That is, the tracking processing unit 50 sets tr[i](t).recog_score = 0.0 for such an area.

一方、ステップＳ２１に進んだ場合（ステップＳ１１の判定においてシーンチェンジ有だった場合）には、追跡処理部は、トラックをリセットする。即ち、時刻ｔにおけるトラックをTr(t)＝[]（空集合）とする。つまり、この場合、Tr(t-1)までのトラック情報は以後において使用されなくなる。 On the other hand, when proceeding to step S21 (when there is a scene change in the determination of step S11), the tracking processing unit resets the track. That is, the track at time t is Tr(t)=[] (empty set). That is, in this case, track information up to Tr(t-1) is no longer used.

That is, when a scene change is detected, the tracking processing unit 50 resets the tracks to the initial state.

以上説明した図４のフローチャートの処理を繰り返した結果として、追跡処理部５０は、顔認識結果の情報を持つ複数の顔領域トラックを、出力する。追跡処理部５０は、この複数の顔領域トラックのデータを、出力整形処理部６０に渡す。 As a result of repeating the processing of the flowchart of FIG. 4 described above, the tracking processing unit 50 outputs a plurality of face area tracks having face recognition result information. The tracking processing unit 50 passes the data of the plurality of face area tracks to the output shaping processing unit 60 .

The output shaping processing unit 60 shapes and outputs the tracking result data passed from the tracking processing unit 50. Specifically, the output shaping processing unit 60 shapes the face recognition results for each frame image held by the tracks output by the tracking processing unit 50 (the tracks updated by the tracking processing of FIG. 4) so that they are output in descending order of recognition score. Taking the track tr[k] at time t as an example (k is an index identifying a particular track), this works as follows.

Up to time (t-1), the track contains the following data related to face recognition.
tr[k](t-1).recog_scores is the list of recognition scores up to time (t-1). Here, the recognition scores are assumed to be arranged from highest to lowest (that is, in descending order).
tr[k](t-1).recog_ids is the list of person IDs corresponding to tr[k](t-1).recog_scores above. That is, tr[k](t-1).recog_ids holds the person IDs (recognition results) of the track up to time (t-1), arranged in descending order of recognition score.

ここで、時刻ｔにおける顔検出領域det[j]が、トラックtr[k]に追加される場合を想定する。すると、顔検出領域det[j]に関する顔認識結果であるdet[j].recog_scoresおよびdet[j].recog_idsも、認識結果に関する情報としてトラックtr[k]に追加されることになる。このとき、トラックtr[k]が元々持っていた認識スコアおよび人物ＩＤ（認識結果）の情報に加えて、新たに追加される情報も含めて、認識スコアの降順でのソーティングが行われる。このソーティングを、出力整形処理部６０が行う。この処理は、下の式（６）および式（７）で表わされる。 Here, it is assumed that the face detection area det[j] at time t is added to track tr[k]. Then, det[j].recog_scores and det[j].recog_ids, which are the face recognition results for the face detection area det[j], are also added to track tr[k] as information about the recognition results. At this time, in addition to the information of the recognition score and person ID (recognition result) that the track tr[k] originally had, newly added information is included, and sorting is performed in descending order of the recognition score. This sorting is performed by the output shaping processing unit 60 . This process is represented by equations (6) and (7) below.

tr[k](t).recog_scores = sort([tr[k](t-1).recog_scores + det[j].recog_scores], topK)   (6)
tr[k](t).recog_ids = argsort([tr[k](t-1).recog_ids + det[j].recog_ids], topK)   (7)

Note that the operation sort(listA, topK) in equation (6) sorts the elements of the value list listA by value (here, in descending order) and retains the top topK entries. The operator "+" in equation (6) concatenates lists. That is, equation (6) represents sorting the list formed by concatenating the recognition score list tr[k](t-1).recog_scores and the recognition score list det[j].recog_scores.

Similarly, the operation argsort(listA, topK) in equation (7) sorts listA in descending order using the corresponding recog_scores (recognition scores) as the sort key, and retains the top topK entries. Here too, the operator "+" concatenates lists. That is, equation (7) represents sorting the person ID list tr[k](t-1).recog_ids and the person ID list det[j].recog_ids into the order that corresponds to equation (6).
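Equations (6) and (7) can be read as a single concatenate, sort, and truncate operation applied to both lists with the same ordering. The following sketch takes that literal reading; the function name merge_recognition and the choice to preserve the original order among equal scores are assumptions.

```python
def merge_recognition(track_scores, track_ids, det_scores, det_ids, top_k):
    """Concatenate, sort by score in descending order, and keep the top_k entries."""
    scores = list(track_scores) + list(det_scores)  # "+" in equation (6)
    ids = list(track_ids) + list(det_ids)           # "+" in equation (7)
    # One shared ordering keeps each person ID aligned with its score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [scores[i] for i in order], [ids[i] for i in order]
```

Under this literal reading the same person ID can appear more than once in the merged list; whether duplicates are collapsed to one entry per person is not stated by the equations themselves.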

As described above, in this embodiment, the track update processing is repeated so that, for a given track, the maximum recognition score for a given person ID along the time axis is retained. As a result, face recognition for the track can be based on the recognition results from periods in which the score of the face area is high, without being affected by the recognition results from periods in which the score is low (including periods in which the face area is blurred).

つまり、顔追跡装置１は、トラック単位で、認識スコアの高い順にソートされた顔認識結果のデータ（人物ＩＤおよび認識スコア）を保持し、更新する。これにより、処理対象の映像中に登場する人物に対して、安定した認識結果を得ることが可能となる。 In other words, the face tracking device 1 retains and updates face recognition result data (person ID and recognition score) sorted in descending order of recognition score for each track. This makes it possible to obtain a stable recognition result for a person appearing in the video to be processed.

図５は、あるトラックについての認識スコアの時間推移の例を示す概略図である。図示するように、このトラックは、認識結果の候補として、人物Ａおよび人物Ｂの人物ＩＤを持つものである。当該トラックに関して、人物Ａについてのスコアは、時刻ｔ１において０．５であり、時刻ｔ２において０．８である。一方、人物Ｂについてのスコアは、時刻ｔ１において０．６であり、時刻ｔ２において０．６である。 FIG. 5 is a schematic diagram showing an example of time transition of the recognition score for a certain track. As shown, this track has person IDs of person A and person B as candidates for recognition results. For that track, the score for person A is 0.5 at time t1 and 0.8 at time t2. On the other hand, the score for person B is 0.6 at time t1 and 0.6 at time t2.

Such scores arise, for example, when the correct face recognition result for this track is "person A": at time t1 person A was facing sideways, so the recognition score for "person A" dropped to 0.5, and at time t2 person A was facing the front, so the recognition score rose to 0.8. That is, in this case, at time t1 the score of person A, who is the correct answer, is lower than the score of person B, who is an incorrect answer. Note that t1 < t2 here.

In the above case, the data held by the track at time t1 is as follows.
tr(t1).recog_scores = [0.6,0.5]
tr(t1).recog_ids = [B,A]
At time t2, the data held by the track is as follows.
tr(t2).recog_scores = [0.8,0.6]
tr(t2).recog_ids = [A,B]

Integrating the recognition scores at time t1 and time t2 gives the following.
tr.recog_scores = [0.8,0.6]
tr.recog_ids = [A,B]
In other words, as the recognition result of the track over time, the face tracking device 1 gives priority to the higher-scoring state in which person A is facing the front. As a result, the face tracking device 1 outputs person A, the correct answer, as a higher-ranked recognition result than person B. In this way, the face tracking device 1 takes the person with the highest recognition score over the duration of the track as the recognition result. This enables the face tracking device 1 to realize more reliable recognition processing.
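The integration in this example can be reproduced with a short sketch that keeps, for each person ID, the maximum score seen over time and then ranks the persons by that maximum. The function name integrate_over_time and the per-person dictionary are assumptions; the input values are the scores given in the description of FIG. 5 (person A: 0.5 at t1 and 0.8 at t2; person B: 0.6 at both times).

```python
def integrate_over_time(observations):
    """observations: iterable of (recog_scores, recog_ids), one pair per time step."""
    best = {}
    for scores, ids in observations:
        for score, person_id in zip(scores, ids):
            # Retain the maximum score observed so far for this person ID.
            if score > best.get(person_id, float("-inf")):
                best[person_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [score for _, score in ranked], [pid for pid, _ in ranked]


t1 = ([0.6, 0.5], ["B", "A"])
t2 = ([0.8, 0.6], ["A", "B"])
scores, ids = integrate_over_time([t1, t2])
```

Here scores is [0.8, 0.6] and ids is ["A", "B"], so person A is ranked first even though person B had the higher score at time t1.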

なお、前述の通り、ボケ検出結果is_blurが１のときには、顔追跡装置１は、当該トラックのその時刻における顔認識スコアを強制的０．０とする。つまり、トラックの顔領域がボケているときの認識結果は、時間を通したトラック全体の認識スコアの計算に影響を与えない。 As described above, when the blur detection result is_blur is 1, the face tracking device 1 forcibly sets the face recognition score of the track at that time to 0.0. That is, the recognition result when the face area of the track is blurred does not affect the calculation of the recognition score for the entire track over time.

以上説明したように、顔追跡装置１は、映像内に出現する顔を追跡しながら、その顔の人物ＩＤを特定する。これにより、顔追跡装置１は、トラックごとの、人物ＩＤの候補と、その人物ＩＤに対応付けられたスコアとのデータを求め、出力することができる。 As described above, the face tracking device 1 identifies the person ID of the face appearing in the video while tracking the face. As a result, the face tracking device 1 can obtain and output data of person ID candidates and scores associated with the person IDs for each track.

FIG. 6 is a schematic diagram showing an example of the data structure obtained as a result of the tracking processing by the face tracking device 1. As illustrated, the face tracking device 1 can output, for a specific video content, a set of data whose entities are tracks. The primary key of the illustrated track is, for example, the track ID. The video content to be processed can have a unique content ID. That is, one content ID is associated with multiple (many) tracks. The attributes of a track include the start time (start frame), the end time (end frame), the recognition scores (list), and the person IDs (list). The start time and the end time are each relative times within the video content. The start time corresponds to the start frame, and the end time corresponds to the end frame. The accuracy of the start time and end time of each track depends on how often the tracking processing is performed (see FIG. 2). That is, by performing the tracking processing sufficiently often (for example, once per second or once every 0.5 seconds), the accuracy of the start and end times of the tracks becomes sufficient for practical use. However, there is no restriction on the frequency of the tracking processing. The recognition scores are a list of score values corresponding to the person IDs, arranged in descending order of score. The person IDs are a list of the persons appearing in the video, sorted in descending order of the corresponding recognition scores. Thus, by searching the data of the set of tracks using a person ID as the key, the scenes (start time and end time) in the video in which that person appears can be obtained with a predetermined accuracy.
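The track record and the person ID search described for FIG. 6 might look like the following sketch. The class and field names, the use of seconds for relative times, and the ordering of results by the person's score are assumptions introduced for illustration; they are not taken from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class TrackRecord:
    track_id: str      # primary key
    content_id: str    # video content the track belongs to
    start_time: float  # relative time in the content (seconds assumed)
    end_time: float
    recog_scores: list = field(default_factory=list)  # descending order
    recog_ids: list = field(default_factory=list)     # aligned with recog_scores


def find_scenes(tracks, person_id):
    """Return (content_id, start_time, end_time) for tracks listing person_id."""
    hits = []
    for tr in tracks:
        if person_id in tr.recog_ids:
            score = tr.recog_scores[tr.recog_ids.index(person_id)]
            hits.append((score, tr.content_id, tr.start_time, tr.end_time))
    hits.sort(key=lambda h: h[0], reverse=True)  # highest-scoring track first
    return [(content_id, start, end) for _, content_id, start, end in hits]
```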

図６内に示すように、人物ＩＤは、人物をユニークに特定するための識別情報である。人物ＩＤは、１つまたは複数の人物名（表記）のデータに関連付けられる。また、人物ＩＤは、１つまたは複数のサムネール画像のデータに関連付けられる。つまり、人物名表記から、人物ＩＤを求めることが可能である。また、サムネール画像が選択されると、人物ＩＤを求めることが可能である。 As shown in FIG. 6, the person ID is identification information for uniquely identifying a person. A person ID is associated with one or more person name (notation) data. Also, the person ID is associated with one or more thumbnail image data. That is, it is possible to obtain the person ID from the person name notation. Also, when a thumbnail image is selected, a person ID can be obtained.

FIG. 7 is a schematic diagram showing an example of the information path for retrieving, starting from a person name or a thumbnail image, the scenes of video content in which that person appears. As described with reference to FIG. 6, the person ID can be identified when, for example, a person name is entered as a character string or a specific thumbnail image is selected from a large number of presented thumbnail images (images of persons' faces). Once the person ID is obtained, searching the data of the set of tracks shown in FIG. 6 identifies one or more sets of the content ID of the video content corresponding to that person ID and the relative position within it (start time (start frame) and end time (end frame)). This makes it possible to present the video scenes corresponding to the person ID to a user or the like.
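The information path of FIG. 7 can be sketched as two lookups: from a name or thumbnail to a person ID, and from the person ID to scenes. The dictionary-based indexes, the key names, and the sample identifiers in the usage note are hypothetical and serve only to illustrate the flow.

```python
def resolve_person_id(query, name_index, thumbnail_index):
    """Map a person-name string or a thumbnail identifier to a person ID (None if unknown)."""
    if query in name_index:
        return name_index[query]
    return thumbnail_index.get(query)


def scenes_for(query, name_index, thumbnail_index, tracks):
    """Return (content_id, start, end) for every track that lists the resolved person."""
    person_id = resolve_person_id(query, name_index, thumbnail_index)
    if person_id is None:
        return []
    return [
        (t["content_id"], t["start"], t["end"])
        for t in tracks
        if person_id in t["recog_ids"]
    ]
```

For example, with name_index = {"Person A": "id_A"} and a track whose recog_ids contains "id_A", scenes_for("Person A", ...) returns that track's content ID together with its start and end times.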

FIG. 8 is a schematic diagram showing an example of a GUI (graphical user interface) for the video scene search described with reference to FIG. 7. In other words, the figure shows an example of the GUI of a video search system based on face recognition results. This video search system can search for video using the track data obtained as a result of the processing by the face tracking device 1 of this embodiment. The illustrated screen is a display screen of a PC, a tablet terminal, or a portable terminal such as a smartphone. In the figure, 701 is a video display area, 702 is a person name display area, 703 is a thumbnail image display area, and 704 is a scene identification information display area. The user of this system can select and click (or tap) one of the person names (character strings) displayed in the person name display area 702, or one of the thumbnail images (face images) displayed in the thumbnail image display area 703. As a result, the video search system can extract the tracks corresponding to that person, as shown in FIG. 7. The video search system then displays a specific scene (relative position) of a specific video content in the video display area 701 based on the information of the extracted tracks. When there are multiple tracks related to the selected person, the video search system may, for example, select a track that has a high score for that person based on the recognition score values.

Note that the face tracking device 1 itself may include a search processing unit and perform the video search shown in FIG. 8. In that case, the search processing unit acquires, from the GUI screen or the like, a search key that is one of the following: a person ID (person identification information), a person name character string associated with a person ID, or the designation of a thumbnail image associated with a person ID. Based on the acquired search key, the search processing unit searches the data of the set of tracks output by the output shaping processing unit 60 and thereby identifies one or more tracks associated with a specific person ID. Each track identified in this way has a content ID and information on the relative position within the video content. The search processing unit can therefore acquire the video scene identified by a track from a storage device or the like, decode it, and display it as video on the screen (video display area 701).

FIG. 9 is a block diagram showing an example of the internal configuration of the face tracking device. The face tracking device 1 can be realized using a computer. As illustrated, the computer includes a central processing unit 901, a RAM 902, an input/output port 903, input/output devices 904 and 905 and the like, and a bus 906. The computer itself can be realized using existing technology. The central processing unit 901 executes the instructions contained in a program read from the RAM 902 or the like. In accordance with each instruction, the central processing unit 901 writes data to the RAM 902, reads data from the RAM 902, and performs arithmetic and logical operations. The RAM 902 stores data and programs. Each element of the RAM 902 has an address and can be accessed using that address. RAM is an abbreviation for "random access memory". The input/output port 903 is a port through which the central processing unit 901 exchanges data with external input/output devices and the like. The input/output devices 904 and 905 exchange data with the central processing unit 901 via the input/output port 903. The bus 906 is a common communication path used inside the computer. For example, the central processing unit 901 reads and writes data in the RAM 902 via the bus 906. Also, for example, the central processing unit 901 accesses the input/output port 903 via the bus 906.

Note that at least part of the functions of the face tracking device 1 can be realized by a computer. In that case, a program for realizing the functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on that recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, CD-ROMs, DVD-ROMs, and USB memories, and to storage devices such as hard disks built into a computer system. That is, the "computer-readable recording medium" may be a non-transitory computer-readable recording medium. Furthermore, the "computer-readable recording medium" may also include a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as volatile memory inside a computer system that serves as a server or client in that case. The program may realize only part of the functions described above, or may realize the functions described above in combination with a program already recorded in the computer system.

Although an embodiment has been described above, the present invention can also be implemented in the following modifications.

[First modification]
For example, the face recognition processing by the face recognition processing unit 40 may be performed using machine learning means such as a neural network. In that case, the image feature amount for each person ID held by the face database 30 is represented and held as a set of internal parameter values in the neural network. That is, the information relating image feature amounts to person IDs is embedded in the parameter values of the nodes of the neural network.

[Second modification]
In the above embodiment, the face region is tracked and the track is updated from the past side of the video toward the future side (in the forward direction). This direction may be reversed. That is, the face tracking device 1 may track the face region and update the track from the future side of the video toward the past side (in the backward direction). The face tracking device 1 may also combine forward tracking and backward tracking. This makes it possible to identify a plurality of tracks in a video file from start to end and to assign recognition results to those tracks.

As described above, the present embodiment (including its modifications; the same applies hereinafter) combines face recognition processing of a person appearing in a video with processing for tracking a face region, and thereby improves both face recognition accuracy and tracking accuracy. The face tracking device 1 of the present embodiment can assign consistent recognition results to the regions of a video in which a person appears over time.
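The combination of tracking and recognition described here can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the box format, the `min_iou` threshold of 0.5, and the rule of keeping the higher score per person ID are all assumptions made for the example, and the names `Track`, `iou`, and `update_track` are hypothetical.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (one possible overlap measure)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    box: Box                                                # second position information
    scores: dict[str, float] = field(default_factory=dict)  # second set: person ID -> score


def update_track(track: Track, motion: tuple[float, float],
                 det_box: Box, det_scores: dict[str, float],
                 min_iou: float = 0.5) -> bool:
    """Shift the track by the predicted motion, then merge a recognition result if it overlaps."""
    dx, dy = motion
    x1, y1, x2, y2 = track.box
    predicted = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
    if iou(predicted, det_box) >= min_iou:
        track.box = det_box
        # Assumed merge rule: keep the higher score for each person ID.
        for pid, score in det_scores.items():
            track.scores[pid] = max(track.scores.get(pid, 0.0), score)
        return True
    # No matching recognition result: keep following the motion prediction only.
    track.box = predicted
    return False
```

With this rule, a track whose predicted box overlaps a recognized face region takes over that region's position and accumulates its person IDs, while a track with no matching recognition result simply moves with the motion prediction.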

In addition, the present embodiment uses the result of scene change (cut) detection during the face tracking process. That is, when a scene change is detected, the present embodiment initializes the tracks that were being tracked. As a result, appropriate tracking can be performed even on, for example, a long video that includes scene changes. In other words, tracks from different scenes are not confused with one another.
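The reset on a scene change can be sketched as below. The class name `TrackStore` and the choice to move the tracks into a `finished` list rather than discard them are assumptions made for this example; the text only states that the tracks being tracked are initialized.

```python
class TrackStore:
    """Holds the tracks currently being followed and those already closed."""

    def __init__(self) -> None:
        self.active: list = []    # tracks being updated in the current scene
        self.finished: list = []  # tracks closed by an earlier scene change (assumed behavior)

    def start(self, track) -> None:
        """Begin following a new track."""
        self.active.append(track)

    def on_frame(self, scene_change: bool) -> None:
        """Reset the active tracks whenever a scene change is reported for the frame."""
        if scene_change:
            self.finished.extend(self.active)
            self.active = []
```

After a cut, `active` is empty, so face regions in the new scene start fresh tracks instead of being matched against tracks from the previous scene.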

In addition, to deal with blurred images, which lower the accuracy of recognition processing, the present embodiment actively performs blurred-image detection and, when a blurred image is detected, sets its recognition score to the lowest value. This reduces the influence of blur on recognition.
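A minimal sketch of this handling of blurred regions follows. The constant `LOWEST_SCORE = 0.0` and the function name `scores_for_update` are assumptions; the text does not state the score range.

```python
LOWEST_SCORE = 0.0  # assumed minimum of the recognition score range


def scores_for_update(det_scores: dict[str, float], is_blurred: bool) -> dict[str, float]:
    """Return the scores to feed into a track update, demoting results from blurred regions."""
    if is_blurred:
        # A blurred face region contributes only the lowest possible score.
        return {pid: LOWEST_SCORE for pid in det_scores}
    return dict(det_scores)
```

Because a blurred frame can only contribute the lowest score, it cannot displace a confident result that the track obtained from a sharp frame.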

In addition, the present embodiment runs two kinds of processing at different rates (frequencies) and combines them. Specifically, the video analysis processing by the video analysis processing unit 20 and the recognition and tracking processing by the face recognition processing unit 40 and the tracking processing unit 50 can be performed at different rates. To make this possible, the present embodiment accumulates the motion predictions that result from the two types of motion prediction processing and uses them during the tracking processing. This allows the processing that needs it to run at a high rate (frequency) while reducing the processing load on the device as a whole.
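The different processing rates can be sketched as follows. Accumulating the motion predictions as a summed displacement, the interval `recognize_every=3`, and the names `MotionAccumulator` and `run` are assumptions made for this example.

```python
class MotionAccumulator:
    """Sums per-frame displacement until the tracking side collects it."""

    def __init__(self) -> None:
        self.dx = 0.0
        self.dy = 0.0

    def add(self, dx: float, dy: float) -> None:
        """Add the motion predicted for one frame (high-rate side)."""
        self.dx += dx
        self.dy += dy

    def take(self) -> tuple[float, float]:
        """Hand over the accumulated motion and start a new accumulation (low-rate side)."""
        total = (self.dx, self.dy)
        self.dx = 0.0
        self.dy = 0.0
        return total


def run(per_frame_motion, recognize_every: int = 3) -> list[tuple[float, float]]:
    """Analyze motion on every frame, but hand results to tracking only every N frames."""
    acc = MotionAccumulator()
    handed_over = []
    for i, (dx, dy) in enumerate(per_frame_motion, start=1):
        acc.add(dx, dy)
        if i % recognize_every == 0:
            handed_over.append(acc.take())
    return handed_over
```

In this sketch the lightweight motion analysis runs on every frame, while the heavier recognition and tracking step would run once per `recognize_every` frames and still see the full motion since its previous run.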

In addition, the present embodiment outputs data in units of trajectories (tracks) of face regions and attaches recognition result data to each track. By outputting data in this form, the present embodiment can give the most probable recognition result or results for each track. Such output data is well suited to searching, with a person ID as the key, for the video scenes in which that person appears. This makes video search that starts from a person easy. The present embodiment also presented an example of a GUI for such video search.
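A search over track-unit output keyed by person ID can be sketched as below. The record layout (`track_id`, `start_frame`, `end_frame`, `results`), the sample values, and the `min_score` threshold are hypothetical; the text specifies only that recognition results are attached to each track.

```python
# Hypothetical track-unit output: each track carries its recognition results.
sample_tracks = [
    {"track_id": 1, "start_frame": 0, "end_frame": 120,
     "results": [("person_A", 0.92), ("person_B", 0.31)]},
    {"track_id": 2, "start_frame": 200, "end_frame": 260,
     "results": [("person_B", 0.88)]},
    {"track_id": 3, "start_frame": 300, "end_frame": 420,
     "results": [("person_A", 0.40), ("person_C", 0.35)]},
]


def find_scenes(tracks, person_id: str, min_score: float = 0.5) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) of every track recognized as the given person."""
    hits = []
    for track in tracks:
        for pid, score in track["results"]:
            if pid == person_id and score >= min_score:
                hits.append((track["start_frame"], track["end_frame"]))
                break  # one hit per track is enough
    return hits
```

For example, `find_scenes(sample_tracks, "person_A")` returns only the first track's frame range, because the third track's score for that person is below the assumed threshold.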

In other words, the present embodiment makes it possible to perform face recognition of people appearing in a video in a consistent form that also copes with changes over time. Because the data structure is designed so that recognition results can easily be used to add metadata to the target video, it becomes easier to develop a search system that can effectively search for the people who appear in it.

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment, and designs and the like that do not depart from the gist of the invention are also included.

The present invention can be used, for example, in video content production, video content distribution, video content management, and video content search services. However, the scope of use of the present invention is not limited to these examples.

1 face tracking device
10 video acquisition unit
20 video analysis processing unit
30 face database
40 face recognition processing unit
50 tracking processing unit
60 output shaping processing unit
701 video display area
702 person name display area
703 thumbnail image display area
704 scene identification information display area
901 central processing unit
902 RAM
903 input/output port
904, 905 input/output device
906 bus

Claims

1. A face tracking device comprising:
a video analysis processing unit that performs motion prediction processing for a face region included in a video and outputs information on a motion prediction result of the face region;
a face recognition processing unit that extracts a feature amount of the face region included in the video, performs face recognition processing of the face region based on the feature amount, and outputs first position information, which is position information of the face region, and a first set, which is a set of person identification information identifying a person as a result of the face recognition processing and a recognition score corresponding to the person identification information; and
a tracking processing unit that obtains a track of the face region in the time direction,
wherein the track has second position information, which is position information of the face region, and a second set, which is a set of person identification information and a recognition score of the face region, and
the tracking processing unit obtains the second position information after movement of the face region based on the information on the motion prediction result of the face region passed from the video analysis processing unit, and updates the track based on the first set passed from the face recognition processing unit and the second set held by the track, and also based on the relationship between the first position information and the second position information, thereby giving new second position information and a new second set to the track.

2. The face tracking device according to claim 1, wherein
the face tracking device performs the processing by the video analysis processing unit more frequently than the processing by the face recognition processing unit,
the video analysis processing unit accumulates the information on the motion prediction results obtained from the motion prediction processing performed a plurality of times, and
the tracking processing unit performs its processing based on the accumulated information on the motion prediction results passed from the video analysis processing unit.

3. The face tracking device according to claim 1 or 2, wherein
the video analysis processing unit further detects a scene change in the video, and
the tracking processing unit resets the track to an initial state when the scene change is detected.

4. The face tracking device according to any one of claims 1 to 3, wherein
the video analysis processing unit further determines whether or not the face region is a blurred region, and
for a face region that is a blurred region, the tracking processing unit sets the recognition score to the lowest value when updating the track of that face region.

5. The face tracking device according to claim 4, wherein
the face recognition processing unit holds in advance, in an accessible form, information that associates a feature amount obtained when the face region is a blurred region with special person identification information, and
when the face recognition processing unit identifies the special person identification information associated with the feature amount of the blurred region as a result of the face recognition processing performed on the face region, the video analysis processing unit determines that the face region is the blurred region.

6. The face tracking device according to any one of claims 1 to 5, wherein
the tracking processing unit updates the track based on the degree of overlap between the first position information and the second position information.

7. The face tracking device according to any one of claims 1 to 6, further comprising
an output shaping processing unit that shapes and outputs data on a set of the tracks in a form in which information on the set of the person identification information and the recognition score is attached to the information on each track.

8. The face tracking device according to claim 7, further comprising
a search processing unit that acquires a search key that is any one of the person identification information, a character string of a person's name associated with the person identification information, and a designation of a thumbnail image associated with the person identification information, and that identifies the track associated with specific person identification information by searching, based on the acquired search key, the data on the set of tracks output by the output shaping processing unit.

9. A program that causes a computer to function as a face tracking device comprising:
a video analysis processing unit that performs motion prediction processing for a face region included in a video and outputs information on a motion prediction result of the face region;
a face recognition processing unit that extracts a feature amount of the face region included in the video, performs face recognition processing of the face region based on the feature amount, and outputs first position information, which is position information of the face region, and a first set, which is a set of person identification information identifying a person as a result of the face recognition processing and a recognition score corresponding to the person identification information; and
a tracking processing unit that obtains a track of the face region in the time direction,
wherein the track has second position information, which is position information of the face region, and a second set, which is a set of person identification information and a recognition score of the face region, and
the tracking processing unit obtains the second position information after movement of the face region based on the information on the motion prediction result of the face region passed from the video analysis processing unit, and updates the track based on the first set passed from the face recognition processing unit and the second set held by the track, and also based on the relationship between the first position information and the second position information, thereby giving new second position information and a new second set to the track.