JP2019049829A

JP2019049829A - Target section determination device, model learning device and program

Info

Publication number: JP2019049829A
Application number: JP2017173318A
Authority: JP
Inventors: Shinichi Kojima; Hiroyuki Morisaki; Kazuhisa Nagaishi
Original assignee: Aisin Seiki Co Ltd; Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc; Aisin Corp
Priority date: 2017-09-08
Filing date: 2017-09-08
Publication date: 2019-03-28

Abstract

PROBLEM TO BE SOLVED: To provide a target section determination device, a model learning device, and a program that can accurately determine, in a short processing time, whether at least a part of an individual's face is in a target section. SOLUTION: A target section determination device 10 comprises: an image feature quantity extraction unit 16 that extracts an image feature quantity from an image capturing at least a part of an individual's face; an intra-individual variation parameter estimation unit 18 that estimates, from the image feature quantity extracted by the image feature quantity extraction unit 16, an intra-individual variation parameter relating to an intra-individual variation basis, on the basis of a previously obtained individual difference basis for representing an individual difference component of the image feature quantity and a previously obtained intra-individual variation basis for representing an intra-individual variation component of the image feature quantity; and a target section determination unit 20 that determines whether the at least a part of the face is in a target section on the basis of the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit 18. SELECTED DRAWING: Figure 1

Description

The present invention relates to a target section determination device, a model learning device, and a program, and in particular to a target section determination device, a model learning device, and a program for determining whether an image capturing at least a part of an individual's face belongs to a target section.

Conventionally, techniques have been studied in which images including a speaker's lips are captured continuously by a camera and the speech section in which the speaker is speaking is detected from the lip movement obtained from the captured images.

For example, Patent Document 1 describes a technique in which images including a speaker's lips are captured continuously by a camera while the voice uttered by the speaker is collected, and a deformation amount indicating the degree to which the lip shape has deformed is derived from the continuously captured images. According to this technique, the distance from the camera to the speaker and the orientation of the speaker's face relative to the camera are derived from the images. When the derived distance is within a predetermined range, the face orientation is within a predetermined angular range relative to the camera, and the intensity of the collected voice is at or above a predetermined level, a threshold on the deformation amount, used to determine the speech section in which the speaker is speaking, is decided based on the derived deformation amount, and the speech section is detected from the derived deformation amount using that threshold.

Patent Document 2 describes a technique that compares the lip pattern in a specific image among continuously captured images with the lip-containing patterns in a plurality of consecutive images captured immediately before the specific image, calculates their correlation values, and detects whether the image belongs to a speech section based on the calculated amount of variation.

Patent Document 3 describes a technique that detects a speech section by integrating a feature quantity of acoustic information based on collected sound with a visual feature quantity obtained by smoothing, along the time axis, a lip feature quantity based on captured image information, and that recognizes the utterance based on the detected speech section.

Patent Document 1: Japanese Patent No. 4715738
Patent Document 2: Japanese Patent No. 4650888
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2011-191423

However, in the technique described in Patent Document 1, correction is performed according to the orientation and distance of the speaker's face, so the influence of individual differences arising from the basic shape of the face is not removed. The detection accuracy of the speech section may therefore decrease.

In the techniques described in Patent Documents 2 and 3, a plurality of temporally changing images is used, so processing takes time and detection of the speech section may be delayed.

For this reason, when at least a part of an individual's face, such as the mouth, is the subject, it is desirable to be able to determine a target section such as a speech section accurately and in a short processing time.

The present invention has been made to solve the above problems, and its object is to provide a target section determination device, a model learning device, and a program that can accurately determine, in a short processing time, whether at least a part of an individual's face is in a target section.

To achieve the above object, the target section determination device according to claim 1 comprises: an image feature quantity extraction unit that extracts an image feature quantity from an image capturing at least a part of an individual's face; an intra-individual variation parameter estimation unit that estimates, from the image feature quantity extracted by the image feature quantity extraction unit, an intra-individual variation parameter relating to an intra-individual variation basis, based on a previously obtained individual difference basis for representing an individual difference component of the image feature quantity and a previously obtained intra-individual variation basis for representing an intra-individual variation component of the image feature quantity; and a target section determination unit that determines whether the at least a part of the face is in a target section based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit.

In the target section determination device according to claim 2, in the invention of claim 1, the individual difference basis is obtained based on image feature quantities of a plurality of learning images that capture the same part as the at least a part of the face and represent a reference state of a plurality of individuals, and the intra-individual variation basis is obtained based on image feature quantities from which the individual difference component has been removed, using the individual difference basis, from the image feature quantities of a plurality of learning images that capture the same part as the at least a part of the face and represent states of a plurality of individuals.

In the target section determination device according to claim 3, in the invention of claim 1 or 2, the target section determination unit determines whether the target section applies by using a previously learned model for determining, based on the intra-individual variation parameter, whether the target section applies.

The target section determination device according to claim 4, in the invention of any one of claims 1 to 3, further comprises a voice feature quantity extraction unit that extracts a voice feature quantity from the voice of the individual, and the target section determination unit determines whether the at least a part of the face is in the target section based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit and the voice feature quantity extracted by the voice feature quantity extraction unit.

In the target section determination device according to claim 5, in the invention of any one of claims 1 to 4, the at least a part of the individual's face is the mouth, and the target section is a speech section representing a state in which the mouth is open.

To achieve the above object, the model learning device according to claim 6 comprises: an individual difference basis calculation unit that calculates an individual difference basis for representing an individual difference component of an image feature quantity, based on the image feature quantities extracted from those learning images that represent a reference state among a plurality of states, out of learning images each capturing at least a part of an individual's face and representing one of the plurality of states; an intra-individual variation basis calculation unit that, based on the image feature quantity extracted from each of the learning images representing one of the plurality of states and on the individual difference basis calculated by the individual difference basis calculation unit, removes the individual difference component from the image feature quantities of the learning images using the individual difference basis, and calculates an intra-individual variation basis for representing an intra-individual variation component based on the image feature quantities from which the individual difference component has been removed; and a model generation unit that estimates, for each of model learning images that capture the at least a part of the individual's face and are labeled as to whether they belong to a target section, an intra-individual variation parameter relating to the intra-individual variation basis, based on the image feature quantity extracted from each model learning image, the individual difference basis calculated by the individual difference basis calculation unit, and the intra-individual variation basis calculated by the intra-individual variation basis calculation unit, and that learns a model for determining, based on the intra-individual variation parameter, whether the target section applies, from the intra-individual variation parameter estimated for each model learning image and the target section label assigned to each model learning image.

To achieve the above object, the program according to claim 7 causes a computer to function as: an image feature quantity extraction unit that extracts an image feature quantity from an image capturing at least a part of an individual's face; an intra-individual variation parameter estimation unit that estimates, from the image feature quantity extracted by the image feature quantity extraction unit, an intra-individual variation parameter relating to an intra-individual variation basis, based on a previously obtained individual difference basis for representing an individual difference component of the image feature quantity and a previously obtained intra-individual variation basis for representing an intra-individual variation component of the image feature quantity; and a target section determination unit that determines whether the at least a part of the face is in a target section based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit.

Furthermore, to achieve the above object, the program according to claim 8 causes a computer to function as: an individual difference basis calculation unit that calculates an individual difference basis for representing an individual difference component of an image feature quantity, based on the image feature quantities extracted from those learning images that represent a reference state among a plurality of states, out of learning images each capturing at least a part of an individual's face and representing one of the plurality of states; an intra-individual variation basis calculation unit that, based on the image feature quantity extracted from each of the learning images representing one of the plurality of states and on the individual difference basis calculated by the individual difference basis calculation unit, removes the individual difference component from the image feature quantities of the learning images using the individual difference basis, and calculates an intra-individual variation basis for representing an intra-individual variation component based on the image feature quantities from which the individual difference component has been removed; and a model generation unit that estimates, for each of model learning images that capture the at least a part of the individual's face and are labeled as to whether they belong to a target section, an intra-individual variation parameter relating to the intra-individual variation basis, based on the image feature quantity extracted from each model learning image, the individual difference basis calculated by the individual difference basis calculation unit, and the intra-individual variation basis calculated by the intra-individual variation basis calculation unit, and that learns a model for determining, based on the intra-individual variation parameter, whether the target section applies, from the intra-individual variation parameter estimated for each model learning image and the target section label assigned to each model learning image.

As described above, the target section determination device, the model learning device, and the program of the present invention make it possible to accurately determine, in a short processing time, whether at least a part of an individual's face is in a target section.

A block diagram showing an example of the functional configuration of the target section determination device according to the first embodiment.
A block diagram showing an example of the configuration of a computer that functions as the target section determination device according to the first embodiment.
A block diagram showing an example of the functional configuration of the model learning device according to the first embodiment.
A block diagram showing an example of the configuration of a computer that functions as the model learning device according to the first embodiment.
(A) A diagram showing an example of learning images with the mouth closed according to the embodiment; (B) a diagram showing an example of learning images including both closed-mouth and open-mouth states according to the embodiment.
A diagram showing an example of a plurality of feature points obtained from a learning image according to the embodiment.
A flowchart showing an example of the processing flow of the target section determination processing program according to the first embodiment.
A flowchart showing an example of the processing flow of the model learning processing program according to the first embodiment.
(A) to (D) Diagrams showing an example of the individual difference basis for each resolution obtained by the model learning device according to the embodiment.
(A) to (D) Diagrams showing an example of the intra-individual variation basis for each resolution obtained by the model learning device according to the embodiment.
A block diagram showing an example of the functional configuration of the target section determination device according to the second embodiment.

Hereinafter, an example of a mode for carrying out the present invention is described in detail with reference to the drawings. In this embodiment, as an example, a case is described in which the at least a part of the individual's face is the mouth, the target section is a speech section, and the speech section is determined from a mouth image that includes the mouth of the individual being photographed.

[First Embodiment]
FIG. 1 is a block diagram showing an example of the functional configuration of the target section determination device 10 according to the first embodiment.
As shown in FIG. 1, the target section determination device 10 according to the first embodiment includes an input unit 12, a feature point detection unit 14, an image feature quantity extraction unit 16, an intra-individual variation parameter estimation unit 18, a target section determination unit 20, an output unit 22, and a storage unit 24.

The input unit 12 accepts input of a mouth image, captured by the camera 30, that includes an individual's mouth. At least one mouth image is sufficient.

The feature point detection unit 14 detects a mouth region representing the mouth from the mouth image input through the input unit 12, using a well-known region detection technique such as pattern matching. The feature point detection unit 14 then detects, from the detected mouth region, a plurality of feature points corresponding to the plurality of feature points in a predetermined discrimination model (described later), again using, for example, a well-known pattern matching method. The locations and number of the feature points in the mouth image are not particularly limited, but they are made to match the locations and number of the feature points in the discrimination model.
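The text names pattern matching only as one example of a well-known detection technique and does not specify an algorithm. As a minimal sketch of one such technique, the following Python code locates a template in a grayscale image by exhaustive sum-of-squared-differences matching. The function name `match_template`, the list-of-lists image representation, and the toy pixel values are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def match_template(image: Image, template: Image) -> Tuple[int, int]:
    """Return (row, col) of the top-left corner where the template fits best.

    Best fit means the smallest sum of squared differences (SSD); ties keep
    the first position found in row-major order.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    if tpl_h > img_h or tpl_w > img_w:
        raise ValueError("template must not be larger than the image")

    best_pos = (0, 0)
    best_score = float("inf")
    for r in range(img_h - tpl_h + 1):
        for c in range(img_w - tpl_w + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos


# Toy example: a 2x2 pattern embedded in a 4x4 image (hypothetical values).
toy_image = [
    [0, 0, 0, 0],
    [0, 0, 9, 8],
    [0, 0, 7, 6],
    [0, 0, 0, 0],
]
toy_template = [
    [9, 8],
    [7, 6],
]
region_corner = match_template(toy_image, toy_template)
```

In practice a detector of this kind would be run once to find the mouth region and then again, with smaller templates, to find each feature point inside that region.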

The image feature quantity extraction unit 16 extracts the image feature quantity of the mouth image from the plurality of feature points detected by the feature point detection unit 14. As an example, when the coordinates of each feature point are $(X, Y)$ and the number of feature points is $N$, the image feature quantity of the mouth image is obtained as the following $2N$-dimensional feature vector, where $T$ denotes transposition.

$$[X_1\ Y_1\ \cdots\ X_N\ Y_N]^{T}$$
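The construction of this feature vector can be sketched in a few lines of Python. The function name `landmarks_to_feature_vector` and the sample coordinates are hypothetical; only the interleaved ordering of the X and Y coordinates comes from the text.

```python
from typing import List, Sequence, Tuple


def landmarks_to_feature_vector(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Flatten N (X, Y) feature points into [X_1, Y_1, ..., X_N, Y_N]."""
    vector: List[float] = []
    for x, y in points:
        vector.extend((float(x), float(y)))
    return vector


# Hypothetical coordinates for N = 3 feature points on a mouth contour.
mouth_points = [(10.0, 20.0), (15.0, 18.0), (20.0, 20.0)]
feature_vector = landmarks_to_feature_vector(mouth_points)  # length 2N = 6
```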

The intra-individual variation parameter estimation unit 18 estimates, from the image feature quantity extracted by the image feature quantity extraction unit 16, the intra-individual variation parameter relating to the intra-individual variation basis, based on the previously obtained individual difference basis and intra-individual variation basis.

Here, the individual difference basis is expressed as a matrix made up of a plurality of basis vectors for representing the individual difference component of the image feature quantity. The individual difference basis is obtained in advance by the model learning device 40, described later, based on the image feature quantities of a plurality of learning images in which the mouths of a plurality of subjects are in the reference state. The reference state of the mouth is, for example, the state in which the mouth is closed.

The intra-individual variation basis, on the other hand, is expressed as a matrix made up of a plurality of basis vectors for representing the intra-individual variation component of the image feature quantity. The intra-individual variation basis is obtained in advance by the model learning device 40, based on image feature quantities from which the individual difference component has been removed, using the individual difference basis, from the image feature quantities of a plurality of learning images representing the mouth states of a plurality of subjects. In this case, the mouth states include various states such as a closed mouth, an open mouth, and a half-open mouth.

The individual difference basis and the intra-individual variation basis are stored in advance in the storage unit 24. The storage unit 24 also stores a feature mean representing the average of the image feature quantities. The feature mean is a vector obtained in advance by the model learning device 40 by averaging the plurality of feature vectors obtained from the plurality of learning images.
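The feature mean described above is an element-wise average over the learning feature vectors, which can be sketched as follows. The function name `feature_mean` and the sample vectors are hypothetical.

```python
from typing import List, Sequence


def feature_mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must have the same length")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Two hypothetical 4-dimensional feature vectors (N = 2 feature points each).
learning_vectors = [
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 22.0, 34.0, 44.0],
]
mean_vector = feature_mean(learning_vectors)
```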

Here, let $P_b$ denote the matrix representing the individual difference basis and $P_w$ the matrix representing the intra-individual variation basis, and take the feature mean to be the mean of the plurality of feature vectors. Further, let $x$ denote the feature vector of the mouth image whose input was accepted, $p_b$ the individual difference parameter relating to the individual difference basis $P_b$, and $p_w$ the intra-individual variation parameter relating to the intra-individual variation basis $P_w$. The following relationship then holds, in which $P_b p_b$ represents the individual difference component and $P_w p_w$ represents the intra-individual variation component. The individual difference parameter $p_b$ is given by equation (2), and the intra-individual variation parameter $p_w$ is given by equation (4). Note that $p_b$ does not necessarily have to be obtained in order to obtain $p_w$.

When $P_b$ is an orthogonal basis, equation (2) above is derived as follows.

The intra-individual variation parameter $p_w$ obtained by equation (4) is expressed using the intra-individual variation basis $P_w$ and the individual difference basis $P_b$. As described later, however, the intra-individual variation basis $P_w$ is obtained from the image feature quantities of the learning images after the individual difference component has been removed. The intra-individual variation parameter $p_w$ is therefore a parameter of the feature vector (the intra-individual variation component) that remains after the influence of individual differences has been removed from the feature vector $x$ of the input mouth image.
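As a minimal sketch of the estimation described above, the following Python code computes $p_w$ from a feature vector. It assumes forms for the relations that the text refers to only by number: $x = \bar{x} + P_b p_b + P_w p_w$, $p_b = P_b^{T}(x - \bar{x})$, and $p_w = P_w^{T}(x - \bar{x} - P_b p_b)$. Here $\bar{x}$ is an assumed symbol for the feature mean, both bases are assumed to have orthonormal columns, and the columns of $P_w$ are assumed to be orthogonal to those of $P_b$. The function names and the toy values are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Basis = Sequence[Sequence[float]]  # each entry is one basis vector (a column of P)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(basis: Basis, v: Sequence[float]) -> Vector:
    """Coefficients P^T v, valid as a projection when the columns are orthonormal."""
    return [dot(column, v) for column in basis]


def reconstruct(basis: Basis, coeffs: Sequence[float], dim: int) -> Vector:
    """Linear combination P p of the basis vectors."""
    out = [0.0] * dim
    for column, c in zip(basis, coeffs):
        for i in range(dim):
            out[i] += c * column[i]
    return out


def estimate_intra_individual_parameter(
    x: Sequence[float], mean: Sequence[float], basis_b: Basis, basis_w: Basis
) -> Vector:
    """Assumed form: p_w = P_w^T (x - mean - P_b P_b^T (x - mean))."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    p_b = project(basis_b, centered)                 # individual difference parameter
    individual = reconstruct(basis_b, p_b, len(centered))
    residual = [c - d for c, d in zip(centered, individual)]
    return project(basis_w, residual)                # intra-individual variation parameter


# Toy 3-dimensional example with one basis vector per basis (hypothetical values).
toy_mean = [1.0, 1.0, 1.0]
toy_basis_b = [[1.0, 0.0, 0.0]]
toy_basis_w = [[0.0, 1.0, 0.0]]
p_w = estimate_intra_individual_parameter([3.0, 4.0, 1.0], toy_mean, toy_basis_b, toy_basis_w)
```

Because the residual is formed by subtracting $P_b P_b^{T}(x - \bar{x})$ directly, the two steps can be folded into a single matrix applied to $x - \bar{x}$, which is one way to read the remark that $p_b$ need not be obtained separately.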

Next, the target section determination unit 20 determines, based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit 18, whether the mouth in the input mouth image is in a speech section. Specifically, the determination is made using a discrimination model stored in advance in the storage unit 24. This discrimination model is learned in advance by the model learning device 40 from a plurality of learning images, and determines from the intra-individual variation parameter whether a speech section applies.

The output unit 22 outputs a signal representing the determination result of the target section determination unit 20. Any signal that distinguishes a speech section from a non-speech section may be used, for example a signal representing either "0" or "1".
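The determination and output steps above can be sketched as follows. The form of the discrimination model is not specified in this passage, so the sketch assumes a simple linear discriminant on the intra-individual variation parameter; the weights, bias, and function name are hypothetical, and the mapping of 1 to a speech section and 0 to a non-speech section is likewise an assumption.

```python
from typing import Sequence


def determine_speech_signal(
    p_w: Sequence[float], weights: Sequence[float], bias: float
) -> int:
    """Return 1 for a speech section and 0 otherwise.

    Assumed model: a linear discriminant, speech if bias + w . p_w >= 0.
    """
    if len(p_w) != len(weights):
        raise ValueError("parameter and weight vectors must have the same length")
    score = bias + sum(w * p for w, p in zip(weights, p_w))
    return 1 if score >= 0.0 else 0


# Hypothetical learned parameters for a one-dimensional p_w.
toy_weights = [1.5]
toy_bias = -2.0
signal_open = determine_speech_signal([3.0], toy_weights, toy_bias)    # score = 2.5
signal_closed = determine_speech_signal([0.0], toy_weights, toy_bias)  # score = -2.0
```

Since the decision uses only the parameter from a single image, it needs no buffering of earlier frames, which is consistent with the stated aim of a short processing time.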

FIG. 2 is a block diagram showing an example of the configuration of a computer that functions as the target section determination device 10 according to the first embodiment.
As shown in FIG. 2, the target section determination device 10 according to this embodiment is configured as a general-purpose computer including a CPU (Central Processing Unit) 10A and an internal memory 10B.

The internal memory 10B stores the target section determination processing program according to this embodiment. This program may, for example, be installed in the target section determination device 10 in advance. Alternatively, the program may be stored on a non-volatile storage medium or distributed over a network and then installed in the target section determination device 10 as appropriate. Examples of non-volatile storage media include a CD-ROM (Compact Disc Read Only Memory), a magneto-optical disk, an HDD (Hard Disk Drive), a DVD-ROM (Digital Versatile Disc Read Only Memory), a flash memory, and a memory card.

The CPU 10A functions as the input unit 12, the feature point detection unit 14, the image feature quantity extraction unit 16, the intra-individual variation parameter estimation unit 18, the target section determination unit 20, and the output unit 22 shown in FIG. 1. The CPU 10A functions as each of these units by reading the target section determination processing program from the internal memory 10B and executing it. The CPU 10A is also connected to the camera 30, an external system 32, and an external storage device 34. The external system 32 is, for example, a dialogue system; it receives the signal representing the determination result from the CPU 10A (the output unit 22) and performs various kinds of processing. The external storage device 34 stores the various data used in the target section determination processing according to this embodiment.

Next, the model learning device 40 for learning the discrimination model is described.

FIG. 3 is a block diagram showing an example of the functional configuration of the model learning device 40 according to the first embodiment.
As shown in FIG. 3, the model learning device 40 according to this embodiment includes an input unit 42, a feature point detection unit 44, an image feature quantity extraction unit 46, an individual difference basis calculation unit 48, an intra-individual variation basis calculation unit 50, and a model generation unit 52.

FIG. 4 is a block diagram showing an example of the configuration of a computer that functions as the model learning device 40 according to the first embodiment.
As shown in FIG. 4, the model learning device 40 according to the present embodiment is configured as a general-purpose computer including a CPU 40A and an internal memory 40B.

内部メモリ４０Ｂには、本実施形態に係るモデル学習処理プログラムが格納されている。このモデル学習処理プログラムは、例えば、モデル学習装置４０に予めインストールされていてもよい。また、モデル学習処理プログラムは、不揮発性の記憶媒体に記憶して、又はネットワークを介して配布し、モデル学習装置４０に適宜インストールすることで実現してもよい。なお、不揮発性の記憶媒体の例としては、上記と同様に、ＣＤ-ＲＯＭ、光磁気ディスク、ＨＤＤ、ＤＶＤ-ＲＯＭ、フラッシュメモリ、メモリカード等が想定される。 The internal memory 40B stores a model learning processing program according to the present embodiment. The model learning processing program may be installed in advance in the model learning device 40, for example. In addition, the model learning processing program may be realized by storing the program in a non-volatile storage medium or distributing it via a network and installing the program in the model learning device 40 as appropriate. As an example of the non-volatile storage medium, a CD-ROM, a magneto-optical disk, an HDD, a DVD-ROM, a flash memory, a memory card, etc. are assumed as described above.

The CPU 40A functions as the input unit 42, the feature point detection unit 44, the image feature quantity extraction unit 46, the individual difference basis calculation unit 48, the intra-individual variation base calculation unit 50, and the model generation unit 52 shown in FIG. 3. The CPU 40A functions as each of these units by reading the model learning processing program from the internal memory 40B and executing it. The CPU 40A is also connected to the external storage device 34.

図５（Ａ）は、本実施形態に係る口を閉じた状態の学習用画像の一例を示す図である。図５（Ｂ）は、本実施形態に係る口を閉じた状態及び口を開いた状態を含む学習用画像の一例を示す図である。 FIG. 5A is a view showing an example of a learning image in a state in which the mouth is closed according to the present embodiment. FIG. 5B is a view showing an example of a learning image including the state in which the mouth is closed and the state in which the mouth is open according to the present embodiment.

Each of the learning images shown in FIGS. 5(A) and 5(B) is, as an example, a square image of (w + 10) × (w + 10) cut out around the midpoint between the left and right mouth corners, where w is the difference between the coordinate values of the left and right mouth corners.
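The cropping rule above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `crop_mouth_square`, the use of NumPy, the reading of w as the horizontal pixel distance between the mouth corners, and the edge padding near the image border are not taken from the embodiment.

```python
import numpy as np

def crop_mouth_square(image, left_corner, right_corner, margin=10):
    """Cut a (w + margin) x (w + margin) square centred on the mouth-corner midpoint.

    left_corner / right_corner are (x, y) pixel coordinates inside the image.
    Taking w as the horizontal distance between the corners is an assumption.
    """
    (xl, yl), (xr, yr) = left_corner, right_corner
    w = abs(int(round(xr - xl)))
    side = w + margin
    cx = int(round((xl + xr) / 2.0))
    cy = int(round((yl + yr) / 2.0))
    x0 = cx - side // 2
    y0 = cy - side // 2
    # Pad with edge pixels so the crop keeps its size near the border
    # (border handling is not specified in the text; this is an assumption).
    pad = side
    pad_width = ((pad, pad), (pad, pad)) + ((0, 0),) * (image.ndim - 2)
    padded = np.pad(image, pad_width, mode="edge")
    return padded[y0 + pad : y0 + pad + side, x0 + pad : x0 + pad + side]
```

For mouth corners at (40, 50) and (70, 52), w is 30 and the crop is a 40 × 40 square whose top-left pixel is the original pixel at row 31, column 35.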

The external storage device 34 stores, as learning images, mouth images including a subject's mouth, such as those shown in FIGS. 5(A) and 5(B). The learning images containing only the closed-mouth state shown in FIG. 5(A) (hereinafter, closed-mouth images) are used to calculate the individual difference basis. The images containing both the closed-mouth state and the open-mouth state shown in FIG. 5(B) (hereinafter, whole images) are used to calculate the intra-individual variation base.

まず、複数の学習用画像から個人差基底を求める方法について説明する。この場合、図５（Ａ）に示す複数の口閉じ画像が用いられる。 First, a method of obtaining an individual difference basis from a plurality of learning images will be described. In this case, a plurality of closed-mouth images shown in FIG. 5 (A) are used.

入力部４２は、外部記憶装置３４から複数（例えばＭ個）の口閉じ画像の入力を受け付ける。 The input unit 42 receives an input of a plurality of (for example, M) mouth closed images from the external storage device 34.

FIG. 6 is a diagram showing an example of a plurality of feature points obtained from a learning image according to the present embodiment.
The example in FIG. 6 shows the feature points in a learning image with the mouth open, but the same number of feature points at the same locations is also used for a learning image with the mouth closed.

The feature point detection unit 44 detects a mouth region from the closed-mouth image input by the input unit 42 and, as shown in FIG. 6 as an example, detects the coordinates (x, y) of each of a plurality of feature points from the detected mouth region.

画像特徴量抽出部４６は、特徴点検出部４４により検出された複数の特徴点から、口閉じ画像の画像特徴量を抽出する。口閉じ画像の画像特徴量としては、一例として、サンプリングした特徴点の数をＮ個とした場合、以下の２Ｎ次元の特徴ベクトルとして求められる。 The image feature quantity extraction unit 46 extracts the image feature quantity of the mouth closed image from the plurality of feature points detected by the feature point detection unit 44. As an example, when the number of sampled feature points is N, the image feature quantity of the mouth closed image is obtained as the following 2N-dimensional feature vector.

$$[\,x_1\ y_1\ \cdots\ x_N\ y_N\,]^{T}$$

Because the 2N-dimensional feature vector described above is obtained for each of the M closed-mouth images, the individual difference basis calculation unit 48 according to the present embodiment forms a 2N × M matrix (hereinafter, matrix $A_1$) whose columns are the feature vectors of the M closed-mouth images. It then computes the mean of each row of $A_1$, which gives the feature quantity average over the M closed-mouth images.
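As a minimal sketch of how the 2N-dimensional feature vectors, the 2N × M matrix, and the row-wise average fit together, the following NumPy code may help. The function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def feature_vector(points):
    """Flatten N (x, y) feature points into [x_1, y_1, ..., x_N, y_N]."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 2:
        raise ValueError("expected N points of the form (x, y)")
    return pts.reshape(-1)

def feature_matrix(points_per_image):
    """Stack one feature vector per image as the columns of a (2N, M) matrix."""
    return np.stack([feature_vector(p) for p in points_per_image], axis=1)

def row_mean(A):
    """Mean of each row, i.e. the feature quantity average over the M images."""
    return A.mean(axis=1)
```

With N = 2 feature points and M = 3 images, `feature_matrix` returns a 4 × 3 matrix and `row_mean` returns a vector of length 4.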

The individual difference basis calculation unit 48 performs principal component analysis on the matrix $A_1$ obtained above, normalizes the resulting eigenvectors to unit length, sorts them in descending order of eigenvalue, and takes the $n_1$ eigenvectors with the largest eigenvalues as the columns of a 2N × $n_1$ matrix. This 2N × $n_1$ matrix is the individual difference basis $P_b$. The number $n_1$ may be chosen, for example, so that the contribution ratio of the eigenvalues is at least a fixed proportion (for example, 80%), or it may be fixed empirically.
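The basis construction described above can be sketched as follows. This is an illustration rather than the embodiment's implementation: the function name `pca_basis` is hypothetical, the covariance is computed after subtracting the row means, and the "contribution ratio" is read here as the cumulative share of the total eigenvalue sum. The input is assumed to have nonzero variance.

```python
import numpy as np

def pca_basis(A, ratio=0.8):
    """Return (basis, eigenvalues) for a (2N, M) feature matrix A.

    Columns of the basis are unit-length eigenvectors of the sample covariance,
    sorted by decreasing eigenvalue and truncated at the smallest count whose
    cumulative contribution ratio reaches `ratio`.
    """
    centered = A - A.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / max(A.shape[1] - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending order, unit-norm vectors
    order = np.argsort(eigvals)[::-1]            # re-sort to descending order
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n = int(np.searchsorted(cumulative, ratio)) + 1
    n = min(n, len(eigvals))
    return eigvecs[:, :n], eigvals[:n]
```

Fixing the number of components empirically, the other option mentioned in the text, would simply replace the ratio test with a constant.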

次に、複数の学習用画像から個人内変動基底を求める方法について説明する。この場合、図５（Ｂ）に示す複数の全体画像が用いられる。 Next, a method of obtaining an intra-individual variation base from a plurality of learning images will be described. In this case, a plurality of whole images shown in FIG. 5 (B) are used.

As in the case of the individual difference basis, the 2N-dimensional feature vector described above is obtained for each of the M whole images, so the intra-individual variation base calculation unit 50 according to the present embodiment forms a 2N × M matrix (hereinafter, matrix $A_2$) whose columns are the feature vectors of the M whole images. It then computes the mean of each row of $A_2$, which gives the feature quantity average over the M whole images. In the present embodiment, either the feature quantity average of the whole images or that of the closed-mouth images may be used, but the average of the whole images, which covers a variety of mouth states, is preferable.

Based on the image feature quantities of the M whole images (matrix $A_2$) and the individual difference basis $P_b$ calculated by the individual difference basis calculation unit 48, the intra-individual variation base calculation unit 50 removes the individual difference component from the image feature quantities of the M whole images (matrix $A_2$) using the individual difference basis. As an example, the following equation (5) is applied.

$$A_3 = A_2 - \left(A_2 \times P_b \times P_b^{T}\right) \qquad (5)$$

The matrix $A_3$ calculated according to equation (5) is a 2N × M matrix of the same dimensions as the matrix $A_2$; it is the matrix $A_2$ with the individual difference component removed, so that the influence of the individual difference basis is eliminated. In this example, $A_2 \times P_b \times P_b^{T}$ is the individual difference component that is removed.
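The removal step of equation (5) can be sketched in NumPy as below. One point of layout needs care: with one feature vector per row, the projection onto the span of an orthonormal basis reads $A_2 P_b P_b^{T}$ as written in equation (5), whereas with one feature vector per column (the 2N × M layout used in the text) the same operation is computed as $P_b P_b^{T} A_2$. The sketch assumes the column layout; the function name is hypothetical.

```python
import numpy as np

def remove_individual_difference(A2, Pb):
    """Subtract from each column of A2 its component lying in span(Pb).

    A2: (2N, M) features, one column per image.
    Pb: (2N, n1) individual difference basis with orthonormal columns.
    """
    return A2 - Pb @ (Pb.T @ A2)
```

After the subtraction every column of the result is orthogonal to every column of `Pb`, which is the sense in which the individual difference component has been removed.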

The intra-individual variation base calculation unit 50 then calculates the intra-individual variation base from the image feature quantities with the individual difference component removed (matrix $A_3$). Specifically, it performs principal component analysis on the matrix $A_3$, normalizes the resulting eigenvectors to unit length, sorts them in descending order of eigenvalue, and takes the $n_2$ eigenvectors with the largest eigenvalues as the columns of a 2N × $n_2$ matrix. This 2N × $n_2$ matrix is the intra-individual variation base $P_w$. The number $n_2$ is determined in the same way as $n_1$: for example, it is chosen so that the contribution ratio of the eigenvalues is at least a fixed proportion (for example, 80%), or it is fixed empirically.
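Putting the two basis calculations together, a compact sketch of the whole procedure might look like the following. It uses a singular value decomposition of the mean-centred matrix, which yields the same unit-length principal directions as an eigendecomposition of the covariance. The function names, the column layout, and the 80% default are assumptions of this sketch.

```python
import numpy as np

def top_components(A, ratio=0.8):
    """Principal directions of a (2N, M) matrix, kept up to a contribution ratio."""
    centered = A - A.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)  # s is descending
    var = s ** 2
    n = int(np.searchsorted(np.cumsum(var) / var.sum(), ratio)) + 1
    return U[:, : min(n, U.shape[1])]

def learn_bases(A1, A2, ratio=0.8):
    """A1: closed-mouth features, A2: whole-image features, both (2N, M)."""
    Pb = top_components(A1, ratio)          # individual difference basis
    A3 = A2 - Pb @ (Pb.T @ A2)              # individual difference removed
    Pw = top_components(A3, ratio)          # intra-individual variation base
    return Pb, Pw
```

Because the columns of `A3` contain no component along `Pb`, the resulting `Pw` is orthogonal to `Pb`, so the two bases describe separate directions of variation.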

The model generation unit 52 estimates, for each of a plurality of model learning images, the intra-individual variation parameter relating to the intra-individual variation base, based on the image feature quantity extracted from that image, the individual difference basis calculated by the individual difference basis calculation unit 48, and the intra-individual variation base calculated by the intra-individual variation base calculation unit 50. Like the learning images described above, a model learning image is a mouth image including a subject's mouth, and it carries information, for example a flag, indicating whether or not it belongs to a speech section. Provided that this information is attached, the whole images used to calculate the intra-individual variation base (see FIG. 5(B)) may be reused as model learning images, or different images may be used. The intra-individual variation parameter is calculated using equation (4) above.

The model generation unit 52 then learns a model for determining, from an intra-individual variation parameter, whether or not an image belongs to a speech section. The learning is based on the intra-individual variation parameter estimated for each model learning image and on the flag attached to each model learning image indicating whether or not it belongs to a speech section. A discriminant model for determining speech sections is thereby generated.
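As one possible illustration of learning a discriminant model from intra-individual variation parameters and speech flags, the sketch below fits a logistic regression by gradient descent. The choice of logistic regression, the function names, and the hyperparameters are assumptions made for this example; this passage does not specify the type of classifier. The parameters are taken as given inputs, since they are computed by equation (4).

```python
import numpy as np

def train_logistic(params, flags, lr=0.1, epochs=500):
    """Fit weights for P(speech | parameter) by batch gradient descent.

    params: (M, n2) intra-individual variation parameters, one row per image.
    flags:  (M,) values 1 (speech section) or 0 (non-speech section).
    """
    params = np.asarray(params, dtype=float)
    flags = np.asarray(flags, dtype=float)
    X = np.hstack([params, np.ones((params.shape[0], 1))])  # append bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - flags) / len(flags)
    return w

def is_speech(w, param):
    """True when the learned model places the parameter on the speech side."""
    return float(np.append(np.asarray(param, dtype=float), 1.0) @ w) > 0.0
```

Any other two-class learner trained on the same pairs of parameter and flag would play the same role in the flow.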

Depending on how the feature points are chosen, some feature points may show a large change in pixel value between the closed-mouth state and the open-mouth state, while others show only a small change. In this case, the model generation unit 52 builds into the discriminant model a weighting coefficient that emphasizes the degree of change for feature points with a large change in pixel value, and a weighting coefficient that reduces the degree of change for feature points with a small change in pixel value.

本実施形態によれば、１つの入力画像から発話区間であるか否かを判別できるため、連続した複数の画像を用いる場合と比べて、処理時間を短縮することができる。また、入力画像から得られる画像特徴量から個人差の影響が除去されているため、発話区間と非発話区間とを精度良く判別することができる。 According to this embodiment, since it can be determined whether or not it is a speech section from one input image, the processing time can be shortened as compared with the case where a plurality of continuous images are used. Further, since the influence of the individual difference is removed from the image feature amount obtained from the input image, it is possible to accurately discriminate between the speech section and the non-speech section.

次に、図７を参照して、第１の実施形態に係る目的区間判別装置１０の作用を説明する。なお、図７は、第１の実施形態に係る目的区間判別処理プログラムの処理の流れの一例を示すフローチャートである。 Next, with reference to FIG. 7, the operation of the target segment determination device 10 according to the first embodiment will be described. FIG. 7 is a flowchart showing an example of the process flow of the target segment determination processing program according to the first embodiment.

まず、図７のステップ１００では、入力部１２が、対象とされる個人の口をカメラ３０で撮影した口画像の入力を受け付ける。 First, in step 100 of FIG. 7, the input unit 12 receives an input of a mouth image obtained by photographing the mouth of a target individual with the camera 30.

ステップ１０２では、特徴点検出部１４が、入力部１２により入力を受け付けた口画像から、口領域を検出し、検出した口領域から複数の特徴点を検出する。 In step 102, the feature point detection unit 14 detects a mouth area from the mouth image for which the input unit 12 receives an input, and detects a plurality of feature points from the detected mouth area.

ステップ１０４では、画像特徴量抽出部１６が、特徴点検出部１４により検出された複数の特徴点から、画像特徴量として、例えば、２Ｎ次元の特徴ベクトルを抽出する。 In step 104, the image feature quantity extraction unit 16 extracts, for example, a 2N-dimensional feature vector as an image feature quantity from the plurality of feature points detected by the feature point detection unit 14.

ステップ１０６では、個人内変動パラメータ推定部１８が、画像特徴量抽出部１６により抽出された画像特徴量、並びに、記憶部２４に予め記憶されている個人差基底及び個人内変動基底に基づいて、上記式（４）に従って、個人内変動基底に関する個人内変動パラメータを推定する。 In step 106, the intra-individual variation parameter estimation unit 18 determines, based on the image feature quantity extracted by the image feature quantity extraction unit 16 and the individual difference basis and the intra-individual variation basis stored in advance in the storage unit 24. The intra-individual variation parameter relating to the intra-individual variation base is estimated according to the above equation (4).

In step 108, the target section determination unit 20 uses the discriminant model stored in advance in the storage unit 24 to determine, from the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit 18, whether or not the mouth in the received mouth image is in a speech section. If a speech section is determined (affirmative determination), the process proceeds to step 110; if a non-speech section is determined (negative determination), the process proceeds to step 112.

ステップ１１０では、出力部２２が、目的区間判別部２０による判別結果を表す信号として、発話区間を表す信号を出力し、一連の目的区間判別処理プログラムの処理を終了する。 In step 110, the output unit 22 outputs a signal representing an utterance period as a signal representing the determination result by the target period determination unit 20, and ends the processing of the series of target period determination processing programs.

一方、ステップ１１２では、出力部２２が、目的区間判別部２０による判別結果を表す信号として、非発話区間を表す信号を出力し、一連の目的区間判別処理プログラムの処理を終了する。 On the other hand, in step 112, the output unit 22 outputs a signal representing a non-speech section as a signal representing the determination result by the target section determination unit 20, and ends the processing of the series of target section determination processing programs.
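The flow of steps 100 to 112 can be summarized in a short sketch. The four callables are placeholders supplied by the caller; their names and signatures are hypothetical and stand in for the feature point detection, feature extraction, parameter estimation by equation (4), and the stored discriminant model.

```python
def determine_target_section(image, detect_points, extract_feature,
                             estimate_parameter, model):
    """Run the single-image determination flow and return the output signal."""
    points = detect_points(image)             # step 102: feature points
    feature = extract_feature(points)         # step 104: image feature quantity
    parameter = estimate_parameter(feature)   # step 106: variation parameter
    if model(parameter):                      # step 108: discriminant model
        return "speech"                       # step 110
    return "non-speech"                       # step 112
```

Because the decision uses only one input image, no buffering of consecutive frames is needed in this flow.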

次に、図８を参照して、第１の実施形態に係るモデル学習装置４０の作用を説明する。なお、図８は、第１の実施形態に係るモデル学習処理プログラムの処理の流れの一例を示すフローチャートである。 Next, with reference to FIG. 8, an operation of the model learning device 40 according to the first embodiment will be described. FIG. 8 is a flowchart illustrating an example of the process flow of the model learning processing program according to the first embodiment.

First, in step 200 of FIG. 8, the input unit 42 receives M learning images (closed-mouth images) from the external storage device 34 one at a time. A closed-mouth image is, for example, the mouth image shown in FIG. 5(A).

ステップ２０２では、特徴点検出部４４が、入力部４２により入力を受け付けた口閉じ画像から、口領域を検出し、検出した口領域から複数の特徴点を検出する。 In step 202, the feature point detection unit 44 detects a mouth area from the mouth closed image whose input is received by the input unit 42, and detects a plurality of feature points from the detected mouth area.

ステップ２０４では、画像特徴量抽出部４６が、特徴点検出部４４により検出された複数の特徴点から、画像特徴量として、例えば、２Ｎ次元の特徴ベクトルを抽出する。 In step 204, the image feature quantity extraction unit 46 extracts, for example, a 2N-dimensional feature vector as an image feature quantity from the plurality of feature points detected by the feature point detection unit 44.

ステップ２０６では、画像特徴量抽出部４６が、Ｍ個の口閉じ画像の全てについて画像特徴量を抽出する処理が終了したか否かを判定する。Ｍ個の口閉じ画像の全てについて処理が終了したと判定した場合（肯定判定の場合）、ステップ２０８に移行する。一方、Ｍ個の口閉じ画像の全てについては処理が終了していないと判定した場合（否定判定の場合）、ステップ２００に戻り処理を繰り返す。 In step 206, the image feature quantity extraction unit 46 determines whether the process of extracting the image feature quantity has been completed for all of the M closed mouth images. If it is determined that the processing is completed for all the M mouth closed images (in the case of a positive determination), the process proceeds to step 208. On the other hand, when it is determined that the process is not completed for all the M mouth closed images (in the case of a negative determination), the process returns to step 200 and the process is repeated.

In step 208, the individual difference basis calculation unit 48 forms the matrix $A_1$ from the M image feature quantities extracted by the image feature quantity extraction unit 46, performs principal component analysis on $A_1$ to calculate the individual difference basis $P_b$, and the process proceeds to step 210.

次に、ステップ２１０では、入力部４２が、外部記憶装置３４からＭ個の学習用画像（全体画像）の入力を順番に受け付ける。なお、全体画像とは、例えば、図５（Ｂ）に示される口画像である。 Next, in step 210, the input unit 42 sequentially receives input of M learning images (whole images) from the external storage device 34. Note that the whole image is, for example, a mouth image shown in FIG. 5 (B).

ステップ２１２では、特徴点検出部４４が、入力部４２により入力を受け付けた全体画像から、口領域を検出し、検出した口領域から複数の特徴点を検出する。 In step 212, the feature point detection unit 44 detects a mouth area from the entire image received by the input unit 42 and detects a plurality of feature points from the detected mouth area.

ステップ２１４では、画像特徴量抽出部４６が、特徴点検出部４４により検出された複数の特徴点から、画像特徴量として、例えば、２Ｎ次元の特徴ベクトルを抽出する。 In step 214, the image feature quantity extraction unit 46 extracts, for example, a 2N-dimensional feature vector as an image feature quantity from the plurality of feature points detected by the feature point detection unit 44.

ステップ２１６では、画像特徴量抽出部４６が、Ｍ個の全体画像の全てについて画像特徴量を抽出する処理が終了したか否かを判定する。Ｍ個の全体画像の全てについて処理が終了したと判定した場合（肯定判定の場合）、ステップ２１８に移行する。一方、Ｍ個の全体画像の全てについては処理が終了していないと判定した場合（否定判定の場合）、ステップ２１０に戻り処理を繰り返す。 In step 216, the image feature quantity extraction unit 46 determines whether the process of extracting the image feature quantity has been completed for all of the M total images. If it is determined that the processing is completed for all of the M total images (in the case of a positive determination), the process moves to step 218. On the other hand, if it is determined that the process has not been completed for all of the M entire images (in the case of a negative determination), the process returns to step 210 to repeat the process.

In step 218, the intra-individual variation base calculation unit 50 forms the matrix $A_2$ from the M image feature quantities extracted by the image feature quantity extraction unit 46 and, according to equation (5) above, removes the individual difference component from $A_2$ using the individual difference basis calculated by the individual difference basis calculation unit 48 to obtain the matrix $A_3$. It then performs principal component analysis on $A_3$ to calculate the intra-individual variation base $P_w$, and the process proceeds to step 220.

次に、ステップ２２０では、入力部４２が、外部記憶装置３４からＭ個のモデル学習用画像の入力を順番に受け付ける。 Next, in step 220, the input unit 42 sequentially receives inputs of M model learning images from the external storage device 34.

ステップ２２２では、特徴点検出部４４が、入力部４２により入力を受け付けたモデル学習用画像から、口領域を検出し、検出した口領域から複数の特徴点を検出する。 In step 222, the feature point detection unit 44 detects a mouth area from the model learning image which has received an input from the input unit 42, and detects a plurality of feature points from the detected mouth area.

In step 224, the image feature quantity extraction unit 46 extracts an image feature quantity, for example a 2N-dimensional feature vector, from the plurality of feature points detected by the feature point detection unit 44. When the whole images used to calculate the intra-individual variation base are reused as the model learning images, steps 220 to 224 are omitted.

ステップ２２６では、モデル生成部５２が、画像特徴量抽出部４６により抽出された画像特徴量と、個人差基底算出部４８により算出された個人差基底と、個人内変動基底算出部５０により算出された個人内変動基底とに基づいて、上記式（４）に従って、個人内変動パラメータを推定する。 In step 226, the model generation unit 52 calculates the image feature quantity extracted by the image feature quantity extraction unit 46, the individual difference basis calculated by the individual difference basis calculation unit 48, and the intra-personal variation basis calculation unit 50. The intra-individual variation parameter is estimated according to the above equation (4) based on the intra-individual variation base.

ステップ２２８では、モデル生成部５２が、Ｍ個のモデル学習用画像の全てについて個人内変動パラメータを推定する処理が終了したか否かを判定する。Ｍ個のモデル学習用画像の全てについて処理が終了したと判定した場合（肯定判定の場合）、ステップ２３０に移行する。一方、Ｍ個のモデル学習用画像の全てについては処理が終了していないと判定した場合（否定判定の場合）、ステップ２２０に戻り処理を繰り返す。 In step 228, the model generation unit 52 determines whether or not the process of estimating the intra-individual variation parameter has been completed for all of the M model learning images. If it is determined that the processing is completed for all of the M model learning images (in the case of an affirmative determination), the process proceeds to step 230. On the other hand, if it is determined that the process has not been completed for all of the M model learning images (in the case of a negative determination), the process returns to step 220 to repeat the process.

In step 230, the model generation unit 52 learns a model for determining, from an intra-individual variation parameter, whether or not an image belongs to a speech section, based on the intra-individual variation parameter estimated for each of the M model learning images and on the flag attached to each model learning image indicating whether or not it belongs to a speech section. A discriminant model for determining speech sections is thereby generated. After step 230, the model learning processing program ends.

なお、上記の実施形態では、特徴点の座標値を用いて、口画像の特徴ベクトルを表したが、これに限定されるものではなく、口画像の各画素値からなる特徴ベクトルを用いてもよい。この場合には、図９（Ａ）〜（Ｄ）に示すような個人差基底が得られる。図９（Ａ）〜（Ｄ）は、本実施形態に係るモデル学習装置４０により得られる解像度毎の個人差基底の一例を示す図である。 In the above embodiment, although the feature vector of the mouth image is represented using the coordinate values of the feature points, the present invention is not limited to this, and a feature vector consisting of each pixel value of the mouth image may be used. Good. In this case, an individual difference basis as shown in FIGS. 9A to 9D is obtained. FIGS. 9A to 9D are diagrams showing an example of the individual difference basis for each resolution obtained by the model learning device 40 according to the present embodiment.

FIG. 9(A) shows the case of 8 × 8 pixels, FIG. 9(B) the case of 16 × 16 pixels, FIG. 9(C) the case of 32 × 32 pixels, and FIG. 9(D) the case of 64 × 64 pixels.
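A feature vector made of pixel values at a chosen resolution could be formed as in the sketch below. How the mouth image is reduced to 8 × 8, 16 × 16, 32 × 32, or 64 × 64 pixels is not stated here, so the block averaging, the grayscale square input, and the function name are assumptions of this example.

```python
import numpy as np

def pixel_feature(gray, size):
    """Block-average a square grayscale image to size x size and flatten it."""
    h, w = gray.shape
    if h % size or w % size:
        raise ValueError("image sides must be multiples of the target size")
    # Axes: (block row, row within block, block column, column within block).
    blocks = gray.reshape(size, h // size, size, w // size)
    return blocks.mean(axis=(1, 3)).reshape(-1)
```

The resulting vectors have 64, 256, 1024, and 4096 elements for the four resolutions, and they can be stacked as matrix columns in place of the coordinate-based 2N-dimensional vectors.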

図１０（Ａ）〜（Ｄ）は、本実施形態に係るモデル学習装置４０により得られる解像度毎の個人内変動基底の一例を示す図である。 FIGS. 10A to 10D are diagrams showing an example of the intra-individual variation base for each resolution obtained by the model learning device 40 according to the present embodiment.

図１０（Ａ）は、８×８の画素で表した場合であり、図１０（Ｂ）は、１６×１６の画素で表した場合であり、図１０（Ｃ）は、３２×３２の画素で表した場合であり、図１０（Ｄ）は、６４×６４の画素で表した場合である。 FIG. 10 (A) shows the case of 8 × 8 pixels, FIG. 10 (B) shows the case of 16 × 16 pixels, and FIG. 10 (C) shows 32 × 32 pixels. FIG. 10D shows the case of 64 × 64 pixels.

[Second Embodiment]
FIG. 11 is a block diagram showing an example of the functional configuration of the target segment determination device 11 according to the second embodiment.
As shown in FIG. 11, the target segment determination device 11 according to the second embodiment differs from the target segment determination device 10 according to the first embodiment in that it includes a voice feature quantity extraction unit 26. Components given the same reference numerals are therefore not described again.

音声特徴量抽出部２６は、マイク３６と接続されている。音声特徴量抽出部２６は、マイク３６から入力される、対象とされる個人の音声から音声特徴量を抽出する。 The audio feature quantity extraction unit 26 is connected to the microphone 36. The voice feature quantity extraction unit 26 extracts a voice feature quantity from the voice of the targeted individual input from the microphone 36.

The target section determination unit 20 determines whether or not the mouth of the individual in the mouth image is in a speech section, based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit 18 and the voice feature quantity extracted by the voice feature quantity extraction unit 26. As an example of the voice feature quantity, MSLS (Mel Scale Logarithmic Spectrum) is extracted. MSLS uses a spectral feature as the feature quantity for speech recognition and is obtained by applying an inverse discrete cosine transform to MFCC (Mel Frequency Cepstrum Coefficient). A well-known technique is used to determine from this voice feature quantity whether or not a speech section is present. In this case, a speech section is determined only when the voice feature quantity indicates a speech section and the discriminant model also determines a speech section from the intra-individual variation parameter.

また、単に音声の強度を用いてもよい。この場合、音声の強度が所定レベル以上であり、かつ、個人内変動パラメータが判別モデルにより発話区間と判別された場合に、発話区間と判別される。 Also, the voice intensity may be used simply. In this case, when the voice strength is equal to or higher than a predetermined level and the intra-personal variation parameter is determined to be an utterance period by the discrimination model, it is determined to be an utterance period.
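The combination rule of the second embodiment amounts to a logical AND of the audio decision and the image-based decision, which can be sketched as follows. Measuring voice intensity as the root-mean-square level of a block of samples is an assumption of this example, as are the function names and the threshold value used in the test.

```python
def audio_is_speech_by_intensity(samples, threshold):
    """True when the RMS level of the samples is at or above the threshold."""
    if not samples:
        return False
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms >= threshold

def fuse_decisions(visual_is_speech, audio_is_speech):
    """A speech section is reported only when both modalities agree."""
    return bool(visual_is_speech) and bool(audio_is_speech)
```

With this rule, mouth movement without sound and sound without mouth movement both yield a non-speech result.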

本実施形態によれば、対象とされた個人の口を含む口画像の画像特徴量に加え、当該個人の音声の音声特徴量を用いることで、発話区間の判別をより高精度に行うことができる。 According to the present embodiment, by using the voice feature amount of the voice of the individual in addition to the image feature amount of the mouth image of the targeted individual, it is possible to more accurately determine the speech section. it can.

なお、上記の各実施形態では、顔の少なくとも一部として、口を対象とした場合について説明したが、目を対象としてもよい。目を対象とした場合、目的区間は、一例として、目を開いた状態とされる。また、顔の表情を対象としてもよい。顔の表情を対象とした場合、目的区間は、一例として、笑っている状態とされる。 In each of the above embodiments, the case where the mouth is targeted as at least a part of the face has been described, but eyes may be targeted. When the eye is targeted, the target section is, for example, in a state in which the eyes are open. In addition, facial expressions may be targeted. When targeting facial expressions, the target segment is, for example, in a smiling state.

以上、実施形態として目的区間判別装置及びモデル学習装置を例示して説明した。実施形態は、コンピュータを、目的区間判別装置、又は、モデル学習装置が備える各部として機能させるためのプログラムの形態としてもよい。実施形態は、このプログラムを記憶したコンピュータが読み取り可能な記憶媒体の形態としてもよい。 As described above, the target segment determination device and the model learning device have been described as the embodiments. The embodiment may be in the form of a program for causing a computer to function as a target section determination device or each unit included in a model learning device. The embodiment may be in the form of a computer readable storage medium storing the program.

その他、上記実施形態で説明した目的区間判別装置及びモデル学習装置の構成は、一例であり、主旨を逸脱しない範囲内において状況に応じて変更してもよい。 In addition, the configurations of the target segment determination device and the model learning device described in the above embodiment are an example, and may be changed according to the situation without departing from the scope of the present invention.

The flow of program processing described in the above embodiments is likewise an example; unnecessary steps may be deleted, new steps may be added, or the processing order may be changed without departing from the gist of the invention.

Further, the above embodiments describe the case where the processing according to the embodiments is realized by a software configuration, using a computer that executes a program, but the invention is not limited to this. An embodiment may be realized by, for example, a hardware configuration or a combination of a hardware configuration and a software configuration.

10, 11 Target segment determination device
10A CPU
10B Internal memory
12 Input unit
14 Feature point detection unit
16 Image feature quantity extraction unit
18 Intra-individual variation parameter estimation unit
20 Target segment determination unit
22 Output unit
24 Storage unit
26 Voice feature quantity extraction unit
30 Camera
32 External system
34 External storage device
36 Microphone
40 Model learning device
40A CPU
40B Internal memory
42 Input unit
44 Feature point detection unit
46 Image feature quantity extraction unit
48 Individual difference basis calculation unit
50 Intra-individual variation basis calculation unit
52 Model generation unit

Claims

A target segment determination device comprising:
an image feature quantity extraction unit that extracts an image feature quantity from an image capturing at least a part of an individual's face;
an intra-individual variation parameter estimation unit that estimates an intra-individual variation parameter relating to an intra-individual variation basis, based on the image feature quantity extracted by the image feature quantity extraction unit, a previously obtained individual difference basis for expressing an individual difference component of the image feature quantity, and a previously obtained intra-individual variation basis for expressing an intra-individual variation component of the image feature quantity; and
a target segment determination unit that determines whether the at least a part of the face is in a target segment, based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit.

The target segment determination device according to claim 1, wherein
the individual difference basis is obtained based on image feature quantities of a plurality of learning images that capture the same part as the at least a part of the face and represent a reference state of a plurality of individuals, and
the intra-individual variation basis is obtained based on image feature quantities from which the individual difference component has been removed, using the individual difference basis, from image feature quantities of a plurality of learning images that capture the same part as the at least a part of the face and represent a plurality of states of a plurality of individuals.

The target segment determination device according to claim 1 or 2, wherein the target segment determination unit determines whether the at least a part of the face is in the target segment by using a previously learned model for making that determination based on the intra-individual variation parameter.

The target segment determination device according to any one of claims 1 to 3, further comprising a voice feature quantity extraction unit that extracts a voice feature quantity from the voice of the individual, wherein
the target segment determination unit determines whether the at least a part of the face is in the target segment, based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit and the voice feature quantity extracted by the voice feature quantity extraction unit.

The target segment determination device according to any one of claims 1 to 4, wherein
the at least a part of the individual's face is the mouth, and
the target segment is a speech segment representing a state in which the mouth is open.

A model learning device comprising:
an individual difference basis calculation unit that calculates an individual difference basis for expressing an individual difference component of an image feature quantity, based on image feature quantities extracted from those learning images that represent a reference state among a plurality of states, out of learning images each capturing at least a part of an individual's face and each representing one of the plurality of states;
an intra-individual variation basis calculation unit that, based on the image feature quantities extracted from each of the learning images representing any of the plurality of states and the individual difference basis calculated by the individual difference basis calculation unit, removes the individual difference component from the image feature quantities of the learning images using the individual difference basis, and calculates an intra-individual variation basis for expressing an intra-individual variation component based on the image feature quantities from which the individual difference component has been removed; and
a model generation unit that estimates an intra-individual variation parameter relating to the intra-individual variation basis for each of a plurality of model learning images, each capturing at least a part of an individual's face and each assigned an indication of whether it is in a target segment, based on the image feature quantity extracted from each model learning image, the individual difference basis calculated by the individual difference basis calculation unit, and the intra-individual variation basis calculated by the intra-individual variation basis calculation unit, and that learns a model for determining, based on the intra-individual variation parameter, whether an image is in the target segment, using the intra-individual variation parameter estimated for each model learning image and the indication assigned to each model learning image.

A program for causing a computer to function as:
an image feature quantity extraction unit that extracts an image feature quantity from an image capturing at least a part of an individual's face;
an intra-individual variation parameter estimation unit that estimates an intra-individual variation parameter relating to an intra-individual variation basis, based on the image feature quantity extracted by the image feature quantity extraction unit, a previously obtained individual difference basis for expressing an individual difference component of the image feature quantity, and a previously obtained intra-individual variation basis for expressing an intra-individual variation component of the image feature quantity; and
a target segment determination unit that determines whether the at least a part of the face is in a target segment, based on the intra-individual variation parameter estimated by the intra-individual variation parameter estimation unit.

A program for causing a computer to function as:
an individual difference basis calculation unit that calculates an individual difference basis for expressing an individual difference component of an image feature quantity, based on image feature quantities extracted from those learning images that represent a reference state among a plurality of states, out of learning images each capturing at least a part of an individual's face and each representing one of the plurality of states;
an intra-individual variation basis calculation unit that, based on the image feature quantities extracted from each of the learning images representing any of the plurality of states and the individual difference basis calculated by the individual difference basis calculation unit, removes the individual difference component from the image feature quantities of the learning images using the individual difference basis, and calculates an intra-individual variation basis for expressing an intra-individual variation component based on the image feature quantities from which the individual difference component has been removed; and
a model generation unit that estimates an intra-individual variation parameter relating to the intra-individual variation basis for each of a plurality of model learning images, each capturing at least a part of an individual's face and each assigned an indication of whether it is in a target segment, based on the image feature quantity extracted from each model learning image, the individual difference basis calculated by the individual difference basis calculation unit, and the intra-individual variation basis calculated by the intra-individual variation basis calculation unit, and that learns a model for determining, based on the intra-individual variation parameter, whether an image is in the target segment, using the intra-individual variation parameter estimated for each model learning image and the indication assigned to each model learning image.