JP7032284B2

JP7032284B2 - A device, program and method for estimating the activation timing based on the image of the user's face.

Info

Publication number: JP7032284B2
Application number: JP2018200329A
Authority: JP
Inventors: 剣明呉; 啓一郎帆足
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-10-24
Filing date: 2018-10-24
Publication date: 2022-03-08
Anticipated expiration: 2038-10-24
Also published as: JP2020067562A

Description

The present invention relates to the technology of a dialogue device that realizes natural dialogue with a user.

A dialogue device interacts with a user through the interface of a smartphone or tablet terminal. It converts the user's spoken utterance into text and estimates the meaning of the utterance from its context structure. It then responds to the user based on a dialogue scenario corresponding to that meaning. Examples of such dialogue systems include "Siri®" and "Shabette Concierge®".

In recent years, smart speakers such as "Google Home (registered trademark)" and "Amazon Echo (registered trademark)" and robots such as "SOTA (registered trademark)" and "Unibo (registered trademark)" have come into use as devices for dialogue with users (hereinafter collectively referred to as "robots"). Before talking with such a robot, the user must utter an active command ("OK, XXX", etc.). On detecting this command, the robot starts its voice recognition function and recognizes the user's speech that follows.

Conventionally, there is a guidance robot technique that calls out to a user in consideration of the user's intention (see, for example, Patent Document 1). For each frame image in a time series, this technique determines the direction in which the user is looking, calculates a direction change amount representing how that direction changes over time, and decides whether to call out based on that amount. Specifically, in an exhibition hall or a store, the robot can call out to a user who is looking around restlessly and appears to need help.

There is also a customer purchase intention prediction technique that displays the most suitable advertisement according to the behavior of a customer in a store (see, for example, Patent Document 2). This technique tracks changes in the orientation of the customer's face over a fixed period and recommends to the customer the product with the longest product attention time, that is, the time during which the customer's face stays oriented toward that product.

Patent Document 1: JP-A-2017-159396
Patent Document 2: JP-A-2016-076109

Non-Patent Document 1: "Head Pose Estimation using OpenCV and Dlib", [online], [searched October 4, 2018], Internet <URL: https://www.learnopencv.com/head-pose-estimation-using-opencv-and-dlib/>
Non-Patent Document 2: "Short-time Fourier transform", [online], [searched October 4, 2018], Internet <URL: https://www.ieice.org/jpn/event/FIT/pdf/d/2014/H-039.pdf>
Non-Patent Document 3: "Wavelet transform", [online], [searched October 4, 2018], Internet <URL: http://www.cqpub.co.jp/hanbai/books/30/30961/30961_9syo.pdf>

For the user, uttering an active command to the robot every time can be tedious.
For example, in a home or store with a lot of ambient noise such as television or conversation, the robot may fail to recognize the active command uttered by the user.
Conversely, even when the user is not intentionally speaking to the robot, the robot may react to ambient noise and malfunction.

Moreover, as in Patent Documents 1 and 2, it is difficult in terms of accuracy to make the timing at which the robot speaks optimal for the user using only the amount of change in the orientation of the user's face.
In response, the inventors of the present application conjectured that the video of the user's face shows some characteristic change at the moment the user wants to talk to the robot. They reasoned that if this characteristic change could be learned and identified from experience, the optimal activation timing could be determined.

Accordingly, an object of the present invention is to provide a device, a program, and a method that estimate, with high accuracy and from video of the user's face, the timing at which to start speaking to the user or to start an action toward the user.

According to the present invention, a dialogue device that interacts with a user comprises:
a face area detection means that receives time-series images, captured by a camera, in which the user's face appears, and detects a face area from each image;
a face parameter extraction means that extracts face parameters from the face area appearing in each image;
a feature amount extraction means that extracts a time-frequency feature amount from the time-series change of the face parameters;
a machine learning engine that has been trained in advance with teacher data associating, for each of different time spans in the time-series images, the time-frequency feature amount with activation availability (positive example / negative example), and that, at estimation time, receives the time-frequency feature amount for each time span and estimates, based on the time span that maximizes estimation accuracy, whether the current time is an activation timing for the user; and
an activation means that activates toward the user when the machine learning engine determines true.

According to another embodiment of the dialogue device of the present invention,
it is also preferable that the face parameters extracted by the face parameter extraction means include the Euler angles of the face orientation, the center position of the face, and/or the size of the face.

According to another embodiment of the dialogue device of the present invention,
it is also preferable that the device further comprises a voice recognition means that extracts text from the user's speech, and that,
when the machine learning engine determines false, the threshold of the voice recognition probability in the voice recognition means is raised, thereby reducing voice recognition errors.

According to another embodiment of the dialogue device of the present invention,
it is also preferable that the activation means, as the activation toward the user, utters an initial text based on a dialogue scenario.

According to another embodiment of the dialogue device of the present invention,
when the dialogue device is a robot capable of movement,
it is also preferable that the activation means, as the activation toward the user, moves with an initial behavior based on an action scenario.

According to another embodiment of the dialogue device of the present invention,
when no dialogue with the user is established after the activation means has uttered the initial text,
it is also preferable that the machine learning engine collects the time-frequency feature amount up to that time as teacher data labeled as activation not possible (negative example).

According to another embodiment of the dialogue device of the present invention,
when the user speaks while the dialogue scenario is suspended,
it is also preferable that the machine learning engine collects the time-frequency feature amount up to that time as teacher data labeled as activation possible (positive example).

According to another embodiment of the dialogue device of the present invention,
it is also preferable that the feature amount extraction means extracts the time-frequency feature amount from the time series of each face parameter by a short-time Fourier transform or a wavelet transform, and that
the machine learning engine is an LSTM (Long Short-Term Memory).

According to the present invention, a program causes a computer mounted on a device that interacts with a user to function as:
a face area detection means that receives time-series images, captured by a camera, in which the user's face appears, and detects a face area from each image;
a face parameter extraction means that extracts face parameters from the face area appearing in each image;
a feature amount extraction means that extracts a time-frequency feature amount from the time-series change of the face parameters;
a machine learning engine that has been trained in advance with teacher data associating, for each of different time spans in the time-series images, the time-frequency feature amount with activation availability (positive example / negative example), and that, at estimation time, receives the time-frequency feature amount for each time span and estimates, based on the time span that maximizes estimation accuracy, whether the current time is an activation timing for the user; and
an activation means that activates toward the user when the machine learning engine determines true.

According to the present invention, in a dialogue method of a device that interacts with a user,
the device
has a machine learning engine trained in advance with teacher data associating, for each of different time spans in time-series images, a time-frequency feature amount with activation availability (positive example / negative example), and executes:
a first step of receiving time-series images, captured by a camera, in which the user's face appears, and detecting a face area from each image;
a second step of extracting face parameters from the face area appearing in each image;
a third step of extracting the time-frequency feature amount from the time-series change of the face parameters;
a fourth step of using the machine learning engine at estimation time to estimate, from the time-frequency feature amount for each time span and based on the time span that maximizes estimation accuracy, whether the current time is an activation timing for the user; and
a fifth step of activating toward the user when the fourth step determines true.

According to the dialogue device, program, and method of the present invention, the timing at which to start speaking to the user or to start an action toward the user can be estimated with high accuracy from video of the user's face.

FIG. 1 is a functional configuration diagram of the dialogue device according to the present invention.
FIG. 2 is a functional configuration diagram of a server in a dialogue system.
FIG. 3 is an explanatory diagram showing the processing flow of each functional component in the estimation stage.
FIG. 4 is an external view of the robot of the dialogue device photographing the user's face.
FIG. 5 is an explanatory diagram showing the processing of the face area detection unit and the face parameter extraction unit.
FIG. 6 is an explanatory diagram showing the processing of the feature amount extraction unit.
FIG. 7 is an explanatory diagram showing the processing flow of each functional component in the initial stage.
FIG. 8 is an explanatory diagram showing the processing flow of each functional component during learning.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a functional configuration diagram of the dialogue device according to the present invention.

According to FIG. 1, the dialogue device 1 is a robot (including a smart speaker) that interacts with a user. As input/output devices for the user interface, the dialogue device 1 is equipped with a microphone, a speaker, and a camera. The camera captures video of the user's face. The microphone picks up the user's speech. The speaker speaks to the user by voice.
The user can carry on a natural dialogue with the dialogue device 1, a robot presented as a character.

FIG. 2 is a functional configuration diagram of a server in a dialogue system.

According to FIG. 2, the functional configuration is exactly the same as that of the present invention in FIG. 1, but it is implemented on a server. A smartphone or tablet terminal carried by the user serves as the user interface of the dialogue system.

The dialogue device 1 of the present invention can actively speak to the user or act (for example, the robot raises a hand or starts walking) at the timing that is best for the user.
According to FIG. 1, the dialogue device 1 has a face area detection unit 11, a face parameter extraction unit 12, a feature amount extraction unit 13, a machine learning engine 14, an activation unit 15, a voice recognition unit 101, a dialogue execution unit 102, and a voice conversion unit 103. These functional components are realized by executing a program that causes a computer mounted on the dialogue device to function. The processing flow of these functional components can also be understood as the dialogue method of the device.

The voice recognition unit 101, the dialogue execution unit 102, and the voice conversion unit 103 are functional components of a general dialogue device.
The voice recognition unit 101 receives the user's speech from the microphone, converts the speech into text, and outputs the text to the dialogue execution unit 102.
The dialogue execution unit 102 searches for the text that comes next in the dialogue scenario for the text received from the voice recognition unit 101. That text is output to the voice conversion unit 103. A dialogue scenario associates the next dialogue text with the user's utterance text and is organized as a tree of dialogue nodes, each consisting of a question and an answer.
The voice conversion unit 103 receives a dialogue sentence for the user from the dialogue execution unit 102, converts it into a voice signal, and outputs the voice signal to the speaker.
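
The tree-shaped dialogue scenario described above can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions, not the implementation of the dialogue execution unit 102: the class name `DialogueNode`, the keyword-matching rule, and the sample utterances are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogueNode:
    """One node of a hypothetical scenario tree: what the device says,
    plus child nodes keyed by a keyword expected in the user's reply."""
    system_text: str
    children: Dict[str, "DialogueNode"] = field(default_factory=dict)

    def next_node(self, user_text: str) -> Optional["DialogueNode"]:
        # Assumed matching rule: the first child whose keyword occurs in the text.
        lowered = user_text.lower()
        for keyword, child in self.children.items():
            if keyword in lowered:
                return child
        return None  # no scenario continues from this utterance


# Hypothetical two-branch scenario.
root = DialogueNode(
    "Hello. Would you like the weather or the news?",
    {
        "weather": DialogueNode("Which city?"),
        "news": DialogueNode("Domestic or international?"),
    },
)
```

Here `root.next_node("The weather, please")` returns the "Which city?" node, while an utterance that matches no keyword returns `None`, which corresponds to the scenario not continuing.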

According to the present invention, the operation of the dialogue device 1 using the machine learning engine 14 is divided into an <estimation stage> and an <initial stage>.

<Estimation stage>
FIG. 3 is an explanatory diagram showing the processing flow of each functional component in the estimation stage.

[Face area detection unit 11]
The face area detection unit 11 receives time-series images (video), captured by the camera, in which the user's face appears, and detects the face area from each image.

FIG. 4 is an external view of the robot of the dialogue device photographing the user's face.
FIG. 5 is an explanatory diagram showing the processing of the face area detection unit and the face parameter extraction unit.

The face area detection unit 11 identifies the face itself by extracting conspicuous facial features from each time-series image frame captured by the camera. The features used include, for example, the relative positions and sizes of facial parts and the shapes of the eyes, nose, cheekbones, and chin. The image portion that matches a template created from facial image features is searched for as the face area. Various existing methods are available as the face recognition algorithm.
According to FIG. 5, time-series images of the user's face are arranged in order, and the face area detection unit 11 detects the face area in each image.

[Face parameter extraction unit 12]
The face parameter extraction unit 12 extracts face parameters from the face area appearing in each image. The face parameters include the following.
Euler angles of the face orientation
Center position of the face
Size of the face

Face parameters can be detected using, for example, a head pose estimation method (see, for example, Non-Patent Document 1).
Face orientation determination can be implemented as image recognition using the open-source libraries OpenCV (image processing) and Dlib (machine learning) together with a deep learning classification model.
The center position and size of the face can be derived as the position and size of the face area relative to the entire angle of view.
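
Deriving the center position and size of the face relative to the entire angle of view reduces to simple arithmetic on the detected face area. The sketch below assumes the detector returns an axis-aligned bounding box in pixels; `FaceBox` and `center_and_size` are hypothetical names, and expressing the size as an area ratio is one possible choice rather than something the description specifies.

```python
from typing import NamedTuple, Tuple


class FaceBox(NamedTuple):
    """Face bounding box in pixels (assumed output of a face detector)."""
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def center_and_size(box: FaceBox, frame_w: int, frame_h: int) -> Tuple[float, float, float]:
    """Return (cx, cy, area_ratio), each normalized to the full frame."""
    cx = (box.x + box.w / 2) / frame_w
    cy = (box.y + box.h / 2) / frame_h
    area_ratio = (box.w * box.h) / (frame_w * frame_h)
    return cx, cy, area_ratio
```

For a 640x480 frame, `center_and_size(FaceBox(160, 120, 320, 240), 640, 480)` gives a face centered in the frame that covers a quarter of its area. Normalizing by the frame size keeps the parameters comparable across camera resolutions.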

The face parameters change over time according to the user's actions, for example as follows.
(Sitting down and looking at a mobile phone) ->
The face orientation changes from straight ahead to downward, and the center position of the face moves from top to bottom.
(Standing up and going out) ->
The face orientation changes from front to back, and the center position of the face moves from bottom to top.
(Taking medicine) ->
The face orientation changes from downward to upward and then downward again.
(Approaching the robot while looking around) ->
The face orientation changes left and right, and the size of the face changes greatly.
(Checking the surroundings while looking at a guide map) ->
The face orientation changes from downward to left and right and then downward again.
(Looking at the robot) ->
The face orientation changes up, down, left, and right, and the face position changes up, down, left, and right.

The extracted time-series face parameters are output to the feature amount extraction unit 13.

[Feature amount extraction unit 13]
The feature amount extraction unit 13 extracts a "time-frequency feature amount" from the time-series change of the face parameters. That is, it extracts feature amounts relating to time and frequency simultaneously.

FIG. 6 is an explanatory diagram showing the processing of the feature amount extraction unit.

The feature amount extraction unit 13 extracts the time-frequency feature amount from the time series of each face parameter by, for example, a "short-time Fourier transform" or a "wavelet transform".

The short-time Fourier transform (STFT) is a method that cuts the signal into segments at fixed time intervals and applies a Fourier transform to each segment in turn (see, for example, Non-Patent Document 2). This analyzes the frequency and phase, and their changes, of a time-varying parameter.

When the short-time Fourier transform is used, each face parameter is converted as shown in Table 1, with frequency in the first column and amplitude in the second column. These values are obtained by applying the short-time Fourier transform to the time series of each parameter.
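
The windowed transform described above can be sketched in a few lines. This is a deliberately naive, dependency-free illustration and not the implementation of the feature amount extraction unit 13: the function name `stft_magnitudes`, the Hann window, and the `window` and `hop` parameters are assumptions. A practical system would more likely use an FFT routine from a numerical library.

```python
import cmath
import math
from typing import List, Sequence


def stft_magnitudes(signal: Sequence[float], window: int, hop: int) -> List[List[float]]:
    """Naive STFT: slide a Hann window over the signal and take the DFT
    magnitude of each segment. Each segment yields window // 2 + 1 bins."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window) for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        seg = [signal[start + n] * hann[n] for n in range(window)]
        bins = []
        for k in range(window // 2 + 1):
            # Direct DFT of the windowed segment at frequency bin k.
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / window)
                      for n in range(window))
            bins.append(abs(acc))
        frames.append(bins)
    return frames
```

If a face parameter such as the yaw angle is sampled at `fs` samples per second, bin `k` corresponds to a frequency of `k * fs / window` Hz, so each row gives frequency/amplitude pairs for one time segment.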

The wavelet transform is a method that varies the time width used for analysis according to frequency (see, for example, Non-Patent Document 3). It retains the time-domain information that is lost when frequency characteristics are obtained with the Fourier transform. In the wavelet transform, a waveform spanning a wide frequency range can be represented by scaling and translating small waves (wavelets) and summing them.

When the wavelet transform is used, each face parameter is converted as shown in Table 2, with frequency (converted from the Scale output of the wavelet transform) in the first column, start time to end time in the second column, and amplitude in the third column. Because applying the wavelet transform captures frequency components that vary over time, the time-frequency feature amount can be derived in more detail than with the short-time Fourier transform.
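
The idea of analyzing a signal at several time widths can be illustrated with the Haar wavelet, the simplest wavelet family. The description does not say which wavelet is used, so Haar is a stand-in chosen here for brevity, and the function names `haar_step` and `haar_decompose` are hypothetical. The sketch assumes the input length is divisible by 2 raised to the number of levels.

```python
from typing import List, Sequence, Tuple

SQRT2 = 2 ** 0.5


def haar_step(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: pairwise scaled sums (approximation) and differences (detail)."""
    approx = [(values[i] + values[i + 1]) / SQRT2 for i in range(0, len(values) - 1, 2)]
    detail = [(values[i] - values[i + 1]) / SQRT2 for i in range(0, len(values) - 1, 2)]
    return approx, detail


def haar_decompose(values: Sequence[float], levels: int) -> List[List[float]]:
    """Return detail coefficients per level (finest first), then the final approximation."""
    out = []
    current = list(values)
    for _ in range(levels):
        current, detail = haar_step(current)
        out.append(detail)
    out.append(current)
    return out
```

Each level halves the time resolution and looks at a lower frequency band. For a parameter that jumps once, such as `[0, 0, 0, 0, 1, 1, 1, 1]`, the fine-level details are all zero and only the coarsest detail coefficient is non-zero, which localizes the change in both time and scale.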

The time-frequency feature amount based on the face parameters improves both the comprehensiveness of user behavior pattern recognition and the robustness of the activation timing against disturbances.

[Machine learning engine 14]
The machine learning engine 14 has been trained in advance with teacher data that associates the time-frequency feature amount with activation availability (positive example / negative example). "Activation" means speaking to the user in some way, or the robot moving so as to attract the user's attention. In other words, the correlation between the time-frequency feature amount based on the time-series change of the face parameters and the activation timing being OK or NG is built as a learning model.

The machine learning engine 14 is preferably, for example, an LSTM (Long Short-Term Memory). An LSTM is a kind of RNN (Recurrent Neural Network) that can learn long-term dependencies. An RNN has a chain-like structure that repeats a neural network module.
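
To make the repeated module concrete, the following sketch computes one time step of a single-unit LSTM cell using the standard gate equations. It is a toy illustration, not the machine learning engine 14: the function name `lstm_step`, the scalar weights, and the parameter layout are assumptions, and a real model would use vector-valued states and a deep learning framework.

```python
import math
from typing import Dict, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: float, h_prev: float, c_prev: float,
              params: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float]:
    """One step of a single-unit LSTM cell.
    params maps gate name -> (input weight, recurrent weight, bias)."""
    def pre(gate: str) -> float:
        w, u, b = params[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g            # new cell state (long-term memory)
    h = o * math.tanh(c)              # new hidden state (output)
    return h, c
```

Applying `lstm_step` to each time-frequency feature in sequence, feeding `h` and `c` forward, forms the chain; the cell state `c` is what allows information from early frames to influence the final activation decision.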

The machine learning engine 14 then receives the time-frequency feature amount output from the feature amount extraction unit 13 and estimates whether the current time is an activation timing for the user.
According to FIG. 3, when the activation timing is OK, the machine learning engine 14 notifies the activation unit 15 accordingly.
When the activation timing is NG, the machine learning engine 14 instructs the voice recognition unit 101 to raise the threshold of the voice recognition probability. An NG activation timing means that the user is not paying attention to the dialogue device 1, so raising the threshold prevents ambient noise from being recognized as speech. This reduces misrecognition of user utterances.
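
The interaction between the timing estimate and the recognition threshold can be sketched as a small gate. The class name `SpeechGate` and the two threshold values are hypothetical; the description only states that the threshold is raised when the timing is NG, not by how much.

```python
from dataclasses import dataclass


@dataclass
class SpeechGate:
    """Accepts a recognition result only if its confidence clears a threshold."""
    base_threshold: float = 0.5    # assumed value while the timing is OK
    raised_threshold: float = 0.8  # assumed value while the timing is NG
    threshold: float = 0.5

    def update(self, activation_ok: bool) -> None:
        # Timing NG: the user is presumably not addressing the device,
        # so demand more confidence before accepting any recognized text.
        self.threshold = self.base_threshold if activation_ok else self.raised_threshold

    def accept(self, confidence: float) -> bool:
        return confidence >= self.threshold
```

With these assumed values, a recognition result with confidence 0.6 is discarded while the timing is NG and accepted once the timing becomes OK.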

[Activation unit 15]
The activation unit 15 activates toward the user when the machine learning engine 14 determines true (activation timing OK). Here, "activation" refers to, for example, the following modes.
(1) As the activation toward the user, the device utters an initial text based on a dialogue scenario.
(2) When the dialogue device is a robot capable of movement, as the activation toward the user, it moves with an initial behavior based on an action scenario.

Next, the learning process of the machine learning engine 14 will be described.
The machine learning engine 14 executes an initial-stage learning process, which builds the learning model from teacher data accumulated in advance, and an estimation-stage learning process, which builds the learning model while collecting teacher data.

<Initial-stage learning process of the machine learning engine 14>
FIG. 7 is an explanatory diagram showing the processing flow of each functional component in the initial stage.

According to FIG. 7, the teacher data associates video of the user's face with activation availability (positive example / negative example). The teacher data is, for example, a recording of model facial movements of multiple subjects in front of the camera of the dialogue device 1. Each video of a subject's facial movement is labeled with whether it is an appropriate timing to start speaking or acting.

The video of the user's face is processed by the face area detection unit 11, the face parameter extraction unit 12, and the feature amount extraction unit 13 described above to obtain the time-frequency feature amount. The time-frequency feature amount is associated with activation availability (positive example / negative example) and input to the machine learning engine 14. The machine learning engine 14 thereby builds the learning model.

<Estimation-stage learning process of the machine learning engine 14>
It is difficult for the machine learning engine 14 to cover every learning pattern in the initial stage. For that reason, positive and negative teacher data are also collected in the estimation stage (operation stage), based on the user's positive or negative reactions.
While executing the estimation stage, the machine learning engine 14 collects teacher data serving as positive examples and teacher data serving as negative examples.

FIG. 8 is an explanatory diagram showing the processing flow of each functional component during learning.

(When collecting teacher data serving as positive examples)
When the user speaks while the dialogue scenario is suspended, the machine learning engine 14 collects the time-frequency feature amount up to that time as teacher data labeled as activation possible (positive example).
If the dialogue device 1 has determined that it must not speak or act (negative example) but the user's reaction is positive (the user speaks to the robot on their own), the time-frequency feature amount immediately preceding that reaction is judged to have been one for which activation was possible.

(When collecting negative teacher data)
When no dialogue with the user is established after the activation unit 15 speaks or acts at the activation timing, the machine learning engine 14 collects the time-frequency features up to that time as teacher data labeled non-activatable (negative example).
That is, if the user's reaction is negative (the user ignores the device) even though it was determined that the dialogue device 1 may speak or act (positive example), the time-frequency features up to that moment are judged to have been non-activatable.
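The two collection rules above can be sketched as a small labeling routine. This is an illustrative sketch only: the class name `OnlineCollector`, its method, and the boolean flags are hypothetical, and the case where the device activates and a dialogue is established is left unlabeled here because the text does not say how it is handled.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

POSITIVE, NEGATIVE = 1, 0


@dataclass
class OnlineCollector:
    """Collects (features, label) pairs during the estimation stage."""
    samples: List[Tuple[Any, int]] = field(default_factory=list)

    def observe(self, features: Any, device_activated: bool,
                user_spoke: bool = False,
                dialogue_established: bool = False) -> Optional[int]:
        if not device_activated and user_spoke:
            # The device stayed silent, yet the user spoke to it:
            # the preceding features were in fact activatable.
            label = POSITIVE
        elif device_activated and not dialogue_established:
            # The device spoke or acted, but the user ignored it:
            # the preceding features were in fact non-activatable.
            label = NEGATIVE
        else:
            # No rule in the text applies; collect nothing (assumption).
            return None
        self.samples.append((features, label))
        return label


collector = OnlineCollector()
collector.observe([0.1, 0.2], device_activated=False, user_spoke=True)
collector.observe([0.3, 0.4], device_activated=True, dialogue_established=False)
```

The collected `samples` could then be used to retrain or fine-tune the learning model; how and when that retraining happens is not specified in the text.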

<Setting different time spans>
As another embodiment, the teacher data of the machine learning engine preferably associates the time-frequency features with the activation label for each of several different time spans.
For example, a plurality of time spans, fine-grained and coarse-grained, are set, each using a fixed number of frames from the most recent N seconds as the default value.
(Fine-grained time span) most recent 1 second, 10 frames -> derivation of time-frequency features
(Coarse-grained time span) most recent 5 seconds, 10 frames -> derivation of time-frequency features
Here, a time span refers to the time interval (sampling interval) of the images used to derive the time-frequency features.

Even when the same video of the user's face is input, the feature amount extraction unit 13 described above outputs different time-frequency features for each time span. The time-frequency features for each time span are then input to the machine learning engine 14. As a result, a different learning model is built for each time span.
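One way to picture the fine-grained and coarse-grained time spans is as frame selection over the same video: a fixed number of frames is drawn from the most recent N seconds, so a longer span means a wider sampling interval. The sketch below is an assumption-laden illustration; the function name `select_frames`, the nearest-frame rule, and the 30 fps example are hypothetical and not taken from the text.

```python
from typing import List


def select_frames(timestamps: List[float], span_seconds: float,
                  num_frames: int = 10) -> List[int]:
    """Indices of num_frames frames evenly spaced over the last span_seconds."""
    latest = timestamps[-1]
    step = span_seconds / num_frames  # sampling interval for this time span
    targets = [latest - span_seconds + step * (i + 1) for i in range(num_frames)]
    # For each target time, pick the frame whose timestamp is closest.
    return [
        min(range(len(timestamps)), key=lambda j: abs(timestamps[j] - t))
        for t in targets
    ]


# Hypothetical 10-second video captured at 30 frames per second.
timestamps = [i / 30 for i in range(301)]
fine = select_frames(timestamps, span_seconds=1.0)    # every 0.1 s over the last 1 s
coarse = select_frames(timestamps, span_seconds=5.0)  # every 0.5 s over the last 5 s
```

Both selections contain 10 frames and end at the latest frame, but they cover different durations, which is why the same video yields different time-frequency features per time span.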

In the estimation stage, the machine learning engine 14 may evaluate the estimation accuracy for each time span and may use the time span with the highest accuracy. The estimation accuracy here may be calculated as an agreement rate, obtained by comparing the estimation result for each time span with the user's positive or negative reaction.
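The agreement-rate evaluation can be sketched as follows. This is a minimal illustration; the function names, the span keys "1s" and "5s", and the sample values are hypothetical. A prediction of `True` means "activation timing", and a reaction of `True` means the user reacted positively.

```python
from typing import Dict, List


def agreement_rate(predictions: List[bool], reactions: List[bool]) -> float:
    """Fraction of estimates that match the user's actual reaction."""
    matches = sum(p == r for p, r in zip(predictions, reactions))
    return matches / len(reactions)


def best_time_span(predictions_by_span: Dict[str, List[bool]],
                   reactions: List[bool]) -> str:
    """Time span whose estimates agree most often with the user's reactions."""
    return max(
        predictions_by_span,
        key=lambda span: agreement_rate(predictions_by_span[span], reactions),
    )


reactions = [True, False, True, True]
predictions_by_span = {
    "1s": [True, True, True, False],   # agrees on 2 of 4
    "5s": [True, False, True, False],  # agrees on 3 of 4
}
chosen = best_time_span(predictions_by_span, reactions)  # "5s"
```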

As described in detail above, according to the dialogue device, program, and method of the present invention, the timing at which to speak to or act toward the user can be estimated with high accuracy from video of the user's face. That is, from the user's point of view, convenience and perceived intelligence are improved, making it possible to realize a robot or smart speaker that can read the mood of the people around it.

Various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention can readily be made to the embodiments described above by those skilled in the art. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined by the claims and their equivalents.

1 Dialogue device
11 Face area detection unit
12 Face parameter extraction unit
13 Feature amount extraction unit
14 Machine learning engine
15 Activation unit
101 Voice recognition unit
102 Dialogue execution unit
103 Voice conversion unit

Claims

A dialogue device that interacts with a user, comprising:
a face area detection means that receives time-series images of the user's face captured by a camera and detects a face area in each image;
a face parameter extraction means that extracts face parameters from the face area in each image;
a feature amount extraction means that extracts a time-frequency feature amount from time-series changes in the face parameters;
a machine learning engine that has been trained in advance on teacher data in which the time-frequency feature amount and an activation label (positive example / negative example) are associated for each of different time spans in the time-series images, and that, at the time of estimation, receives the time-frequency feature amount for each time span and estimates, based on the time span that maximizes the estimation accuracy, whether or not the current moment is an activation timing for the user; and
an activation means that activates toward the user when the machine learning engine determines true.

The dialogue device according to claim 1, wherein the face parameter extraction means extracts, as the face parameters, Euler angles of the face orientation, a center position of the face, and/or a size of the face.

The dialogue device according to claim 1 or 2, further comprising a voice recognition means that extracts text from the user's spoken voice,
wherein, when the machine learning engine determines false, the threshold of the voice recognition probability in the voice recognition means is raised to reduce voice recognition errors.

The dialogue device according to any one of claims 1 to 3, wherein the activation means utters an initial text based on a dialogue scenario as an activation to a user.

The dialogue device according to any one of claims 1 to 3, wherein, when the dialogue device is a movable robot,
the activation means performs an initial behavior based on an action scenario as the activation toward the user.

The dialogue device according to claim 4, wherein, when no dialogue with the user is established after the activation means utters the initial text,
the machine learning engine collects the time-frequency feature amount up to that time as teacher data labeled non-activatable (negative example).

The dialogue device according to claim 4 or 6, wherein, when the user speaks while the dialogue scenario is interrupted,
the machine learning engine collects the time-frequency feature amount up to that time as teacher data labeled activatable (positive example).

The program according to any one of claims 1 to 7, wherein the feature amount extraction means extracts the time-frequency feature amount by applying a short-time Fourier transform or a wavelet transform to the time series of each face parameter, and
the computer is caused to function such that the machine learning engine is an LSTM (Long Short-Term Memory).

A program that causes a computer installed in a device that interacts with a user to function as:
a face area detection means that receives time-series images of the user's face captured by a camera and detects a face area in each image;
a face parameter extraction means that extracts face parameters from the face area in each image;
a feature amount extraction means that extracts a time-frequency feature amount from time-series changes in the face parameters;
a machine learning engine that has been trained in advance on teacher data in which the time-frequency feature amount and an activation label (positive example / negative example) are associated for each of different time spans in the time-series images, and that, at the time of estimation, receives the time-frequency feature amount for each time span and estimates, based on the time span that maximizes the estimation accuracy, whether or not the current moment is an activation timing for the user; and
an activation means that activates toward the user when the machine learning engine determines true.

A dialogue method for a device that interacts with a user, wherein
the device
has a machine learning engine that has been trained in advance on teacher data in which a time-frequency feature amount and an activation label (positive example / negative example) are associated for each of different time spans in time-series images, and performs:
a first step of receiving time-series images of the user's face captured by a camera and detecting a face area in each image;
a second step of extracting face parameters from the face area in each image;
a third step of extracting the time-frequency feature amount from time-series changes in the face parameters;
a fourth step of estimating, using the machine learning engine at the time of estimation, from the time-frequency feature amount for each time span and based on the time span that maximizes the estimation accuracy, whether or not the current moment is an activation timing for the user; and
a fifth step of activating toward the user when the fourth step determines true.