JP2008026485A

JP2008026485A - Speech production control system of remote control android

Info

Publication number: JP2008026485A
Application number: JP2006197112A
Authority: JP
Inventors: Carlos Toshinori Ishii; カルロス寿憲石井; Shuichi Nishio; 修一西尾; Hiroshi Ishiguro; 浩石黒; Norihiro Hagita; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-07-19
Filing date: 2006-07-19
Publication date: 2008-02-07
Anticipated expiration: 2026-07-19
Also published as: JP5055486B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech production control system for a remotely operated android that enables the android to perform lip movements matched to the operator's voice.

SOLUTION: The android control system includes, for example, a remote operation terminal and a control device for the android. When the operator speaks, the speech is reproduced after a fixed delay. A lip shape is estimated from acoustic features extracted from the speech, using a nonlinear model (S27). The motion delay from the time a motion command for a given lip shape is issued until that lip shape is actually formed is estimated (S31). Based on the time series of estimated lip shapes, the lip motion over a predetermined section is reconstructed into an optimized motion (S35). Each motion command is issued at a timing set relative to the start of voice reproduction, based on its motion delay (S39). The android thereby achieves lip movements matched to the operator's speech.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech production control system for a remotely operated android, and more particularly to a system that controls the lip movements of an android during speech based on, for example, the speech of an operator.

In general, outputting lip shapes from the phoneme information used for speech synthesis is an effective approach. When no phoneme information is available, however, the lip shape must be estimated from the acoustic features of the speech. Techniques for reproducing human lip movements from speech alone have been proposed in, for example, Non-Patent Documents 1-4. Estimating lip shape from speech alone consists of two steps: extracting acoustic features, and mapping the acoustic features to a lip shape. Acoustic features extracted from the speech signal include LPC (linear predictive coding) cepstral coefficients, MFCCs (Mel-frequency cepstral coefficients), and LSP (line spectrum pair) coefficients. Methods for mapping acoustic features to lip shapes include linear regression analysis, neural networks, HMMs (hidden Markov models), and KNN (k-nearest neighbors). In these methods, a one-to-one mapping between the speech and image-based lip shape information is performed for each periodic frame.

Non-Patent Document 1: Lavagetto, F., "Converting speech into lip movements: A multimedia telephone for hard of hearing people", IEEE Trans. on Rehabilitation Engineering, Vol. 3, No. 1, pp. 90-102, March 1995.
Non-Patent Document 2: Yehia, H., Rubin, P., Vatikiotis-Bateson, E., "Quantitative association of vocal-tract and facial behavior", Speech Communication, Vol. 26, No. 1-2, pp. 23-43, 1998.
Non-Patent Document 3: Hong, P., Wen, Z., Huang, T., "Real-time speech-driven face animation with expressions using neural networks", IEEE Trans. on Neural Networks, Vol. 13, No. 4, pp. 916-927, July 2002.
Non-Patent Document 4: Gutierrez-Osuna, R., Kakumanu, P., et al., "Speech-driven facial animation with realistic dynamics", IEEE Trans. on Multimedia, Vol. 7, No. 1, pp. 33-41, Feb. 2005.

However, because the relationship between acoustic features and lip shape is nonlinear, no method yet works accurately. Moreover, when controlling a physical robot or an android (a robot that closely resembles a human in form and moves in a closely humanlike way), the response of the actuators varies with the motion. For example, opening the mouth is slower than closing it. This makes it difficult to map acoustic features one-to-one onto a target lip shape for every frame, as the related art described above does.

There is a further problem. In speech production, sound is emitted when air passes through after the lips, tongue, and other articulators have taken their shape. For voiceless plosives (/p/, /t/, /k/) and similar sounds, the shape of the lips and tongue is set before the sound occurs. Moving the lips in response to the sound therefore produces unnatural motion. In other words, controlling facial motion, chiefly of the lips, on the basis of the sound inevitably involves a delay.

In addition, the face of an android moves differently from a human face, because current technology cannot embed actuators in the way muscles are embedded in a human face. Response speeds also differ, so an android cannot be made to move exactly as a human does. In general, many movements that are possible for a human are impossible for an android.

The main object of the present invention is therefore to provide a speech production control system for a remotely operated android that enables the android to perform lip movements matched to the operator's voice.

The invention of claim 1 is a system for controlling the speech motion of an android remotely operated by an operator in accordance with the operator's speech, the system comprising: voice acquisition means for acquiring the voice uttered by the operator; voice reproduction means for reproducing the acquired voice after a fixed delay; lip shape estimation means for estimating a lip shape from acoustic features of the acquired voice using a nonlinear model; motion delay estimation means for estimating a motion delay, which indicates the time taken from issuing a motion command for forming a lip shape until that lip shape is formed; motion command setting means for setting, based on the estimated motion delay, the issue timing of the motion command relative to the reproduction start timing of the voice reproduction means; and motion command issuing means for issuing each motion command according to the issue timing set by the motion command setting means.

In the invention of claim 1, the system (10: the reference numeral of the corresponding element in the embodiment described later; the same applies hereinafter) controls the speech motion of a remotely operated android (12), that is, the facial motion during speech, centered on the lips, in accordance with the operator's speech. The system includes, for example, a remote operation terminal (20) and a control device (14) for the android, and comprises the following means. The voice uttered by the operator is acquired by the voice acquisition means (24, 70, 74, 78, S3). The voice reproduction means (70, 80, 48, S7, S9) reproduces the acquired voice after a fixed delay. The lip shape estimation means (70, S5, S21-S27) estimates the lip shape from the acoustic features of the acquired voice using a nonlinear model. For example, MFCCs are used as the acoustic features, and the lip shape is estimated from the acoustic features within a predetermined time before and after a point at which the features show a large amount of variation. Because the relationship between the speech signal and the lip shape is nonlinear, a nonlinear model such as a neural network is used.

The motion delay estimation means (70, S31) estimates the motion delay, which indicates the time taken from issuing a motion command for forming a lip shape until the actuators (40) of the android are actually driven and that lip shape is formed on the face. The motion command setting means (70, S37) sets the issue timing of the motion command based on the estimated motion delay, relative to the reproduction start timing of the speech, which is reproduced after a fixed delay. Specifically, the motion command for the lip shape formed when producing a given sound is issued earlier than the timing at which that sound is output, by an amount that accounts for the motion delay. The motion command issuing means (70, S39) issues each motion command according to the set issue timing.
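The timing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the names `LipTarget`, `schedule_commands`, and `min_playback_delay`, the shape labels, and all numeric values are hypothetical. The sketch also assumes, as an inference from the text, that the fixed playback delay must be long enough for every command to be issued after the speech has been captured.

```python
from dataclasses import dataclass


@dataclass
class LipTarget:
    shape: str           # hypothetical label for an estimated lip shape
    audio_offset: float  # seconds from utterance start to the matching sound
    motion_delay: float  # estimated seconds from command issue to shape formed


def schedule_commands(targets, playback_start):
    """Return (issue_time, shape) pairs, earliest first.

    Each command is issued ahead of its sound by the estimated motion delay,
    using the playback start time as the reference point.
    """
    schedule = []
    for t in targets:
        sound_time = playback_start + t.audio_offset
        schedule.append((sound_time - t.motion_delay, t.shape))
    return sorted(schedule)


def min_playback_delay(targets):
    """Smallest playback delay that keeps every issue time at or after capture.

    Assumption of this sketch: a command cannot be issued before the
    utterance itself has started to be captured.
    """
    lead = max((t.motion_delay - t.audio_offset for t in targets), default=0.0)
    return max(0.0, lead)


# Illustrative values only: opening is given a longer delay than closing,
# following the remark that opening the mouth is slower than closing it.
PLAYBACK_DELAY = 0.5
capture_time = 10.0
playback_start = capture_time + PLAYBACK_DELAY
targets = [LipTarget("open", 0.10, 0.30), LipTarget("closed", 0.40, 0.15)]
plan = schedule_commands(targets, playback_start)
```

Because each target carries its own delay estimate, slow and fast motions can be given different lead times instead of a single global offset.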

According to the invention of claim 1, therefore, the sound corresponding to a lip shape is output after the actuators of the android have been driven and that lip shape has actually been formed on its face. In this way, a remotely operated android can achieve lip movements matched to the operator's voice.

The invention of claim 2 is a system dependent on claim 1, further comprising optimization means for optimizing the motion over a predetermined section based on the time series of lip shapes estimated for that section by the lip shape estimation means.

In the invention of claim 2, the optimization means (70, S35) reconstructs the motion within a predetermined section based on the time series of lip shapes for that section, thereby optimizing the motion over the section. The predetermined section is set, for example, to a duration long enough to contain several phonemes or word units. Specifically, transformations such as simplifying the motion, presenting only the important phonemes, making the amount of motion relative, and converting the motion speed are attempted over the section. The issue timing of the command for each motion in the optimized lip movement is then set, and the optimized lip movement is presented. This makes it possible, for example, to make the lip movement look more natural or to perform the motion more quickly.
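Two of the transformations named above, simplifying the motion and making the amount of motion relative, might look like the following Python sketch. The source does not specify how either is computed, so the functions `simplify` and `relativize`, the threshold, and the representation of a lip shape as a single aperture value between 0 and 1 are all assumptions made for illustration.

```python
def simplify(section, min_change=0.15):
    """Drop targets that barely differ from the last kept target.

    `section` is a list of (time_s, aperture) pairs for one section.
    """
    if not section:
        return []
    kept = [section[0]]
    for time_s, aperture in section[1:]:
        if abs(aperture - kept[-1][1]) >= min_change:
            kept.append((time_s, aperture))
    return kept


def relativize(section, max_aperture=1.0):
    """Rescale apertures so the largest one in the section uses the full range."""
    peak = max((a for _, a in section), default=0.0)
    if peak == 0.0:
        return list(section)
    return [(t, a / peak * max_aperture) for t, a in section]


# Illustrative time series of estimated lip apertures over one section.
section = [(0.00, 0.0), (0.05, 0.05), (0.10, 0.4), (0.15, 0.45), (0.20, 0.1)]
simplified = simplify(section)
relative = relativize(simplified)
```

Working on a whole section rather than frame by frame is what allows small intermediate movements to be skipped, which reduces the number of commands the actuators must follow.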

According to the present invention, actuator motion commands are issued relative to the reproduction start timing of the speech, taking into account the motion delay from issuing a motion command until the corresponding lip shape is formed, so lip movements matched to the operator's speech can be performed.

The above object, other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments, given with reference to the drawings.

Referring to FIG. 1, the android control system (hereinafter simply "system") 10 of this embodiment includes an android 12. The android 12 is a humanoid robot whose form (appearance and so on) closely resembles a human, and it performs humanlike actions (gestures, behavior, speech). The android 12 is connected to a control device 14, to which an environment sensor 16 is connected.

The control device 14 is connected to a remote operation terminal 20 via a network 18 such as the Internet or a telephone line. The remote operation terminal 20 is a general-purpose computer such as a PC or PDA, to which a speaker 22, a microphone 24, and a monitor 26 are connected. Although not shown, the remote operation terminal 20 naturally includes input devices such as a keyboard and a mouse. The program and data for controlling the operation of the remote operation terminal 20 are stored in a memory (not shown), and a CPU (not shown) controls the overall operation of the remote operation terminal 20.

FIG. 2 is a block diagram showing the electrical configuration of the android 12, the control device 14, and the environment sensor 16. Referring to FIG. 2, the android 12 includes a plurality of actuators (for example, air actuators) 40, each of which is connected to a control board (actuator control board) 98 of the control device 14 (14b). The actuators 40 are provided to present the body motions of the android 12. Some of the actuators 40 are used for facial motion of the android 12, centered on the lips. For example, separate actuators 40 are provided for moving the upper jaw, the lower jaw, the upper lip, the lower lip, and the left and right sides of the mouth (cheeks). The other actuators 40 are used to move other parts of the android 12, such as the joints, eyes, and abdomen. For simplicity, the compressor is omitted in this embodiment.

The android 12 also includes a tactile sensor 42, an eye camera 44, a collision sensor 46, a speaker 48, and a microphone 50. The tactile sensor 42, the eye camera 44, and the collision sensor 46 are connected to a sensor input/output board 100 of the control device 14 (14b).

The tactile sensor 42, or skin sensor, is for example a touch sensor and forms part of the android 12's sense of touch. That is, the tactile sensor 42 is used to detect whether a person or another object has touched the android 12. The output (detection data) of the tactile sensor 42 is given to an MPU 90 via the sensor input/output board 100. The MPU 90 transmits the detection data from the tactile sensor 42 to a CPU 70 of a first computer 14a. The CPU 70 can therefore detect that a person or another object has touched the android 12. A pressure sensor may also be used as the tactile sensor 42. In that case, it is possible to know not only whether a person or another object has touched the skin of the android 12 but also how strongly it was touched.

The eye camera 44 is an image sensor and forms part of the android 12's vision. That is, the eye camera 44 is used to capture video or images as seen from the eyes of the android 12. In this embodiment, data (image data) corresponding to the video captured by the eye camera 44 (moving or still images) is given to the MPU 90 via the sensor input/output board 100. The MPU 90 transmits the image data to the CPU 70 of the first computer 14a, and the CPU 70 not only detects changes in the captured video but also transmits the image data to the remote operation terminal 20 via the network 18. The remote operation terminal 20 then outputs the received image data to the monitor 26, so that the video captured by the eye camera 44 is displayed on the monitor 26.

The collision sensor 46 determines whether a person or another object has collided with the android 12. The output (detection data) of the collision sensor 46 is given to the MPU 90 via the sensor input/output board 100. The MPU 90 transmits the detection data from the collision sensor 46 to the CPU 70 of the first computer 14a. The CPU 70 can therefore detect that a person or another object has collided with the android 12.

The speaker 48 and the microphone 50 are connected to a voice input/output board 80 of the control device 14 (14a). The speaker 48 outputs sound when the android 12 speaks. When the operator of the remote operation terminal 20 (hereinafter "remote operator") speaks directly, that voice is output. Specifically, when the remote operator speaks into the microphone 24, the corresponding voice data is given from the remote operation terminal 20 to the control device 14a (CPU 70) via the network 18. The CPU 70 then outputs the voice data from the speaker 48 via the voice input/output board 80. When a predetermined, preprogrammed action is performed, synthesized speech is output from the speaker 48.

The microphone 50 is a sound sensor and forms part of the android 12's hearing. The microphone 50 is directional and is used mainly to detect the voice of a person (user) who is interacting (communicating) with the android 12.

The control device 14 consists of a first computer 14a and a second computer 14b. For example, the first computer 14a is the main computer and the second computer 14b is a subordinate computer. Although the control device 14 consists of two computers (14a, 14b) in this embodiment, it may consist of a single computer if that computer has sufficient processing capacity.

The first computer 14a includes the CPU 70, to which a memory 74, a communication board 76, a LAN board 78, the voice input/output board 80, and a sensor input/output board 82 are connected via an internal bus 72. The memory 74 includes, for example, a main storage device such as a hard disk drive (HDD), a ROM, and a RAM. Although a detailed description is omitted, the memory 74 stores the program and data for controlling the overall operation of the control device 14. In particular, for each command name for an action of the android 12, it stores the control information for executing the action indicated by that command name.

Here, an action means a body motion, such as a gesture or behavior, or a speech motion. In this embodiment, therefore, the control information includes not only the control data for driving the actuators 40 but also, where necessary, voice data of synthesized speech for the content to be spoken. Body motions also include natural (unconscious) motions. Typical examples of unconscious motions are blinking and breathing. Besides such physiological motions, motions arising from personal habits are also included among unconscious motions. Examples of habitual motions include touching the hair, touching the face, and biting the nails.

The android 12 executes such actions in response to external (environmental) stimuli or in accordance with commands from the remote operator (remote operation commands).

Returning to FIG. 2, the communication board 76 is an interface for data communication with another computer (in this embodiment, the second computer 14b). For example, the communication board 76 is connected to a communication board 96 of the second computer 14b, described later, by a cable (not shown) such as an RS232C cable. The LAN board 78 is an interface for data communication with another computer (in this embodiment, the remote operation terminal 20) via the network 18. In this embodiment, the LAN board 78 is connected using a LAN cable (not shown).

In this embodiment the computers are described as forming a wired network using cables, but this is not a limitation. They may form a wireless network, or a network in which wired and wireless links are mixed.

The voice input/output board 80 is an interface for inputting and outputting sound and, as described above, is connected to the speaker 48 and the microphone 50 of the android 12. The voice input/output board 80 converts voice data given by the CPU 70 into a voice signal and outputs it to the speaker 48. It also converts a voice signal input through the microphone 50 into voice data and gives it to the CPU 70.

Although a detailed description is omitted, the control device 14a has a speech recognition function. For example, when what a person says to the android 12 is input through the microphone 50, the CPU 70 recognizes the content of the speech by DP matching or a hidden Markov method. The dictionary data for speech recognition is assumed to be stored in the memory 74.

The sensor input/output board 82 is an interface for giving the outputs of various sensors to the CPU 70 and for outputting control data to those sensors. In this embodiment, the environment sensor 16 is connected to the sensor input/output board 82, and the environment sensor 16 includes an omnidirectional camera 60, a PTZ camera 62, and a floor sensor 64.

The omnidirectional camera 60 is installed in the room or place (area) where the android 12 is located and can capture the whole of that room or place (360 degrees). Image data corresponding to the video captured by the omnidirectional camera 60 is given to the CPU 70. The CPU 70 not only detects changes in the captured video based on the image data but also transmits the image data to the remote operation terminal 20. The remote operation terminal 20 outputs the received image data to the monitor 26, so that the video captured by the omnidirectional camera 60 is displayed on the monitor 26.

The PTZ camera 62 is a camera whose pan, tilt, and zoom can each be controlled (adjusted) according to the operator's instructions. For example, when the remote operator enters a pan, tilt, or zoom instruction, the corresponding control signal (command) is given from the remote operation terminal 20 to the first computer 14a via the network 18. The CPU 70 of the first computer 14a then drives the PTZ camera 62 according to that command. Image data corresponding to the video captured by the PTZ camera 62 is also given to the CPU 70. The CPU 70 not only detects changes in the captured video based on the image data but also transmits the image data to the remote operation terminal 20. The remote operation terminal 20 outputs the received image data to the monitor 26, so that the video captured by the PTZ camera 62 is also displayed on the monitor 26.

In this embodiment, the images captured by the eye camera 44, the omnidirectional camera 60, and the PTZ camera 62 are displayed in a split-screen layout on the monitor 26 connected to the remote operation terminal 20. By watching the monitor 26, the remote operator can therefore see the view along the android 12's line of sight and the situation around the android 12.

The floor sensor 64, or floor pressure sensor, includes a large number of detection elements (pressure-sensitive sensors), not shown, which are embedded across the floor of the room or place where the android 12 is located. From the output of the floor sensor 64, it is therefore possible to know whether there are people around the android 12, how many there are, the direction of each person as seen from the android 12, the distance between the android 12 and each person, and so on.
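One way the quantities listed above could be derived from a grid of pressure cells is sketched below in Python. The source does not describe the processing, so everything here is an assumption: the function `find_people`, the binary grid layout, the cell size, and in particular the simplification that each group of adjacent pressed cells corresponds to one person.

```python
import math


def find_people(grid, cell_size, android_xy):
    """Group adjacent pressed cells and report each group's distance and bearing.

    `grid[row][col]` is truthy where a cell is pressed; `cell_size` is the
    cell edge length in metres; `android_xy` is the android's (x, y) position.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    people = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or (r, c) in seen:
                continue
            # Flood-fill one blob of 4-connected pressed cells.
            stack, blob = [(r, c)], []
            seen.add((r, c))
            while stack:
                cr, cc = stack.pop()
                blob.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            # Centroid of the blob in metres, using cell centres.
            x = sum(cc + 0.5 for _, cc in blob) / len(blob) * cell_size
            y = sum(cr + 0.5 for cr, _ in blob) / len(blob) * cell_size
            dx, dy = x - android_xy[0], y - android_xy[1]
            people.append({
                "distance": math.hypot(dx, dy),
                "bearing_deg": math.degrees(math.atan2(dy, dx)),
            })
    return people


# Illustrative 4 x 4 floor with two separate pressed regions.
grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
people = find_people(grid, cell_size=0.5, android_xy=(0.0, 0.0))
```

A real installation would need to merge the two footprints of a single person and filter out furniture, which this sketch deliberately leaves out.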

The second computer 14b includes the MPU 90, to which a memory 94, the communication board 96, the control board 98, and the sensor input/output board 100 are connected via an internal bus 92.

The memory 94 includes an HDD, a ROM, and a RAM, and stores the program and data for controlling the operation of the control device 14b. The communication board 96 is an interface for data communication with another computer (in this embodiment, the first computer 14a). The control board 98 is an interface that outputs control data for controlling the plurality of actuators 40 and receives angle information (rotation angle, bending angle) from each actuator 40. The MPU 90 can therefore perform feedback control of the actuators 40. The MPU 90 drives each actuator 40 in accordance with motion commands (control data) from the CPU 70 of the first computer 14a.
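The feedback loop made possible by the angle readings can be illustrated with a minimal Python sketch. The source does not state which control law the MPU 90 uses, so simple proportional control is assumed here, and `SimulatedActuator`, `p_control_step`, `drive_to`, the gain, and the step count are all hypothetical stand-ins for the real hardware interface.

```python
def p_control_step(target_angle, measured_angle, gain=0.5):
    """Return a control output proportional to the remaining angle error."""
    return gain * (target_angle - measured_angle)


class SimulatedActuator:
    """Toy actuator whose angle changes by exactly the applied control value."""

    def __init__(self, angle=0.0):
        self.angle = angle

    def apply(self, control):
        self.angle += control

    def read_angle(self):
        return self.angle


def drive_to(actuator, target_angle, steps=20, gain=0.5):
    """Repeatedly read the angle and correct toward the commanded target."""
    for _ in range(steps):
        control = p_control_step(target_angle, actuator.read_angle(), gain)
        actuator.apply(control)
    return actuator.read_angle()


# Drive a hypothetical jaw actuator from 0 degrees toward 30 degrees.
jaw = SimulatedActuator(angle=0.0)
final_angle = drive_to(jaw, target_angle=30.0)
```

With a gain of 0.5 the error halves on every step in this toy model, so the angle converges toward the target without overshoot. A pneumatic actuator would respond far less ideally, which is why the system also estimates a per-motion delay.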

The sensor input/output board 100 is an interface that passes the outputs of the various sensors to the MPU 90 and passes control data from the MPU 90 to the sensors. As described above, the tactile sensor 42, the eye camera 44, and the collision sensor 46 are connected to the sensor input/output board 100.

In this embodiment, the control device 14 is provided separately from the android 12, but the android 12 together with the control device 14 may be called the android. The environment sensor 16 may also be included in what is called the android.

The control device 14 and the remote operation terminal 20 are connected via the network 18 on the assumption that the operator is at a location remote from the android 12. When the operator is at or near the place where the android 12 is installed, the remote operation terminal 20 can instead be connected directly to the control device 14 without going through the network 18.

The android 12 is placed in a company, an event venue, or the like, and can work as a stand-in for a human. For example, the android 12 serves as a receptionist or guide at a company or event venue and attends to visitors. The android 12 executes a predetermined response in accordance with external stimuli detected by the sensor group (42, 44, 46, 50, 60, 62, 64) of the android 12 and the environment sensor 16. The actions that the android 12 should (and can) execute in each situation, that is, the control information that the CPU 70 should issue, are determined in advance and stored in the memory 74.

However, when the android 12 cannot respond to a stimulus, for example because no response corresponding to the sensor readings is stored in the memory 74 or because the external stimulus cannot be recognized, the android 12 calls the remote operator. That is, the control device 14 (CPU 70) notifies the remote operation terminal 20 that the android 12 cannot handle the situation. For example, when the android 12 cannot understand a human's question (cannot recognize the speech), the remote operator who has been called understands the question and controls the android 12 by remote operation.

Further, as described above, the images captured by the eye camera 44, the omnidirectional camera 60, and the PTZ camera 62 are displayed on the monitor 26 of the remote operation terminal 20, so the remote operator can see the vicinity and surroundings of the android 12 and the human interacting with it. The remote operator can therefore operate the android 12 remotely as needed and have it execute commanded actions, even without a call from the android 12 (control device 14).

Instead of instructing the output of a predetermined synthesized voice, the remote operator can speak into the microphone 24 so that his or her own voice is output from the speaker 48 of the android 12, and can thus converse directly with the human. That is, when the remote operator speaks, voice data corresponding to the voice input detected by the microphone 24 is acquired, sent from the remote operation terminal 20 to the control device 14 via the network 18, and output from the speaker 48 of the android 12.

When the human interacting with the android 12 speaks, the voice is input through the microphone 50 of the android 12, and the corresponding voice data is transmitted from the control device 14 to the remote operation terminal 20 via the network 18 and output from the speaker 22.

The android 12 is a robot whose appearance and movements closely resemble those of a human. If it output the remote operator's voice without moving its mouth, or moved its mouth with no relation to the voice, it would strike people as highly unnatural. The android 12 therefore moves its face, chiefly the lips, in time with the remote operator's voice as it is output.

The operation of the system 10 will be described with reference to the flowcharts shown in FIGS. 3 and 4. FIG. 3 shows an example of the utterance process of the control device 14. The CPU 70 of the control device 14a repeatedly executes this utterance process at regular intervals.

In step S1 of FIG. 3, it is determined whether voice data has been received. When the remote operator speaks, the remote operation terminal 20 transmits the voice data of the utterance captured by the microphone 24, so this step determines whether that voice data has been received via the network 18. The remote operation terminal 20 acquires the utterance as voice data at a predetermined sampling rate (for example, 8 kHz) and transmits the acquired voice data at regular intervals in packets of a predetermined length (for example, 20 ms).
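As a small worked example of the figures quoted above, the number of samples carried by one packet follows directly from the sampling rate and the packet length. The function name is illustrative and not part of the described system.

```python
def samples_per_packet(sample_rate_hz, packet_ms):
    """Number of audio samples in one packet of the given duration."""
    return sample_rate_hz * packet_ms // 1000


# With the example values from the text: 8 kHz sampling, 20 ms packets.
print(samples_per_packet(8000, 20))  # 160
```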

If "YES" in step S1, the voice storage process is started in step S3. The CPU 70 executes the voice storage process in parallel with other processes. In this process, the received voice data is stored sequentially in the memory 74. The voice storage process ends when no more utterance is detected and voice data is no longer received.

Subsequently, in step S5, the lip movement control process is started. The CPU 70 executes the lip movement control process in parallel with other processes. In this process, the acquired utterance is analyzed and the lip movement is controlled based on that voice. An example of the lip movement control process is shown in FIG. 4, described later.

In step S7, it is determined whether a fixed time has elapsed since the voice was acquired. In this embodiment, the acquired utterance is reproduced after a fixed delay, so this determination waits until the fixed time has elapsed since the voice data was acquired (received).

If "YES" in step S7, the voice reproduction process is started in step S9. The CPU 70 executes the voice reproduction process in parallel with other processes. In this process, the acquired voice data is read from the memory 74 and passed to the voice input/output board 80, so that the voice is output from the speaker 48 of the android 12. The voice reproduction process ends when all of the acquired voice data has been reproduced.

If "NO" in step S1, that is, when no utterance is being made, the utterance process of FIG. 3 ends without further action.

An example of the lip movement control process started in step S5 will be described with reference to FIG. 4. First, in step S21, the variation amount of the acoustic features is extracted.

For a physical object such as the android 12, it is difficult to control the lip shape frame by frame as one would with an image. Therefore, the frequency content and cepstrum of the remote operator's voice are first analyzed to detect the positions where the acoustic features change sharply. The variation amount of the acoustic features is calculated, for example, as the mean squared error between the parameters (for example, MFCC) of the frames of a predetermined duration (for example, about 20 ms) before and after a given time. Acoustic features that can be obtained from the voice signal include LPC cepstral coefficients, MFCC coefficients, LSP coefficients, formant frequencies, F0 (fundamental frequency), and RMS (root mean square). For vowels, the formant frequencies have an intuitive relationship with the lip shape, but they are difficult to detect automatically with good accuracy, so this embodiment uses MFCC coefficients, which are widely used in speech recognition.
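The variation amount described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the feature vectors (for example, MFCC) are assumed to be already computed at a fixed frame hop, the frames on each side of a time point are averaged before comparison, and the function names and the `half_window` parameter are hypothetical.

```python
def _mean_vector(frames):
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def feature_variation(frames, half_window=2):
    """Mean squared error between the averaged features before and after each frame.

    frames: list of feature vectors (e.g. MFCC), one per analysis frame.
    half_window: frames taken on each side (assumption: a 10 ms hop,
                 so 2 frames is roughly the 20 ms mentioned in the text).
    Positions too close to either end keep a variation of 0.0.
    """
    n = len(frames)
    variation = [0.0] * n
    for t in range(half_window, n - half_window + 1):
        before = _mean_vector(frames[t - half_window:t])
        after = _mean_vector(frames[t:t + half_window])
        variation[t] = sum((a - b) ** 2 for a, b in zip(after, before)) / len(before)
    return variation
```

On a sequence whose features jump abruptly at one frame, the variation is largest at the frame where the jump occurs, which is the kind of position step S21 looks for.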

Next, in step S23, it is determined whether this variation amount (for example, the MFCC mean squared error) exceeds a threshold. The threshold is set beforehand by experiment at a level that indicates a change of phoneme. The peak positions of the variation amount that exceed the threshold serve as the basis for deciding when to issue operation commands to the android 12.
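The thresholded peak search can be sketched as below. The text only states that peaks above an experimentally chosen threshold are used; the local-maximum test and the function name here are assumptions made for illustration.

```python
def find_variation_peaks(variation, threshold):
    """Indices of local maxima of `variation` that exceed `threshold`.

    A point counts as a peak when it is not lower than its left neighbour
    and strictly higher than its right neighbour (assumed definition).
    """
    peaks = []
    for t in range(1, len(variation) - 1):
        value = variation[t]
        if value > threshold and value >= variation[t - 1] and value > variation[t + 1]:
            peaks.append(t)
    return peaks


print(find_variation_peaks([0.0, 0.25, 1.0, 0.25, 0.0, 0.1, 0.05], threshold=0.2))  # [2]
```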

If "NO" in step S23, the process returns to step S21 and is repeated for the voice data starting at the next time point.

If "YES" in step S23, acoustic features (for example, MFCC) are extracted in step S25 from the voice within a predetermined time (for example, about 100 ms) before and after the point at which the large variation was detected, and the lip shape is estimated in step S27 using a nonlinear model. Possible estimation methods include linear regression analysis, neural networks, HMMs, and KNN. Another approach is to recognize the phoneme first and then map the phoneme to a lip shape, but because phoneme recognition is not very accurate, mapping directly from the acoustic features to the lip shape is considered more efficient. In addition, because the relationship between acoustic features and lip shape is nonlinear, a nonlinear model such as a neural network is used. For this purpose, the model is trained in advance on a lip shape database built from prerecorded video data or motion capture, and the information needed for the learned nonlinear mapping is stored in the memory 74.
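As a rough illustration of a nonlinear mapping from acoustic features to a lip shape, the following sketch evaluates a one-hidden-layer neural network. The network size, the weights, and the representation of the lip shape as a short parameter vector are all assumptions; the text does not specify them, and in practice the weights would come from the training described above.

```python
import math


def mlp_lip_shape(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network with tanh activation.

    features: acoustic feature vector (e.g. MFCC around the detected peak).
    w_hidden, b_hidden: hidden-layer weights (one row per unit) and biases.
    w_out, b_out: output-layer weights (one row per lip parameter) and biases.
    Returns a hypothetical lip-shape parameter vector.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
```

The tanh activation is what makes the mapping nonlinear; with a linear activation the model would reduce to the linear regression mentioned as an alternative.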

Subsequently, in step S29, control information for forming the estimated lip shape is set, and in step S31 the operation delay is estimated. Specifically, the control information for the actuators 40 of the android 12 takes into account both the static and the dynamic characteristics of actuator control. As the static characteristic, the control information for the actuators 40 that brings the android 12 close to each specific lip shape is obtained manually in advance, and a database associating lip shapes with control information is stored in the memory 74. As the dynamic characteristic, the time from when a command is issued until the android 12 actually reaches the target shape when the lips are moved toward that shape (this is called the operation delay) is obtained by experiment, and a database associating control information (lip shapes) with operation delays is stored in the memory 74. In step S37, described later, the time at which each operation command is sent is advanced or delayed based on this operation delay so that the movement is synchronized with the voice.

The delay before voice reproduction starts is determined from this operation delay information. As described above, in this embodiment the voice is always reproduced with a fixed delay, so this reproduction delay is set to a value larger than the maximum operation delay.
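The two databases and the choice of reproduction delay can be sketched as simple lookup tables. Every lip shape name, actuator name, and number below is a hypothetical placeholder, as is the safety margin; the text only requires that the reproduction delay exceed the largest operation delay.

```python
# Hypothetical static table: lip shape -> actuator control values.
CONTROL_INFO = {
    "closed": {"jaw": 0, "lip_corner": 0},
    "open": {"jaw": 200, "lip_corner": 0},
    "spread": {"jaw": 60, "lip_corner": 150},
}

# Hypothetical dynamic table: lip shape -> measured operation delay in ms.
OPERATION_DELAY_MS = {"closed": 120, "open": 180, "spread": 150}


def playback_delay_ms(delay_table, margin_ms=50):
    """Fixed reproduction delay, chosen larger than the maximum operation delay."""
    return max(delay_table.values()) + margin_ms


print(playback_delay_ms(OPERATION_DELAY_MS))  # 230
```

Choosing the delay this way means that any command can be issued early enough for its lip shape to be formed before the corresponding sound is reproduced.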

In step S33, it is determined whether estimation has been performed for a predetermined span of time. This embodiment assumes that the lip movement is reconstructed over a range wider than the range from which the acoustic features are extracted, for example in units of several phonemes or of words, and this is why the determination in step S33 is made. If "NO" in step S33, the process returns to step S21 and is repeated.

If "YES" in step S33, the lip movement is optimized over the section in step S35. That is, steps S21 and S25 are performed on relatively short spans of speech, and optimization of the movement is then attempted over a longer section that bundles these spans together. Because the android 12 may not be able to reproduce the estimated lip shapes exactly, the lip movement is transformed based on the time series of estimated lip shapes. For example, the following transformations are possible.
(a) Simplification of movement: only some parts are moved. The parts used differ from movement to movement, and the parts that should be emphasized differ depending on the sound and the movement, so the movement of parts that can be omitted is omitted. For example, depending on the sound and movement, the movement is reduced to only its most characteristic component, such as the jaw joint, the forward and backward movement of the lips, or the movement of the sides of the mouth.
(b) Presentation of important phonemes only: taking intonation and the like into account, only the emphasized, conspicuous movements in the phoneme sequence are performed. For example, using prosodic features (such as pitch and intensity), syllables uttered with high pitch, strong power, and long duration are regarded as emphasized.
(c) Relativization of the amount of movement: only the presence of movement is presented, and the amount of movement is not made to follow the estimate. For example, rather than setting the magnitude of movement of a designated actuator 40 exactly as originally specified, the amount is reduced to a slight movement when it is judged that a smaller movement achieves an equivalent effect. This makes the movement look more natural and allows it to be performed quickly. In addition, to make a movement stand out, temporally consecutive movements of the same part are suppressed or amplified. For example, when the magnitude of movement is expressed on a scale of 1 to 5 (5 = large) and the sequence 3→5→3 is to be executed for a single actuator 40 but the "3" portions are not very important, the sequence is replaced with 1→3→1 or 0→3→0. Because it is often only the relative degree of movement that matters to the viewer, the movement can be made to look the same even after such a replacement. By suppressing or amplifying the movement over a certain period, including the preceding and following movements, the movement can be performed quickly and, in some cases, a stronger effect can be given to the viewer, for example by showing more clearly the portion to be emphasized.
(d) Conversion of movement speed: the lips look more natural if they show their maximum change at the moment a particular sound is uttered, so the target shape is formed quickly.
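The relativization of the amount of movement described above can be illustrated with a uniform shift of a magnitude sequence, which reproduces the 3→5→3 to 1→3→1 example from the text. This is one possible reading and only a sketch: the function name and the `floor` parameter are hypothetical, and the 0→3→0 variant in the text is not a uniform shift and is not covered here.

```python
def relativize(levels, floor=1):
    """Shift a sequence of movement magnitudes so that its minimum equals `floor`.

    The differences between consecutive magnitudes are preserved, so the
    relative degree of movement stays the same while the absolute
    amount of movement is reduced.
    """
    shift = min(levels) - floor
    return [level - shift for level in levels]


print(relativize([3, 5, 3]))  # [1, 3, 1]
```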

In this optimization, it is desirable to convert the movement into one that is as simple as possible, taking delays and other factors into account. To look natural, however, the lip movement must be neither excessive nor insufficient, so the parameters are set appropriately by experiment. Depending on the uttered voice, optimization may not be performed at all.

If necessary, after the optimization in step S35, the operation delay estimation of step S31 may be repeated for each movement to obtain delays appropriate to the optimized movements. Re-estimation of the operation delay is required in particular when a movement has been amplified. When a movement has been suppressed or its speed converted, re-estimating the operation delay allows the interval from completion of the movement into a specific shape to the utterance of the corresponding sound to be made shorter or better timed.

In another embodiment, step S31 may be executed after step S35. In still another embodiment, steps S33 and S35 may be omitted; that is, the estimated lip shapes may be presented as they are.

Subsequently, in step S37, the issuance timing of each operation command is set based on the operation delay, with the voice reproduction start timing as the reference. That is, to synchronize with the voice, the issuance timing of the operation command for forming a specific lip shape is set relative to the voice reproduction start timing based on the estimated delay. Basically, the operation command is issued earlier than the time at which the sound corresponding to the lip shape is reproduced. This does not apply to the movement of closing the mouth after the utterance ends, for which the command may instead be delayed.
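The timing rule of step S37 can be sketched as follows: each command is issued one operation delay before the moment its sound is reproduced. The event representation, the millisecond units, and the function name are assumptions for illustration, and the special case of closing the mouth after the utterance is not modeled.

```python
def schedule_commands(events, playback_start_ms):
    """Compute when to issue each operation command.

    events: list of (audio_offset_ms, lip_shape, operation_delay_ms), where
            audio_offset_ms is the position of the sound within the utterance.
    playback_start_ms: time at which voice reproduction starts.
    Returns a list of (issue_time_ms, lip_shape) sorted by issue time.
    """
    schedule = []
    for audio_offset_ms, lip_shape, operation_delay_ms in events:
        sound_time_ms = playback_start_ms + audio_offset_ms
        # Issue early by the operation delay so the shape is formed in time.
        schedule.append((sound_time_ms - operation_delay_ms, lip_shape))
    return sorted(schedule)


events = [(0, "open", 180), (300, "spread", 150)]
print(schedule_commands(events, playback_start_ms=1000))
# [(820, 'open'), (1150, 'spread')]
```

Because the reproduction delay is larger than the maximum operation delay, the computed issue times never fall before the time at which the voice data was received.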

Then, in step S39, the operation command issuing process is started. The CPU 70 executes the operation command issuing process in parallel with other processes. In this process, each operation command is issued when it is determined that its issuance timing has arrived. Specifically, the CPU 70 outputs an operation command containing actuator control information to the communication board 76, which passes it to the MPU 90 of the control device 14b. In response, the MPU 90 passes the control information to the control board 98, the actuators 40 designated in the control information are driven, and the lip shape corresponding to the control information is formed on the face of the android 12. As described above, the issuance timing takes into account the operation delay in forming each lip shape, so the sound corresponding to a lip shape is output after that lip shape has actually been formed on the face of the android 12. Likewise, the lip shape corresponding to a sound is changed only after output of that sound has finished. The operation command issuing process ends when it is determined that all operation commands have been issued.

In step S41, it is determined whether unprocessed voice data remains. If "YES", the process returns to step S21 and is repeated. In this way, the android 12 outputs the remote operator's utterance accompanied by lip movement matched to that voice. If "NO" in step S41, the lip movement control process ends.

According to this embodiment, the lip shape is estimated from the acoustic features of the remote operator's utterance using a nonlinear model, and the issuance timing of the operation commands for the actuators 40 is set relative to the reproduction start timing of the utterance, taking into account the operation delay required to form each lip shape. The android 12 can thus achieve lip movement matched to the remote operator's utterance, and can hold a natural conversation without giving a sense of unnaturalness to the human it is attending to.

In the embodiment described above, the control device 14 executes the utterance process. In another embodiment, the remote operation terminal 20 may execute the utterance process and transmit operation commands to the control device 14. In that case, for example, step S1 of FIG. 3 determines whether the microphone 24 has detected a voice. When step S7 determines that the fixed time has elapsed since the voice was acquired (detected), a reproduction instruction containing the voice data is transmitted to the control device 14 in step S9. The control device 14 outputs the voice in response to receiving this reproduction instruction. In the operation command issuing process started in step S39 of FIG. 4, the remote operation terminal 20 transmits each operation command, containing control information, to the control device 14. On receiving an operation command, the control device 14 controls the actuators 40 according to the control information contained in it.

In each of the embodiments described above, the lip shape is estimated from the remote operator's utterance alone. In another embodiment, the lip shape may be estimated by combining the utterance with image processing. That is, a camera is provided at the remote operation terminal 20 to photograph the face of the remote operator while speaking, and the lip shape is detected by applying image processing to the captured image. The final lip shape is then estimated from both the lip shape estimated from the utterance and the lip shape estimated from the captured image. This improves the accuracy of the lip shape estimation. Alternatively, in still another embodiment, the lip shape may be estimated by combining the utterance with motion capture. That is, a motion capture system is combined with the remote operation terminal 20. Specifically, markers are attached at suitable positions on the remote operator's face and a plurality of cameras are provided. The face of the remote operator is photographed while speaking, the three-dimensional position of each marker is measured from the captured images, and the lip shape is detected from the marker positions. The final lip shape is then estimated from both the lip shape estimated from the utterance and the lip shape estimated from the three-dimensional positions. This also improves the accuracy of the lip shape estimation.
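The text does not say how the two estimates are combined into the final lip shape. One simple possibility, shown here purely as an assumption, is a weighted average of two lip-shape parameter vectors; the function name, the vector representation, and the weight are all hypothetical.

```python
def fuse_lip_shapes(audio_estimate, visual_estimate, audio_weight=0.5):
    """Weighted average of two lip-shape parameter vectors (assumed fusion rule).

    audio_estimate: lip shape estimated from the utterance.
    visual_estimate: lip shape estimated from images or marker positions.
    audio_weight: weight in [0, 1] given to the audio-based estimate.
    """
    if len(audio_estimate) != len(visual_estimate):
        raise ValueError("estimates must have the same number of parameters")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_estimate, visual_estimate)
    ]
```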

An illustrative view showing an example of the android control system of this invention.
A block diagram showing the electrical configuration of the android, the control device, and the environment sensor shown in FIG. 1.
A flowchart showing an example of the operation of the CPU 70 shown in FIG. 2 in the utterance process.
A flowchart showing an example of the operation of the lip movement control process shown in FIG. 3.

Explanation of symbols

10 ... Android control system
12 ... Android
14 ... Control device
20 ... Remote operation terminal
22, 48 ... Speaker
24, 50 ... Microphone
40 ... Actuator
70 ... CPU
74, 94 ... Memory
80 ... Voice input/output board
90 ... MPU
98 ... Control board

Claims

A speech production control system for a remotely operated android, which controls the utterance movement of an android remotely operated by an operator in accordance with the operator's utterance voice, the system comprising:
voice acquisition means for acquiring the voice uttered by the operator;
voice reproduction means for reproducing the acquired voice with a fixed time delay;
lip shape estimation means for estimating a lip shape from acoustic features of the acquired voice using a nonlinear model;
operation delay estimation means for estimating an operation delay indicating the time taken from when an operation command for forming the lip shape is issued until the lip shape is formed;
operation command setting means for setting the issuance timing of the operation command based on the estimated operation delay, with the reproduction start timing of the voice reproduction means as the reference; and
operation command issuing means for issuing each operation command in accordance with the issuance timing set by the operation command setting means.

The system according to claim 1, further comprising optimization means for optimizing the movement over a predetermined section based on the time series of lip shapes estimated by the lip shape estimation means for that section.