JP2008116551A

JP2008116551A - Expression attaching voice generator

Info

Publication number: JP2008116551A
Application number: JP2006297845A
Authority: JP
Inventors: Tomoko Yonezawa; 朋子米澤; Noriko Suzuki; 紀子鈴木; Shinji Abe; 伸治安部; Kenji Mase; 健二間瀬; Kiyoshi Kogure; 潔小暮
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-11-01
Filing date: 2006-11-01
Publication date: 2008-05-22

Abstract

PROBLEM TO BE SOLVED: To provide an expression-attaching voice generator capable of generating voice whose expression intensity visually matches the expression intensity of a body motion.

SOLUTION: The expression-attaching voice generator 10 includes a computer 22. Based on sensor values from sensors 161a to 164, 181, and 182 provided on a glove-type sensor 14 worn on the hand that operates a hand puppet 12, the computer determines the expression intensity of the hand puppet's body motion by referring to an interpretation table 24. It then determines a morphing rate by referring to a body motion-singing voice expression table 28, so that the singing voice is given expression of an intensity that matches that of the body motion. Original singing voices stored in advance in a singing voice database 26 are morphed according to the morphing rate, and the sound is output from a speaker 34. According to the invention, a singing voice whose expression intensity visually matches the expression intensity of the body motion can be generated.

COPYRIGHT: (C)2008,JPO&INPIT

Description

This invention relates to an expression-attaching voice generator, and more particularly to an expression-attaching voice generator that outputs expressive voice using, for example, a voice morphing technique.

In conventional voice expression with emotion, research on emotional speech includes rule-based approaches that manipulate F0 (fundamental frequency), speech rate, and the like, as described in Non-Patent Document 1, and corpus-based approaches such as that of Non-Patent Document 2.

Rule-based methods mainly handle prosodic information, whereas corpus-based methods can also handle voice timbre, even for the expression of a singing voice whose prosodic information is fixed. When the expression changes, however, the discontinuity of expression between corpora becomes conspicuous.

The present inventors have also proposed ESVM (Expressive Singing Voice Morphing) (Non-Patent Document 5), an expression technique that can vary the intensity of expression continuously by performing voice morphing with STRAIGHT (a speech analysis, conversion, and synthesis system), which is known from Non-Patent Documents 3 and 4 and elsewhere.

The ESVM of Non-Patent Document 5 enables natural expression and is expected to find use in a variety of fields.

Using this ESVM voice morphing technique, the inventors have proposed an expression-attaching voice generator that can output expressive voice in response to the body motions (gestures) of a hand puppet (Non-Patent Document 6).
Non-Patent Document 1: Schroder, M., “Emotional Speech Synthesis: A Review,” Proc. Eurospeech, volume 1, pp. 561-564, 2001
Non-Patent Document 2: Iida, A., Iga, S., Higuchi, F., Campbell, N., Yasumura, M., “A Speech Synthesis System with Emotion for Assisting Communication,” Proc. ISCA Workshop on Speech and Emotion, pp. 167-172, 2000
Non-Patent Document 3: Kawahara, H., Masuda-Katsuse, I., and de Cheveigne, A., “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, 27, pp. 187-207, 1999
Non-Patent Document 4: http://www.wakayama-u.ac.jp/~kawahra.STRAIGHTadv/ (High-Quality Speech Analysis, Conversion, and Synthesis System STRAIGHT)
Non-Patent Document 5: Yonezawa, T., Suzuki, N., Mase, K., Kogure, K., “Perceptual examination of expressive singing voice morphing,” Proc. Spring Meeting of the Acoustical Society of Japan, pp. 809-810, 2004 (in Japanese)
Non-Patent Document 6: Yonezawa, T., Suzuki, N., Mase, K., Kogure, K., “A system for continuously adding expression to a singing voice using anthropomorphic gesture expression,” Journal of the Acoustical Society of Japan, Vol. 62, No. 3, pp. 233-243, 2006 (in Japanese)

In the system of Non-Patent Document 6, it would be more natural and desirable for the intensity of the body-motion (gesture) expression and the intensity of the voice expression to match visually.

Therefore, a principal object of this invention is to provide a novel expression-attaching voice generator.

Another object of this invention is to provide an expression-attaching voice generator that can generate voice whose expression intensity visually matches the expression intensity of a body motion.

The invention of claim 1 is an expression-attaching voice generator that generates voice with expression corresponding to the expression of a body motion input through body motion input means. It comprises: a voice signal database that stores in advance the voice signals of at least two voices with different expressions; setting means for setting a voice expression intensity that matches the expression intensity of the body motion; morphing means for morphing two or more voice signals read from the voice signal database at a morphing rate corresponding to the voice expression intensity set by the setting means; and voice output means for outputting voice from the voice signal that results from the morphing by the morphing means.

In the invention of claim 1, a computer (22) is used, and a voice signal database (26) is set up in the computer (22). This voice signal database (26) stores in advance, for example, the voice signal of a voice “no” without expression and the voice signals of voices with the different expressions “da”, “wh”, and “we”. Based on sensor signals from sensors (161-182) provided on a glove (14) worn on the hand that operates, for example, a hand puppet (12), the computer (22) determines the expression intensity of the hand puppet's current gesture, that is, of its body motion. The morphing means, which is likewise the computer (22) or another circuit, then morphs the original singing voices in the voice signal database at a morphing rate that follows the expression intensity of that body motion.

In the invention of claim 1, voice is generated at a morphing rate that follows the expression intensity of the body motion, so the generated voice has an expression intensity well matched to that of the body motion.

The invention of claim 2 is the expression-attaching voice generator according to claim 1, wherein the morphing means morphs the two or more voice signals according to an inverse sigmoid function, and the setting means sets the voice expression intensity according to a sigmoid function or a variant of it.

Functions such as equations (2) to (6) described later are conceivable as the sigmoid function, or its variant, used in the invention of claim 2.

According to this invention, when voice with expression corresponding to a body motion is generated, morphing is performed at a morphing rate that follows the expression intensity of the body motion, so voice can be output with an expression intensity that visually matches the expression intensity of the body motion.

The above object, other objects, features, and advantages of this invention will become more apparent from the following detailed description of embodiments, given with reference to the drawings.

The expression-attaching voice generator 10 (FIG. 6) of one embodiment of this invention uses the ESVM morphing described above to control, for example, the expression of a singing voice, and uses gestures (postures) to control the intensity of that expression. ESVM is described in detail in a co-pending patent application (Japanese Patent Laid-Open No. 2006-178052), so that description is referred to where necessary and to the extent possible.

A hand puppet operated by hand is used as one example of means for inputting such a gesture, that is, the expression intensity of a posture. The hand puppet includes the stuffed toy 12 shown in FIG. 1. The stuffed toy 12 is made entirely of a flexible material such as cloth or felt. It includes a palm portion 120 that receives the subject's palm, and a thumb portion 121, an index finger portion 122, and a middle finger portion 123 that communicate internally with the palm portion 120 and into which the subject's thumb, index finger, and middle finger can be inserted. As illustrated, the stuffed toy 12 of the embodiment represents a bird: the index finger portion 122 is the head, and the thumb portion 121 and middle finger portion 123 on either side of it are the wings. The shape of the stuffed toy 12 can of course be changed freely.

To use the stuffed toy 12 in this way to input the posture expression intensity that controls the singing-voice expression, it is important to adopt an appropriate anthropomorphic representation. In the embodiment, the motion of the stuffed toy 12, which has two arms (wings) and a head, is controlled with three fingers. Controlling the utterance timing by mouth movement is also conceivable, but in this embodiment the “expression” is controlled by whole-body gestures of the stuffed toy 12.

To exploit the appearance of the stuffed toy 12 while using it as an input device, it is important to measure the hand's movement as the movement of the stuffed toy 12. To obtain motion data accurate enough for a controller of singing-voice expression, the hand puppet comprises the stuffed toy 12, which serves as an anthropomorphic cover, and a separate glove-type sensor 14 that measures the hand's movement.

That is, the subject's hand is inserted into the stuffed toy 12 as shown in FIG. 2, and the glove-type sensor 14 is worn on that hand. The glove-type sensor 14 has a palm portion 140 that receives the palm, and a thumb portion 141, an index finger portion 142, and a middle finger portion 143 that communicate internally with the palm portion 140 and receive the thumb, index finger, and middle finger, respectively. Finger portions for the ring finger and little finger are of course also formed, but they are not discussed here.

Referring to FIGS. 3 and 4, the glove-type sensor 14 includes the palm portion 140 and the finger portions 141-143 described above. A thumb first bending sensor 161a is provided on the surface of the thumb portion 141, with a length that covers at least the first and second joints of the thumb portion 141. A thumb second bending sensor 161b is provided on the side of the thumb portion 141, likewise with a length that covers its first and second joints. An index finger first bending sensor 162a is provided on the surface of the index finger portion 142 with a length that covers at least its first and second joints, and an index finger second bending sensor 162b is provided on the side of the index finger portion 142 with a length that likewise covers the first and second joints. Further, a middle finger first bending sensor 163a is provided on the surface of the middle finger portion 143 with a length that covers at least its first and second joints, and a middle finger second bending sensor 163b is provided, with the same coverage, on the side of the middle finger portion 143 that faces away from the thumb and index finger portions. The middle finger second bending sensor 163b is placed on the opposite side from the thumb second bending sensor 161b and the index finger second bending sensor 162b to avoid interference between the index finger portion 142 and the sensor 163b. If there is little or no such interference, it may be placed on the same side as the other second bending sensors 161b and 162b.

As shown in FIG. 5, the thumb first bending sensor 161a is placed on the surface (back-of-hand side) of the thumb portion 141, and the thumb second bending sensor 161b is placed on the side of the thumb portion 141, offset from the former by 90 degrees. This allows the bending angles in two directions, X and Y, which differ by 90 degrees, to be measured separately. For the same reason, the index finger first and second bending sensors 162a and 162b, and the middle finger first and second bending sensors 163a and 163b, are also offset from each other by 90 degrees.

Each of the bending sensors 161a, 161b, 162a, 162b, 163a, and 163b is a piezoelectric element and outputs a voltage that varies with the bending angle in the direction perpendicular to its main surface. Detecting this voltage therefore makes it possible to detect or measure the bending angle of each bending sensor, that is, of each finger portion, in that direction.

As shown in FIG. 4, pressure sensors 181 and 182 are provided on the pad side of the fingertips of the thumb portion 141 and the middle finger portion 143 of the glove-type sensor 14. The pressure sensors 181 and 182 are also piezoelectric elements and output a voltage whose magnitude corresponds to the pressure applied to their surfaces. The two pressure sensors 181 and 182 make it possible to detect the state in which the tip of the thumb portion 141 and the tip of the middle finger portion 143 are pressed together.

In the embodiment, so that not only bending but also backward arching can be measured at the index finger portion 142 (the head of the hand puppet), the structure is such that the head 122 of the stuffed toy 12 faces forward when the index finger is already bent somewhat toward the palm. A further bending sensor 164, shown by a dotted line in FIG. 3, is provided on the inner surface of the back-of-hand side of the palm portion 140 of the glove-type sensor 14. This wrist bending sensor 164 is also a piezoelectric element and detects the backward arching of the index finger portion 142, that is, the degree of bending toward the back of the hand.

Alternatively, another glove (not shown) may be provided inside the glove-type sensor 14, and the wrist bending sensor 164 may be attached to the back (surface) of that inner glove.

The elongated bending sensors 161a, 161b, 162a, 162b, 163a, 163b, and 164 (hereinafter sometimes written “161a-164”) are all attached to the glove-type sensor 14 (and the inner glove), and sewing them on loosely with thread is the appropriate attachment method. If they are sewn on tightly or glued, then when a finger portion of the glove bends through a large angle, particularly toward the palm, the bending sensor is pulled taut, can no longer follow the bend of the finger portion, and may fail, for example by breaking.

As shown in FIG. 6, the output voltages from the bending sensors 161a-164 and the pressure sensors 181 and 182 described above are converted to digital data by an A/D converter 20 and input to the computer 22. Based on the voltages from the sensors 161a-164, 181, and 182, the computer 22 detects the motion of the hand and fingers as the gesture of the stuffed toy 12, that is, as the expression intensity of its posture, and gives the singing voice its expression by performing voice morphing at a morphing rate that corresponds to that posture expression intensity.

To interpret the voltage values from the sensors 161a-164, 181, and 182 as posture expression intensity, an interpretation table 24 is set in advance in a memory (not shown) of the computer 22.

The voltage value from each of the sensors 161a-164, 181, and 182 is not directly proportional to the bending angle. As shown in FIG. 7, each voltage value follows a kind of saturation curve: it changes greatly when the bending angle is small and changes less as the bending angle increases. Interpreting the voltage value directly as the degree of motion (bending) would therefore give a wrong interpretation.

The interpretation table 24 therefore holds a conversion table or conversion formula that transforms a curve like that of FIG. 7 so that the voltage value varies linearly with the bending angle. The computer 22 converts each sensor value (voltage value) with the interpretation table 24, estimates each bending angle and pressure from the converted value, and thereby identifies the posture expression intensity.
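The conversion held in the interpretation table 24 can be pictured as a lookup with interpolation. The following is a minimal sketch, assuming a piecewise-linear table; the calibration points and the names `CALIBRATION` and `linearize` are hypothetical and are not taken from FIG. 7 or from the embodiment.

```python
from bisect import bisect_right

# Hypothetical calibration points (bend degree, normalized sensor reading).
# They only imitate a saturating curve; they are not measured values.
CALIBRATION = [(0.0, 0.0), (0.1, 0.35), (0.25, 0.6), (0.5, 0.82), (1.0, 1.0)]


def linearize(reading, calibration=CALIBRATION):
    """Map a saturating sensor reading back to a bend degree in [0, 1]."""
    degrees = [d for d, _ in calibration]
    readings = [r for _, r in calibration]
    # Clamp readings that fall outside the calibrated range.
    if reading <= readings[0]:
        return degrees[0]
    if reading >= readings[-1]:
        return degrees[-1]
    # Find the calibration segment that contains the reading.
    i = bisect_right(readings, reading)
    r0, r1 = readings[i - 1], readings[i]
    d0, d1 = degrees[i - 1], degrees[i]
    # Interpolate linearly inside that segment.
    return d0 + (d1 - d0) * (reading - r0) / (r1 - r0)
```

With these assumed points, a reading of 0.35 maps back to a bend degree of 0.1, which shows how a large change in reading near the start of the curve corresponds to only a small change in angle.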

FIG. 7 shows the relationship between the sensor value and the angle for one bending sensor. The position marked “1.0” on the horizontal axis is where the bending angle is 100 percent, and the degree of bending (%) is read relative to this reference. The degree of a gesture is defined in the same way: it is the percentage, at or below 100, relative to the maximum change produced by that gesture.

Table 1 shows the relationship between the sensor values actually output from the sensors 161a-182 of the sensor-equipped glove 14 and the bending angles of the finger portions and wrist, as well as the contact pressure between the thumb portion and the middle finger portion. In the embodiment, each sensor value in Table 1 is expressed with a resolution of 128 levels (0-127). For example, the thumb first bending sensor outputs a voltage when bent toward the palm, with a maximum value of 110 and a minimum value of 60; the sensor value is at its minimum when the motion is largest. For the wrist bending sensor, bending the wrist toward the back of the hand produces sensor values that change from 20-30 to 50-60. For the pressure sensors, the maximum value is 127 and the minimum value is below 127; the pressure sensors thus act as touch sensors and detect contact between the thumb portion and the middle finger portion.
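As a worked example of the thumb first bending sensor figures above, a raw value in the 0-127 range can be rescaled to a degree of motion. This is a sketch only: it assumes that the maximum reading (110) corresponds to the resting state, since the text states only that the minimum (60) corresponds to the largest motion, and it ignores the saturation correction performed by the interpretation table 24. The function name `motion_degree` is hypothetical.

```python
def motion_degree(value, rest_value=110, full_value=60):
    """Rescale a raw sensor value to a degree of motion in [0, 1].

    rest_value: reading assumed to correspond to no motion (assumption).
    full_value: reading at the largest motion (the stated minimum).
    """
    degree = (rest_value - value) / (rest_value - full_value)
    # Clamp so that readings outside the observed range stay in [0, 1].
    return min(1.0, max(0.0, degree))
```

Under these assumptions a raw value of 85 lies halfway between 110 and 60 and gives a degree of 0.5, that is, 50 percent of the largest motion.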

The computer 22 refers to the interpretation table 24 on the basis of the input sensor values of Table 1 and interprets the posture expression intensity (gesture) of the stuffed toy 12 at that moment. As shown in Table 1 and FIGS. 8 to 11, this embodiment uses a posture without expression (neu) and three postures (bak, drp, str) that are characteristic within the hand's range of motion, left-right symmetric, and easily perceived as expressive.

FIG. 8 shows the “calm (no expression)” state. FIG. 9 shows the motion called “bending back (back: bak)”, in which the index finger portion (head) 122, the thumb portion 121, and the middle finger portion 123 are arched toward the back of the hand. FIG. 10 shows the motion called “drooping (droop: drp)”, in which the head 122 faces downward. FIG. 11 shows the motion called “stretching forward (stretch: str)”, a posture in which the thumb portion 121 and the middle finger portion 123 are thrust forward.

To obtain singing voices whose expression intensity varies continuously, voice morphing using STRAIGHT (Non-Patent Document 3) was applied to create morphed singing voices with expression.

First, the singing voice of an amateur female singer in her twenties, suited to producing expressions that differ only in timbre, was recorded as morphing source material at a sampling frequency of 44.1 kHz. Four types in total (Table 2) were recorded, each with consistent expression throughout the singing: “no” (a “flat” singing voice without expression), and “da”, “wh”, and “we” (singing voices with expression) as examples of expression by independent articulatory elements, namely the oral cavity, nasal cavity, and exhalation. Two bars, “kobuna tsurishi”, were taken from “Furusato”, a school song long familiar in Japan, because of their good balance of phoneme count and rhythm. To align F0 and singing tempo, the singer sang to the same accompaniment (C major scale, 3/4 time, 120 beats per minute). The recorded singing voices had a speech rate of about 2.0 morae per second and an F0 range averaging about 300 Hz to 450 Hz. Each singing voice lasts about 3.0 seconds on average. These original singing voices are stored in the singing voice database 26.

Using this data, voice morphing was performed between the singing voice without expression and each singing voice with expression (three pairs in total: no↔da, no↔wh, no↔we) to create singing voices whose expression intensity differs in steps. Here the morphing rate is the ratio of the feature values, with the expressionless original voice “no” taken as 0 and the original voice with expression taken as 1. The expression intensity is adjusted by the morphing rate, and FIG. 12 shows an example of this adjustment.
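The meaning of the morphing rate as a feature ratio can be illustrated with a simple weighted average. This is a simplified sketch and not the STRAIGHT procedure itself: it assumes that both voices have already been reduced to time-aligned feature vectors of equal length, and the function name `morph_features` and the sample numbers are hypothetical.

```python
def morph_features(neutral, expressive, rate):
    """Blend two aligned feature vectors.

    rate = 0 returns the features of the expressionless voice "no";
    rate = 1 returns the features of the voice with expression.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    if len(neutral) != len(expressive):
        raise ValueError("feature vectors must have the same length")
    return [(1.0 - rate) * n + rate * e for n, e in zip(neutral, expressive)]


# Illustrative numbers only: two features per voice.
halfway = morph_features([300.0, 1.0], [450.0, 3.0], 0.5)
```

Here `halfway` is `[375.0, 2.0]`, the midpoint of the two feature vectors, which corresponds to a morphing rate of 0.5 between “no” and a voice with expression.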

In the embodiment, the four types of original voice (original singing voice) “no”, “da”, “wh”, and “we” shown in Table 2 are registered in advance in the singing voice database 26 in a memory (not shown), and the singing voice is given expression by synthesizing (morphing) these original voices.

Specifically, a method of determining the morphing rate when morphing among three types of voice is described with reference to FIG. 12. Suppose that morphing is to be performed among three voices A, B, and C. As shown in FIG. 12, consider a triangle whose vertices 100, 102, and 104 correspond to these three voices.

Dividing each side of this triangle into a predetermined number of segments and connecting the division points with lines parallel to the sides yields the mesh 110 in FIG. 12. The morphed voice corresponding to each point of the mesh 110 can be created as follows.

For example, for the intermediate voices corresponding to the division points between voice A and voice B, the morphing rate is determined, for example with a sigmoid function, so that the voice changes at a constant rate between the two voices. The morphing rate in this case is determined by the body motion-singing voice expression table 28. Morphing between voice A and voice C, and between voice B and voice C, can be performed in the same way. Furthermore, the intermediate voice at each intersection of the mesh 110 (for example, intersection 112) can be created by morphing the intermediate voices at the two ends (for example, points 114 and 116) of any line passing through that intersection, at a morphing rate corresponding to the ratio of the distances from the two ends to the intersection. Intermediate voices can thus be created for every point of the mesh 110.
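The two-stage construction of a mesh point can be sketched with weight vectors over the three voices A, B, and C. The sketch below makes several assumptions: rates are applied linearly rather than through a sigmoid, the positions chosen for points 114 and 116 are illustrative because FIG. 12 is not reproduced here, and voices are represented as time-aligned feature vectors. The names `mix`, `mesh_weights`, and `blend` are hypothetical.

```python
def mix(w0, w1, rate):
    """Morph between two weight vectors; rate 0 gives w0, rate 1 gives w1."""
    return tuple((1.0 - rate) * a + rate * b for a, b in zip(w0, w1))


def mesh_weights(divisions):
    """(A, B, C) weights of every mesh point for the given number of divisions."""
    n = divisions
    return [(i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i)]


def blend(weights, voices):
    """Weighted sum of aligned feature vectors, one vector per original voice."""
    return [sum(w * v[k] for w, v in zip(weights, voices))
            for k in range(len(voices[0]))]


A, B, C = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
point_114 = mix(A, B, 0.5)                  # assumed: halfway along side A-B
point_116 = mix(A, C, 0.5)                  # assumed: halfway along side A-C
point_112 = mix(point_114, point_116, 0.5)  # midpoint of the line 114-116
```

With these assumed positions, `point_112` carries the weights (0.5, 0.25, 0.25), so repeating a two-voice morph twice is equivalent to a single weighted blend of the three original voices. The same repetition extends to four or more voices.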

This method is therefore applicable not only when there are three original voices, as in FIG. 12, but also, by repeating the determination of the morphing rate between two voices, when there are four original voices as in the embodiment (“no”, “da”, “wh”, “we”; that is, “normal”, “dark”, “whisper”, “wet”) or more.

The determination of the morphing rate using the sigmoid function is described in detail in Japanese Patent Laid-Open No. 2006-178052, mentioned earlier.

クロスモダリティ(Cross Modality:複数のモダリティ(様相)の相互作用)を検討する際には、基本となる表現の種類の組合せが自然に受容されることが望ましい。身体動作表現に対する歌声表現のマッピングとして、共通の単純で特徴的なメタファがある組合せが望ましいと仮定し、歌声を発声する状況の典型的な例として表３に示すメタファ(metaphor:隠喩または暗喩)を予め設定した。このメタファに基づいて歌声と姿勢の表情付けの種類を組合せた。 When examining cross-modality (the interaction of multiple modalities), it is desirable that the combination of basic expression types be accepted naturally. Assuming that, as a mapping from body-motion expression to singing-voice expression, combinations that share a simple and characteristic metaphor are desirable, the metaphors shown in Table 3 were set in advance as typical examples of situations in which a singing voice is produced. The expression types of the singing voice and the posture were paired on the basis of these metaphors.

コンピュータ２２には、さらに、図示しないメモリ内に、身体動作（ジェスチャ）-歌声表現テーブル２８が予め設定されている。この身体動作-歌声表現テーブル２８は、後述のように、解釈したジェスチャすなわち姿勢の表情強度を歌声の表情付けにマッピングするためのテーブルである。つまり、身体動作-歌声表現テーブル２８は、ぬいぐるみ１２の姿勢の表情強度によって、４つの元歌声をどのようなモーフィング率でモーフィングするかを決める。この実施例ではこのテーブル２８に後述の図２６に示すシグモイド関数f(x)およびそれの変形を設定しておき、解釈テーブル２４に従って解釈されたジェスチャおよびそれの強度、すなわち身体動作の表情強度に対応したモーフィング率を決定する。 In the computer 22, a body motion (gesture)-singing voice expression table 28 is also preset in a memory (not shown). As described later, this table 28 maps the interpreted gesture, that is, the expression intensity of the posture, to the expression of the singing voice. In other words, the body motion-singing voice expression table 28 determines the morphing rates at which the four original singing voices are morphed according to the expression intensity of the posture of the stuffed toy 12. In this embodiment, the sigmoid function f(x) shown in FIG. 26 (described later) and variants of it are set in the table 28, and the morphing rate is determined so as to correspond to the gesture interpreted according to the interpretation table 24 and its intensity, that is, the expression intensity of the body motion.
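As a rough illustration of what such a table could compute, the following Python sketch maps an interpreted gesture and its intensity to morphing rates for the four source voices. The pairing da:bak, wh:drp, we:str follows the text; the rescaled logistic curve, its gain, and the function names are assumptions, since the actual f(x) is given only in FIG. 26.

```python
import math

# Posture expression -> singing-voice expression (da:bak, wh:drp, we:str in the text).
GESTURE_TO_VOICE = {"bak": "da", "drp": "wh", "str": "we"}


def f(x, gain=8.0):
    """Logistic curve on [0, 1], rescaled so that f(0) == 0 and f(1) == 1.
    The gain value is an assumption, not taken from the patent."""
    def raw(v):
        return 1.0 / (1.0 + math.exp(-gain * (v - 0.5)))
    lo, hi = raw(0.0), raw(1.0)
    return (raw(x) - lo) / (hi - lo)


def morph_rates(gesture, intensity):
    """Morphing rates for the four source voices; the rates sum to 1."""
    x = min(1.0, max(0.0, intensity))       # clamp the expression intensity
    r = f(x)
    rates = {"no": 1.0 - r, "da": 0.0, "wh": 0.0, "we": 0.0}
    rates[GESTURE_TO_VOICE[gesture]] = r    # morph from "no" toward the paired voice
    return rates
```

For example, `morph_rates("bak", 0.0)` leaves the neutral voice "no" unchanged, while `morph_rates("bak", 1.0)` selects the "da" voice entirely.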

そして、図６に示す実施例では、身体動作-歌声表現テーブル２８に基づいて決定したモーフィング率でモーフィングを行うために、先に説明したSTRAIGHTを用いた音声モーフィングを行う音声合成部３０を設けた。この音声合成部３０は、コンピュータ２２とはハード的には別の専用回路（たとえばＡＳＩＣ）として形成されてもよく、コンピュータ２２に十分な能力があれば、コンピュータ２２の一機能として実行されてもよい。この音声合成部３０では、歌声データベース２６に予め登録または格納しておいた少なくとも２つ（実施例では４つ）の元音声（元歌声）を図８に従ったモーフィング率でモーフィングする。 In the embodiment shown in FIG. 6, a speech synthesizer 30 that performs voice morphing using STRAIGHT, described earlier, is provided in order to morph at the morphing rate determined from the body motion-singing voice expression table 28. The speech synthesizer 30 may be implemented as a dedicated circuit (for example, an ASIC) separate in hardware from the computer 22 or, if the computer 22 has sufficient capability, as a function of the computer 22. The speech synthesizer 30 morphs at least two (four in the embodiment) original voices (original singing voices), registered or stored in advance in the singing voice database 26, at the morphing rate according to FIG. 8.

図６の実施例において、コンピュータ２２は、図１３の最初のステップＳ１１において、Ａ／Ｄ変換器２０から、各センサ１６１ａ‐１６４，１８１および１８２から出力される電圧（センサ値）を読み取る。 In the embodiment of FIG. 6, the computer 22 reads the voltage (sensor value) output from each of the sensors 161a-164, 181 and 182 from the A / D converter 20 in the first step S11 of FIG.

そして、コンピュータ２２は、図１３のステップＳ１３において、解釈テーブル２４を参照して、そのときのぬいぐるみ１２のジェスチャすなわち身体動作の表情強度を、先の表１のように、同定する。 Then, in step S13 in FIG. 13, the computer 22 refers to the interpretation table 24 and identifies the gesture intensity of the stuffed animal 12 at that time, that is, the facial expression intensity of the body movement, as shown in Table 1 above.

その後、コンピュータ２２は、ステップＳ１５で、図６に示す身体動作-歌声表現テーブル２８を参照して、姿勢（身体動作）の表情強度に基づいて、先に図１２を参照して説明したようにモーフィング率を決定する。 Thereafter, in step S15, the computer 22 refers to the body motion-singing voice expression table 28 shown in FIG. 6 and based on the facial expression strength of the posture (body motion) as described above with reference to FIG. Determine the morphing rate.

そして、ステップＳ１７で、コンピュータ２２が、あるいは音声合成部３０が、歌声データベース２６から読み出した各元歌声（音声）信号を、ステップＳ１５で決定したモーフィング率に従ってモーフィングする。 In step S17, the computer 22 or the speech synthesizer 30 morphs each original singing voice (voice) signal read from the singing voice database 26 according to the morphing rate determined in step S15.

そして、ステップＳ１９において、コンピュータ２２は、反り返りジェスチャの程度に応じた音量でモーフィング音声がスピーカ３４から出力されるように、その反り返り（手首曲げセンサのセンサ値）に応じた音量設定信号をＤ／Ａ変換器３２に与える。 In step S19, the computer 22 supplies the D/A converter 32 with a volume setting signal corresponding to the backward bend (the sensor value of the wrist bending sensor), so that the morphed voice is output from the speaker 34 at a volume corresponding to the degree of the bending-back gesture.

このようにして、コンピュータ２２は、被験者またはユーザが自分の手（手袋型センサ１４を装着した）でぬいぐるみ１２のジェスチャすなわち姿勢の表情を変更することによって、たとえば４種類のモーフィング元音声「no」，「da」，「wh」，「we」をモーフィング（音声合成）した、モーフィング音声が発生され得る。 In this way, when the subject or user changes the gesture, that is, the posture expression, of the stuffed toy 12 with his or her own hand (wearing the glove-type sensor 14), the computer 22 can generate a morphed voice obtained by morphing (synthesizing), for example, the four morphing source voices "no", "da", "wh", and "we".
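The flow of steps S11 to S19 can be summarized in a short Python sketch. Only the order of the steps comes from the text; every function, the sensor key `wrist_bend`, and the toy stand-ins are hypothetical placeholders for the interpretation table 24, the expression table 28, the speech synthesizer 30, and the volume setting.

```python
def process_frame(sensor_values, interpret, rate_for, morph, volume_for):
    """One pass through steps S13 to S19 for sensor values already read in step S11."""
    gesture, intensity = interpret(sensor_values)   # S13: interpretation table 24
    rate = rate_for(gesture, intensity)             # S15: expression table 28
    voice = morph(gesture, rate)                    # S17: morph the source voices
    volume = volume_for(sensor_values)              # S19: volume from the wrist bend
    return voice, volume


# Toy stand-ins (all hypothetical) so that the sketch runs end to end.
def toy_interpret(values):
    return "bak", values["wrist_bend"]


def toy_rate(gesture, intensity):
    return min(1.0, max(0.0, intensity))


def toy_morph(gesture, rate):
    return {"target": gesture, "rate": rate}


def toy_volume(values):
    return round(100 * values["wrist_bend"])


voice, volume = process_frame(
    {"wrist_bend": 0.5}, toy_interpret, toy_rate, toy_morph, toy_volume
)
```

Passing the four stages in as functions keeps the sketch independent of how the tables and the synthesizer are actually realized, whether in the computer 22 or in dedicated hardware.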

この発明が向けられるクロスモダリティを検討する際には、上述のように、基本となる表現の種類の組合せが自然に受容されることが望ましい。そこで、発明者等は、さらに、擬人的表現における表情強度のクロスモダリティ特性について注目する。 When examining the cross-modality to which the present invention is directed, it is desirable that the combination of the basic expression types is naturally accepted as described above. Therefore, the inventors pay attention to the cross-modality characteristic of the expression intensity in the anthropomorphic expression.

擬人的表現が伴うクロスモダリティでは、目指す表情強度や適合性の高い適切な組合せが存在すると仮定し、
1.基本特性として、表現された表情強度と実際に知覚される強度の関係
−身体動作の表情強度の知覚的な補間傾向
−歌声と身体動作の複合表現の表情強度における知覚的な補間傾向
2.表情強度の異なるクロスモダリティの知覚傾向
−表情強度の知覚
−適合性の知覚
について調べるため、図６のシステム１０を用いて知覚評価を行った。実験１では基本的な表情強度の知覚特性を確認し、実験２ではモダリティ間で表情強度の異なる組合せの複合表現において、適合性および表情強度の知覚実験を行ない、実験３では表情強度が時系列的に変化するときにも実験２と同様の結果が得られるか確認した。 For cross-modality involving anthropomorphic expression, we assumed that appropriate combinations exist that achieve the intended expression intensity and high suitability. To investigate
1. as basic characteristics, the relationship between the expressed expression intensity and the intensity actually perceived
- the perceptual interpolation tendency of the expression intensity of body motion
- the perceptual interpolation tendency of the expression intensity of the combined expression of singing voice and body motion
2. the perceptual tendencies of cross-modality with differing expression intensities
- perception of expression intensity
- perception of suitability
perceptual evaluations were carried out using the system 10 of FIG. 6. Experiment 1 confirmed the basic perceptual characteristics of expression intensity. Experiment 2 tested the perception of suitability and of expression intensity for combined expressions in which the expression intensities differ between modalities. Experiment 3 checked whether results similar to those of Experiment 2 are also obtained when the expression intensity changes over time.

評価実験では、ぬいぐるみを用いた身体動作のうち、姿勢の表情強度を全被験者に見せる際再現性を保障することが必要である。しかし、ロボットなどの内部機構を用いてぬいぐるみを動かす場合は完全な再現性を保障することは現在の制御技術では困難である。そのため、手の運動能力に問題のない演者による手人形の身体動作を映したビデオを用いた視聴覚提示による知覚評価を導入した。
実験１(表情強度の知覚的補間効果)
歌声の表情付けの度合いが異なる音声では、補間された表情強度が知覚的にも補間されることが既に確認されている。実験１は、基本的な知覚特性として、身体動作の単独表現/身体動作と歌声の複合表現における表情強度においても知覚補間されることを確認するため行った実験である。 In the evaluation experiments, reproducibility must be guaranteed when the expression intensity of the posture, among the body motions performed with the stuffed toy, is shown to all subjects. However, when the stuffed toy is moved by an internal mechanism such as a robot, guaranteeing complete reproducibility is difficult with current control technology. Perceptual evaluation by audiovisual presentation was therefore adopted, using video of a hand puppet's body motion performed by a performer with no impairment of hand motor ability.
Experiment 1 (Perceptual interpolation effect of expression intensity)
For voices that differ in the degree of singing-voice expression, it has already been confirmed that interpolated expression intensity is also interpolated perceptually. Experiment 1 was conducted to confirm, as a basic perceptual characteristic, that perceptual interpolation also holds for expression intensity in body motion alone and in the combined expression of body motion and singing voice.

実験１を行なうに際して以下のIおよびIIの仮説を立てた。
I 身体動作(姿勢)の段階的な表情強度は知覚補間される。
II歌声と身体動作の複合表現でも表情強度は知覚補間される。 The following hypotheses I and II were set up for Experiment 1.
I The stepwise expression intensity of body motion (posture) is perceptually interpolated.
II Expression intensity is also perceptually interpolated in the combined expression of singing voice and body motion.

被験者：男性１１名女性９名の２０名（２０代〜３０代)
実験環境：被験者は防音室でヘッドフォンを着用し、大型ディスプレイの前で実験プログラムを実行した。実験プログラムはTcl/TK（商品名）で記述されており、被験者の知覚評価の入力受け付けと、QuickTimeTcl3.1（商品名）によるビデオ表示を行うようにした。ビデオ映像のぬいぐるみが実物大に映されるよう、大型ディスプレイに３２０×２４０画素のビデオ映像を表示させた。
実験手法：
実験１-I：表情強度の異なる姿勢のぬいぐるみを映す
無音ビデオ(g)を視聴させ、３種類の表情付け｛bak(実験１-I-aとする)，drp(実験１-I-b)，str(実験１-I-c)｝についてそれぞれ表情強度を７段階で評価する。
実験１-II：姿勢に相当させた表情強度の歌声(v)をgに加えたビデオv+gを視聴させ、３種類の表情付け｛da:bak(Ａ，実験１-II-aとする)、wh:drp(Ｂ，実験１-II-b)、we:str(Ｃ，実験１-II-c)｝（表３。以降、歌声の表情：姿勢の表情と記述する。)についてそれぞれ表情強度を以下表４の７段階で主観評価する。
［表４］
７：非常に当てはまる
６：かなり当てはまる
５：少し当てはまる
４：どちらともいえない
３：余り当てはまらない
２：ほとんど当てはまらない
１：まったく当てはまらない
各実験前に評価基準を揃えるため、評価対象の表情付きビデオと表情なしビデオを被験者に見せる(教示内容で後述)。 Subjects: 20 people (11 men and 9 women), in their 20s and 30s
Experimental environment: Subjects wore headphones in a soundproof room and ran the experiment program in front of a large display. The experiment program was written in Tcl/Tk (trade name); it accepted the subjects' perceptual ratings as input and displayed video using QuickTimeTcl 3.1 (trade name). Video of 320 × 240 pixels was shown on the large display so that the stuffed toy in the video appeared life-size.
Experimental method:
Experiment 1-I: Subjects watch a silent video (g) showing the stuffed toy in postures of differing expression intensity, and rate the expression intensity on a 7-point scale for each of the three expression types {bak (Experiment 1-I-a), drp (Experiment 1-I-b), str (Experiment 1-I-c)}.
Experiment 1-II: Subjects watch a video v+g, in which a singing voice (v) with an expression intensity matched to the posture is added to g, and subjectively rate the expression intensity on the 7-point scale of Table 4 below for each of the three expression types {da:bak (A, Experiment 1-II-a), wh:drp (B, Experiment 1-II-b), we:str (C, Experiment 1-II-c)} (Table 3; hereinafter written as singing-voice expression:posture expression).
[Table 4]
7: Applies very well
6: Applies considerably
5: Applies a little
4: Cannot say either way
3: Does not apply much
2: Hardly applies
1: Does not apply at all
Before each experiment, the subjects are shown the video with the expression to be evaluated and a video without expression, in order to align the evaluation criteria (described later under the teaching content).

実験条件(刺激ビデオ)：ぬいぐるみの姿勢と歌声の表情付けのさまざまな種類・強度による組合せでビデオを作成する。評価実験中に表情付けの名称を述べる必要があるときは、表３に準拠した｛Ａ、Ｂ、Ｃ｝のラベルを用い、形容詞など特定の印象を与えないようにした。刺激ビデオの音声と動画の組合せにおいては、約３秒の音声に対し、前後に０．２秒ずつ長い動画を組合せることにより、音声と動画が同時に再生されている印象を被験者に与えるようにした。 Experimental conditions (stimulus videos): Videos were created by combining various types and intensities of stuffed-toy posture and singing-voice expression. When the name of an expression type had to be mentioned during the evaluation experiments, the labels {A, B, C} from Table 3 were used, so that adjectives and the like would not give a particular impression. In combining audio and video for the stimuli, the video was made 0.2 seconds longer than the roughly 3-second audio at both the beginning and the end, to give the subjects the impression that audio and video were played simultaneously.

知覚補間の連続性を評価するため、７段階の表情強度で刺激ビデオを作成した。
g：表情付け毎に、０をneu、１を表情付き姿勢｛bak，drp，str｝として、７段階の表情強度の静止姿勢画像によりビデオを作成した（７段階×３種類の表情付け=２１種類、音声なし)。-実験１-I
v+g：７段階の表情強度の歌声を、gで相当する表情強度の静止画像と組合せた（７×３種類（Ａ〜Ｃ）＝２１種類）。-実験１-II
姿勢の映像は、動きを記録したセンサ値から表情強度の割り出し（解釈）、７段階｛０，０．１７，０．３３，０．５，０．６７，０．８３，１｝に該当する画像を、ぬいぐるみ１２の動きを記録した動画から切り出した。また、歌声部分は上述のモーフィングを利用して作成した、表情強度が段階的な歌声合成を利用し、７段階の表情付き歌声(モーフィング率｛０，０．１７，０．３３，０．５，０．６７，０．８３，１｝の歌声)を準備した。 To evaluate the continuity of perceptual interpolation, stimulus videos were created at seven levels of expression intensity.
g: For each expression type, taking 0 as neu and 1 as the expressive posture {bak, drp, str}, videos were created from still-posture images at seven levels of expression intensity (7 levels × 3 expression types = 21 videos, no audio). -Experiment 1-I
v+g: Singing voices at seven levels of expression intensity were combined with the still images of g at the corresponding expression intensity (7 × 3 types (A to C) = 21 videos). -Experiment 1-II
For the posture images, the expression intensity was determined (interpreted) from the sensor values recorded with the motion, and the images corresponding to the seven levels {0, 0.17, 0.33, 0.5, 0.67, 0.83, 1} were cut out from a video recording of the motion of the stuffed toy 12. For the singing voice, the graded-intensity singing voice synthesis based on the morphing described above was used to prepare expressive singing voices at seven levels (morphing rates {0, 0.17, 0.33, 0.5, 0.67, 0.83, 1}).
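The seven levels are evenly spaced steps of 1/6 rounded to two decimals, which the following Python sketch reproduces together with the count of 21 stimuli per experiment. The function and variable names are illustrative only.

```python
def intensity_levels(n_levels):
    """Evenly spaced levels from 0 to 1, rounded to two decimals as in the text."""
    return [round(k / (n_levels - 1), 2) for k in range(n_levels)]


EXPRESSION_TYPES = ["bak", "drp", "str"]   # posture expressions used in Experiment 1-I

levels = intensity_levels(7)
# One stimulus per (expression type, intensity level) pair: 7 levels x 3 types.
stimuli = [(expr, level) for expr in EXPRESSION_TYPES for level in levels]
```

The same helper with `n_levels=4` gives the four levels {0, 0.33, 0.67, 1} used in Experiment 2.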

教示内容：例として、実験１-Ｉ-aでは
1.実験前に予め刺激ビデオgのneuを視聴し、平坦であることを確認した後、刺激ビデオgのbakを視聴し、表情付けＡとすることを確認し、評価基準とすること、
2.実験ビデオを視聴し、ぬいぐるみ１２の姿勢の表情付けがＡであると思う度合いを７段階で評価すること
を教示した。実験1-Ib,cや実験1-IIも同様に、基準となる刺激ビデオを視聴した後に実験を開始することを教示した。 Teaching content: As an example, in Experiment 1-I-a the subjects were instructed
1. before the experiment, to watch neu of stimulus video g and confirm that it is flat, then to watch bak of stimulus video g and confirm that it is expression A, and to use these as the evaluation criteria;
2. to watch the experiment videos and rate on a 7-point scale the degree to which the posture expression of the stuffed toy 12 seems to be A.
For Experiments 1-I-b and 1-I-c and for Experiment 1-II, the subjects were likewise instructed to start the experiment after watching the reference stimulus videos.

実験結果：実験の結果を表情強度毎に集計した平均値と標準偏差を図１４、図１５および図１６で示す。また、線形近似におけるＲ^２値を表５に示す。 Experimental results: The means and standard deviations of the results, aggregated for each expression intensity, are shown in FIGS. 14, 15, and 16. The R^2 values of the linear approximation are shown in Table 5.

この表５によると、gは右肩上がりのグラフを描き,Ｒ^２値も高いことより、仮説IおよびIIが成立することを確かめた。 According to Table 5, g rises steadily to the right and the R^2 values are high, which confirmed that hypotheses I and II hold.
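For readers who want to see how an R^2 value of a linear approximation is obtained, a minimal Python sketch follows. The mean ratings are invented for illustration and are not the values of Table 5; the function name is also an assumption.

```python
def linear_r2(xs, ys):
    """Least-squares line y = slope * x + intercept and its coefficient of determination."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("ratings are constant, R^2 is undefined")
    return slope, intercept, 1.0 - ss_res / ss_tot


# Presented intensities and hypothetical mean ratings on the 7-point scale.
presented = [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 1.0]
mean_ratings = [1.2, 2.5, 3.4, 4.1, 5.0, 5.6, 6.1]   # made-up data, not from Table 5

slope, intercept, r2 = linear_r2(presented, mean_ratings)
```

A positive slope with R^2 close to 1 is what "rising to the right with a high R^2 value" means here: the perceived intensity follows the presented intensity nearly linearly.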

gと比較してv+gでＲ^２値が全体的に高いことに対し、モダリティの違いと表情強度の違いおよびそれらの交互作用が及ぼす影響を調べるため、２（g，v+g）＊７(表情強度)=１４条件２要因の反復測定分散分析（有意水準α＝０．０５,自由度φ＝１,６，３８）を行った。表６に検定の結果を示す。これによると、表情強度による差異はgとv+gの間でＦ＝３．４６、ｐ＝．０７となり有意傾向にとどまった他、モダリティの個数｛g, v+g｝や交互作用による有意差は認められなかった。 Because the R^2 values were generally higher for v+g than for g, a two-factor repeated-measures analysis of variance with 2 (g, v+g) × 7 (expression intensity) = 14 conditions (significance level α = 0.05; degrees of freedom φ = 1, 6, 38) was performed to examine the effects of the difference in modality, the difference in expression intensity, and their interaction. Table 6 shows the test results. The difference due to expression intensity between g and v+g was F = 3.46, p = .07, which was only a marginal tendency, and no significant difference due to the number of modalities {g, v+g} or to the interaction was found.

実験２（表情強度を組み替えたクロスモダリティ知覚効果）
実験２では、ぬいぐるみ１２による擬人的表現において歌声と姿勢の表情付け強度が互いに異なるとき、適合性および表情強度の知覚において、歌声‐姿勢間で相互作用があるかどうか調べた。実験２を行なうに際して以下のIおよびIIの仮説を立てた。
I 歌声と姿勢において、互いに対応付けて教示した強度の表情付けで、適合していると知覚される
II 知覚される表情強度は歌声と姿勢の表情強度の平均になる。
被験者：実験１と同一の被験者
実験環境：実験1と同一の環境
実験手法：
実験２-I：４段階の歌声×４段階の姿勢＝計１６種類のビデオをランダム順に視聴させ、適合性について７段階の評価を求める。これをＡ（実験２-I-aと呼ぶ）、Ｂ（実験２-I-b)、Ｃ（実験２-I-c）の組合せにおいてそれぞれ実行する。実験前に、実験１-IIと同様に評価基準となるビデオとして、表情なしビデオおよび該当する表情付きビデオの双方を見せる。 Experiment 2 (Cross-modal perception effect with recombined expression intensities)
Experiment 2 examined whether the singing voice and the posture interact in the perception of suitability and of expression intensity when their expression intensities differ from each other in anthropomorphic expression with the stuffed toy 12. The following hypotheses I and II were set up for Experiment 2.
I Singing voice and posture are perceived as matching when their expressions have the intensities that were taught as corresponding to each other.
II The perceived expression intensity is the average of the expression intensities of the singing voice and the posture.
Subjects: the same subjects as in Experiment 1
Experimental environment: the same environment as in Experiment 1
Experimental method:
Experiment 2-I: Subjects watch a total of 16 videos (4 singing-voice levels × 4 posture levels) in random order and rate suitability on a 7-point scale. This is done for each of the combinations A (called Experiment 2-I-a), B (Experiment 2-I-b), and C (Experiment 2-I-c). Before the experiment, as in Experiment 1-II, both the video without expression and the corresponding video with expression are shown as the evaluation criteria.

実験２-II：上記の実験２-Iの評価の後、評価項目を表情強度に設定し、7段階の評価を求める。これをＡ（実験２‐II-aと呼ぶ）、Ｂ（実験２-II-b）、Ｃ（実験２-II-c）の組合せにおいてそれぞれ実行する。 Experiment 2-II: After the evaluation of Experiment 2-I above, set the evaluation item to facial expression strength and obtain a seven-level evaluation. This is performed for each combination of A (referred to as Experiment 2-II-a), B (Experiment 2-II-b), and C (Experiment 2-II-c).

実験条件(刺激ビデオ)：２要素の表情強度について相互作用を検討するときに最低限必要な段階数を考慮し、歌声および姿勢の各モダリティの表情強度を４段階与え、各強度毎に組合せた。４段階の歌声×４段階の姿勢、計１６種類の表情強度を組み替えたビデオを、３種類の表情付けＡ(da:bak)、Ｂ(wh:drp)、Ｃ(we:str)について、それぞれ準備した(図１７)。身体動作の可動範囲および収録したアマチュア歌手の歌声における表情強度の下限と上限をそれぞれ０．０、１．０とし、互いに対応付けてこの実験２を行った。 Experimental conditions (stimulus videos): Considering the minimum number of levels needed to examine the interaction between the expression intensities of two factors, four levels of expression intensity were given to each of the singing-voice and posture modalities and combined across all intensities. Videos recombining the expression intensities, 16 in total (4 singing-voice levels × 4 posture levels), were prepared for each of the three expression types A (da:bak), B (wh:drp), and C (we:str) (FIG. 17). The lower and upper limits of expression intensity, for the movable range of the body motion and for the recorded singing voice of the amateur singer, were set to 0.0 and 1.0 respectively and put into correspondence with each other for Experiment 2.

歌声部分は先に説明した表情強度が段階的な歌声合成（モーフィング）を利用し、４段階の表情付き歌声(モーフィング率｛０、０．３３、０．６７、１｝の歌声)を準備した。姿勢の映像は、動きを記録したセンサ値から表情強度を割り出し、４段階｛０、０．３３、０．６７、１｝に該当する画像を動画から切り出した。 For the singing voice, the graded-intensity singing voice synthesis (morphing) described earlier was used to prepare expressive singing voices at four levels (morphing rates {0, 0.33, 0.67, 1}). For the posture images, the expression intensity was determined from the sensor values recorded with the motion, and the images corresponding to the four levels {0, 0.33, 0.67, 1} were cut out from the video.
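The 4 × 4 recombination of intensities and the random presentation order can be sketched in Python as follows. The labels A to C and the voice:posture pairs come from the text; the function names and the fixed random seed are assumptions made only so that the sketch is reproducible.

```python
import random
from itertools import product

LEVELS = [0.0, 0.33, 0.67, 1.0]
# Expression type label -> (singing-voice expression, posture expression).
PAIRS = {"A": ("da", "bak"), "B": ("wh", "drp"), "C": ("we", "str")}


def stimulus_grid(levels):
    """All (voice intensity, posture intensity) combinations: 4 x 4 = 16."""
    return list(product(levels, levels))


def presentation_order(grid, seed):
    """A shuffled copy of the grid; the seed is an assumption for reproducibility."""
    return random.Random(seed).sample(grid, len(grid))


grid = stimulus_grid(LEVELS)
matched = [(v, g) for v, g in grid if v == g]             # intensities taught as corresponding
all_stimuli = [(label, v, g) for label in PAIRS for v, g in grid]
order = presentation_order(grid, seed=0)
```

The `matched` pairs are the diagonal of the grid, the combinations that hypothesis I expects to be rated as most suitable.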

教示内容：
実験２-I-aおよび実験２-II-aでは以下の教示を行った。
1.実験前に予め歌声とぬいぐるみ１２の映像(no：neu、表情なし)を視聴し平坦であることを確認した後、(da：bak、表情付けＡ)を視聴し表情付けＡとすることを確認すること、
2.実験ビデオを視聴し、ぬいぐるみ１２の姿勢について、以下の項目を表４に示す７段階で評価すること
I.表情付けの歌声‐姿勢間の適合性（実験２-I-a）
II.表情付けがＡである度合い（実験２-II-a)
実験b、cでも同様に、評価基準となるビデオを視聴した後に、実験を開始することを教示した。 Teaching content:
In Experiment 2-I-a and Experiment 2-II-a, the subjects were instructed as follows.
1. Before the experiment, watch the video of the singing voice and the stuffed toy 12 (no:neu, no expression) and confirm that it is flat, then watch (da:bak, expression A) and confirm that it is expression A.
2. Watch the experiment videos and rate the following items on the 7-point scale of Table 4 for the posture of the stuffed toy 12:
I. Suitability between the singing-voice and posture expressions (Experiment 2-I-a)
II. Degree to which the expression is A (Experiment 2-II-a)
For Experiments b and c, the subjects were likewise instructed to start the experiment after watching the videos that serve as evaluation criteria.

適合性については、「歌声と姿勢の間の表情付けの組合せが自然かどうか」とした。 Suitability was defined as "whether the combination of the expressions of the singing voice and the posture is natural".

実験結果Ｉ[適合性の知覚]：実験２‐I‐a〜cにおける知覚評価の平均値をそれぞれ図１８、１９、２０に示す。なお、これらの図１８‐２０において、後述の図２１‐２３においても同様であるが、縦軸が評価（MOS of Appropriateness）を示し、横軸の左側が歌声の表情強度（Strength of Singing Voice Expression）を、右側が身体動作（ジェスチャ）の表情強度（Strength of Gestural Expression）をそれぞれ示している。これによると、歌声の表情強度｛０、０．３３｝のときは、姿勢の表情強度が強まると適合性が低くなり、反対に歌声の表情強度｛０．６７、１｝のときは、姿勢の表情強度が強まると適合性が高くなる様子がわかる。クロスモダリティの作用により適合性の評価値が変化したと考えられるが、必ずしも歌声と姿勢が互いに相当する強度において適合性が高いとはいえない。たとえば、歌声の表情付け強度が０．６７に対し、適合性が最も高いと判断された姿勢の強度は１であった。つまり仮説Ｉは必ずしも成り立つとは限らない。 Experimental result I [Perception of suitability]: The mean perceptual ratings in Experiments 2-I-a to 2-I-c are shown in FIGS. 18, 19, and 20, respectively. In FIGS. 18 to 20, as in FIGS. 21 to 23 described later, the vertical axis shows the rating (MOS of Appropriateness), the left side of the horizontal axis shows the expression intensity of the singing voice (Strength of Singing Voice Expression), and the right side shows the expression intensity of the body motion (gesture) (Strength of Gestural Expression). The figures show that when the expression intensity of the singing voice is {0, 0.33}, suitability falls as the expression intensity of the posture increases, whereas when the expression intensity of the singing voice is {0.67, 1}, suitability rises as the expression intensity of the posture increases. The suitability ratings appear to have changed through the action of cross-modality, but suitability is not necessarily high at the intensities at which singing voice and posture correspond to each other. For example, for a singing-voice expression intensity of 0.67, the posture intensity judged most suitable was 1. In other words, hypothesis I does not necessarily hold.

これらの結果に対し、帰無仮説を「適合性の知覚では姿勢の表情強度による差、歌声の表情強度による差、交互作用による差のいずれもない」として、２要因の反復測定分散分析(α＝０．０５、φ＝３，３，７６)を行った。その結果を表7に示す。 For these results, a two-factor repeated-measures analysis of variance (α = 0.05; φ = 3, 3, 76) was performed with the null hypothesis that "in the perception of suitability there is no difference due to the expression intensity of the posture, no difference due to the expression intensity of the singing voice, and no difference due to their interaction". The results are shown in Table 7.

検定の結果、２‐I‐aで「bak」の表情付け強度による平均の差の検定がＦ＝．１２、ｐ＝．９５となったものと、２‐I‐cで「we」の表情付け強度による平均の差の検定がＦ＝２．０３、ｐ＝．１２であった以外では全ての検定で有意差が認められた。 Significant differences were found in all tests except two: the test of the difference in means due to the expression intensity of "bak" in 2-I-a (F = .12, p = .95), and the test of the difference in means due to the expression intensity of "we" in 2-I-c (F = 2.03, p = .12).

これにより、歌声と姿勢の表情強度において、それらの交互作用が適合性に影響を及ぼすことが共通して確認された。また、歌声および姿勢の表情付け自身もそれぞれ適合性に影響を及ぼすことが、３種類の表情付けのうちそれぞれ２種類において確認された。 This confirmed, for all three expression types alike, that the interaction between the expression intensities of the singing voice and the posture affects suitability. It was also confirmed that the singing-voice expression and the posture expression each affect suitability on their own, in two of the three expression types respectively.

実験結果ＩＩ[表情付け強度の知覚]：実験２-II-a〜cにおける知覚評価の平均値を図２１，２２，２３にそれぞれ示す。それによると、姿勢の表情強度が知覚される表情強度への影響が強いことが明らかである。この結果について、帰無仮説を「表情強度の知覚では、姿勢の表情強度による差、歌声の表情強度による差、交互作用による差のいずれもない」として、２要因の反復測定分散分析（α＝０．０５，φ＝３，３，７６）を行った。 Experimental result II [Perception of expression intensity]: The mean perceptual ratings in Experiments 2-II-a to 2-II-c are shown in FIGS. 21, 22, and 23, respectively. They show clearly that the expression intensity of the posture strongly influences the perceived expression intensity. For these results, a two-factor repeated-measures analysis of variance (α = 0.05; φ = 3, 3, 76) was performed with the null hypothesis that "in the perception of expression intensity there is no difference due to the expression intensity of the posture, no difference due to the expression intensity of the singing voice, and no difference due to their interaction".

その結果を表８に示す。 The results are shown in Table 8.

姿勢の表情強度による平均の差の検定において、２-II-a（bakの強度)でＦ＝３９４．３８、２-II-b（drpの強度)でＦ＝５７５．９９、２-II-c（strの強度)でＦ＝４０３．８５となり、すべての表情付けで有意差が認められたのに対し、歌声の表情強度および交互作用による平均の差の検定ではすべて有意差はなかった。これにより、姿勢の強度が表情強度に影響を及ぼすことが確認されたが、歌声の表情付けおよび交互作用は共に影響が確認されなかった。つまり、歌声と姿勢の双方の表情強度が影響すると予想した仮説IIと結果は大幅に異なり、複合表現の表情強度の印象では、歌声の表情付けの影響よりも姿勢の表情付けの影響の方が明らかに強いことが示された。
実験３（表情強度の変化と適合性の知覚）
実際の表現における表情強度は時々刻々と変化する性質のものであることを考慮し、実験２-IIで確認された適合性について、表情強度が時間に伴い変化するときにおいても同様の結果が得られるか確認するため、実験３を行った。実験３を行なうために次の仮説を立てた。 In the tests of the difference in means due to the expression intensity of the posture, F = 394.38 for 2-II-a (bak intensity), F = 575.99 for 2-II-b (drp intensity), and F = 403.85 for 2-II-c (str intensity), so significant differences were found for all expression types, whereas none of the tests of the difference in means due to the expression intensity of the singing voice or to the interaction was significant. This confirmed that the intensity of the posture affects the perceived expression intensity, while no effect was confirmed for either the singing-voice expression or the interaction. In other words, the result differed greatly from hypothesis II, which predicted that the expression intensities of both singing voice and posture would have an effect: in the impression of expression intensity of the combined expression, the posture expression clearly had a stronger influence than the singing-voice expression.
Experiment 3 (Change in expression intensity and perception of suitability)
Considering that expression intensity in actual expression changes from moment to moment, Experiment 3 was conducted to check whether results similar to those for the suitability confirmed in Experiment 2-II are also obtained when the expression intensity changes over time. The following hypothesis was set up for Experiment 3.

仮説：歌声と姿勢の複合表現において、一方の表情強度が変化するのに対応してもう一方の表情強度が変化する時は適合性が高く、表情強度の変化の対応が反対であるときは適合性が低い。 Hypothesis: In the combined expression of singing voice and posture, suitability is high when the expression intensity of one changes in step with that of the other, and low when the changes in expression intensity run in opposite directions.

被験者：実験２と同一の被験者
実験環境：実験２と同一の環境
実験手法：表情付けの変化における歌声姿勢間の組合せを正順および逆順で準備し、ランダム順に刺激を視聴し、それぞれの適合性について７段階（表４）で評価する。 Subjects: the same subjects as in Experiment 2
Experimental environment: the same environment as in Experiment 2
Experimental method: Combinations of singing voice and posture with changing expression were prepared in forward and reverse order; subjects watched the stimuli in random order and rated the suitability of each on the 7-point scale (Table 4).

実験条件(刺激ビデオ)：表９に示すように、歌声の表情強度が０から１に変化するとき、姿勢の表情強度が正順に（０から１へ)変化する組合せと、逆順に（１から０へ）変化する組合せのビデオを準備した。順序効果を考慮し、それぞれ逆順の表情強度の変化を伴う歌声（１から０）も準備した。３種類の表情付けにおいて同様に刺激ビデオを作成した。これは図１８，１９，２０の対角線を結ぶ表情強度の変化といえる。 Experimental conditions (stimulus videos): As shown in Table 9, videos were prepared in which, while the expression intensity of the singing voice changes from 0 to 1, the expression intensity of the posture changes either in forward order (from 0 to 1) or in reverse order (from 1 to 0). To account for order effects, singing voices whose expression intensity changes in the reverse direction (from 1 to 0) were also prepared for each. Stimulus videos were created in the same way for the three expression types. These changes correspond to changes in expression intensity along the diagonals of FIGS. 18, 19, and 20.

教示内容:適合性の評価基準は実験２と同一とし、実験２と同様に７段階で適合性を評価するよう教示した。 Teaching content: The evaluation criterion for suitability was the same as in Experiment 2, and the subjects were instructed to rate suitability on a 7-point scale as in Experiment 2.

実験結果：結果を図２４に示す。図中の「-」は歌声と姿勢の時系列変化が相応していない組合せを指す。これらの結果に対し、帰無仮説を「正順と逆順の間に平均の差がない」として検定を行った結果（α＝．０５，φ＝１９）、３種類の表情付けすべてでｐ＜．０１となり仮説が棄却された。よって、表情強度の変化が伴うときも、歌声と姿勢の表情強度が正順で対応しているほうが、逆順と比べ適合性が高いことが示された。これによって、動的な変化を伴う表情強度においても、適合性の知覚に対してクロスモダリティによる影響があることがわかった。 Experimental results: The results are shown in FIG. 24. In the figure, "-" denotes a combination in which the time-series changes of the singing voice and the posture do not correspond. For these results, a test was performed with the null hypothesis that "there is no difference in means between forward and reverse order" (α = .05, φ = 19); p < .01 for all three expression types, so the null hypothesis was rejected. This shows that, even when the expression intensity changes, suitability is higher when the expression intensities of the singing voice and the posture correspond in forward order than when they run in reverse order. Cross-modality was thus found to affect the perception of suitability for dynamically changing expression intensity as well.
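The text does not name the test that was used. A paired t-test over the 20 subjects has φ = 19 degrees of freedom, which is consistent with the reported value, so the following Python sketch assumes that test. The ratings are invented for illustration, and the critical value 2.093 is the standard two-sided 5% point of the t distribution with 19 degrees of freedom.

```python
import math


def paired_t(forward, reverse):
    """Paired t statistic and degrees of freedom for two equal-length rating lists."""
    if len(forward) != len(reverse):
        raise ValueError("each subject needs one rating per condition")
    diffs = [f - r for f, r in zip(forward, reverse)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    t = mean / math.sqrt(variance / n)
    return t, n - 1


T_CRIT_05_DF19 = 2.093   # two-sided critical value for alpha = .05, 19 degrees of freedom

# Hypothetical suitability ratings from 20 subjects (not data from FIG. 24).
forward_ratings = [5] * 10 + [6] * 10
reverse_ratings = [3] * 20

t_value, dof = paired_t(forward_ratings, reverse_ratings)
reject_null = abs(t_value) > T_CRIT_05_DF19
```

With these made-up ratings the forward-order condition is rated higher by every subject, so the null hypothesis of equal means is rejected, mirroring the direction of the reported result.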

以上の実験１、実験２および実験３の分析結果から明らかになったことを整理しながら複合的な分析を行い、擬人的表現における歌声身体動作(姿勢)間のクロスモダリティについて考察する。
表情強度の知覚のモダリティ間バランス
まず、実験１-Iの結果により、身体動作の姿勢提示における表情強度に対し、順序性を保って表情強度が知覚されることが確認された。また、歌声の表情強度が補間されることが既に分かっている手法である音声モーフィングを用いた身体動作との複合表現に対しても、表情強度の提示に対し順序性を保って表情強度が知覚された(実験１-II)。つまりこれらによって、クロスモダリティの検討の前提として、各モダリティまたは複合表現の表情強度が知覚されることを確認した。 A combined analysis is carried out below, organizing what the analyses of Experiments 1, 2, and 3 have revealed, and the cross-modality between singing voice and body motion (posture) in anthropomorphic expression is discussed.
Balance between modalities in the perception of expression intensity
First, the results of Experiment 1-I confirmed that expression intensity is perceived with its ordering preserved relative to the expression intensity presented through the posture of the body motion. Likewise, for the combined expression with body motion using voice morphing, a technique already known to interpolate the expression intensity of the singing voice, expression intensity was perceived with its ordering preserved relative to the presented intensity (Experiment 1-II). These results confirm, as a premise for examining cross-modality, that the expression intensity of each modality and of the combined expression is perceived.

次に、歌声と身体動作の各影響を考える。実験１の知覚曲線における線形近似のＲ^２値は、gよりもv+gが高かった（表５）。実験１の検定においてモダリティの違いによる知覚の差は認められなかったものの、図６のシステム１０では、図２５に示すように、歌声の知覚における表情強度がシグモイド関数に近似されていることを考慮すると、姿勢による表情強度の知覚曲線が、複合表現に比べ上に膨らむ円弧のような形状であることより、歌声の表情強度が影響した可能性も否定できない。 Next, consider the effects of singing voice and body movement. The R ² value of linear approximation in the perception curve of Experiment 1 was higher in v + g than in g (Table 5). Although the difference in perception due to the difference in modality was not recognized in the test of Experiment 1, the system 10 in FIG. 6 considers that the expression intensity in the perception of singing voice is approximated to a sigmoid function as shown in FIG. Then, the perception curve of the expression intensity depending on the posture has an arc-like shape that swells upward compared to the composite expression, and thus the possibility that the expression intensity of the singing voice has influenced cannot be denied.

しかし、実験２-IIにおける表情強度の傾向を見ると、歌声の表情強度の影響もクロスモダリティの効果も有意ではなく、姿勢の表情強度の影響が大きい。言い換えると、実施例で単純化を図り設定した擬人的表現のビデオ提示刺激における、実験で用いた構成と表情付けでは、視覚的な表情強度が知覚に与える影響は、聴覚的な表情強度に比べ大きいことが観察された。したがって、モダリティ間で表情強度のバランスが異なっていたといえる。原因としては擬人的媒体における視聴覚の表情強度に対する知覚感度が本質的に異なるか、上述の実験において扱ったアマチュア歌手の声色とぬいぐるみの姿勢の間における表情強度のバランスが特有である可能性がそれぞれ推測される。 However, the tendencies of expression intensity in Experiment 2-II show that neither the influence of the singing voice's expression intensity nor the cross-modal effect is significant, while the influence of the posture's expression intensity is large. In other words, for the video stimuli of anthropomorphic expression that were simplified and set up in the embodiment, and with the configuration and expression types used in the experiments, visual expression intensity was observed to affect perception more than auditory expression intensity. It can therefore be said that the balance of expression intensity differed between the modalities. Two possible causes are conjectured: perceptual sensitivity to audiovisual expression intensity in an anthropomorphic medium may differ inherently, or the balance of expression intensity between the amateur singer's voice quality and the stuffed toy's posture used in these experiments may be specific to them.

In view of this, to balance the influence of visual expression intensity against that of auditory expression intensity, it is considered necessary to adjust the range of expression intensity on the premise of the cross-modal effect, so that the expression intensities of the modalities are balanced.

Balance between modalities in the perception of compatibility

Whereas the perception of expression intensity is strongly influenced by posture, the test in Experiment 2-I found a significant singing voice-posture interaction, which shows that the combination of singing-voice and posture expression intensities affects the perception of compatibility. Experiment 3 further showed that this combination affects compatibility even when the intensities change over time. In other words, for the video stimuli of anthropomorphic expression set up in the embodiment, the expression of body motion is perceived preferentially in terms of expression intensity and the expression intensity of the singing voice plays no part, but for compatibility the combination with the singing voice becomes important. This indicates that a cross-modal influence between body motion and singing voice exists in anthropomorphic expression.

However, in Experiment 2-I compatibility was not necessarily high for every combination in which singing voice and posture had corresponding expression intensities. When the singing-voice expression intensity was intermediate (neither 0 nor 1), the posture judged most compatible was whichever of 0 or 1 was closer. Although the expression intensity of the singing voice is approximated by a sigmoid function, this result suggests that intermediate expression intensities such as 0.33 and 0.67 may have been perceived in exaggerated form.

To raise compatibility at corresponding expression intensities across the modalities in this configuration, it is therefore considered effective to correct the expression intensity curve of the singing voice, for example by setting the morphing intervals with an inverse sigmoid function.

In addition, when the range of variation of singing-voice expression intensity is adjusted as discussed above, the range should not simply be scaled linearly. Scaling it while applying the above correction of the expression intensity curve balances singing voice and body motion in both expression intensity and compatibility, so that a harmonious and natural expression can be realized.

Based on the findings obtained from these experimental results, a function f(x) is defined here as the combination of body-motion expression with singing voice, in order to set the combination of singing-voice expression intensity and gesture (posture, body motion) expression intensity. Using this function f(x), either (1) the calculation that determines the expression intensity of the voice is performed with the inverse function f'(x) of the defined combination, or (2) the expression intensity of the body motion is determined as the output of the defined function f(x). In the equations below, e is the base of the natural logarithm, a, b, c, d and f are constants, and x is a variable.

In setting the relationship mapping in the body motion-singing voice expression table 28 of the system 10 of FIG. 6, suppose first that, in the ideal state, compatibility is high when singing voice and posture have the same expression intensity. The following expression (1) can then be considered.
[Equation 1]
$$a - \sqrt{\operatorname{power}(v, g)} \qquad (1)$$
Here, a is an initial value, for example 7, v is the expression intensity of the singing voice, and g is the expression intensity of the gesture (body motion).

However, as Experiment 2-I showed, at present the body motions judged most compatible with intermediate singing-voice expression intensities (0.33 and 0.67) were those of intensity 0 and 1, respectively. That is, intermediate singing-voice expression is perceived as exaggerated relative to body motion. Therefore, to match singing-voice and body-motion expression intensities in the system 10 of FIG. 6, in which the singing-voice expression intensity (morphing rate) is corrected by an inverse sigmoid function, changes of expression should be conveyed weakly at intermediate singing-voice intensities and strongly near 0 or 1. In the experiments using the FIG. 6 embodiment, the influence of the singing voice on perceived expression intensity was clearly smaller than that of body-motion expression intensity. Narrowing the range of variation of the body motion, or widening that of the singing voice, is therefore expected to balance the expression intensities of the two and allow their expressions to be matched well.

Therefore, instead of the 1:1 mapping shown by line A in FIG. 26, mapping is performed with the sigmoid function shown by line B in FIG. 26. In FIG. 26, the horizontal axis represents the expression intensity of the singing voice and the vertical axis represents the expression intensity of the body motion. The sigmoid function of line B in FIG. 26 is given by equation (2).
[Equation 2]
$$f(x) = \frac{1}{1 + e^{ax + b}} \qquad (2)$$
Here, e is the base of the natural logarithm and a and b are constants, for example a = -8 and b = 4.
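As a rough illustration of equation (2), the following Python sketch evaluates the sigmoid mapping with the example constants a = -8 and b = 4. The function name `sigmoid_mapping` and the sampled intensities are assumptions made for illustration and are not part of the embodiment.

```python
import math


def sigmoid_mapping(x, a=-8.0, b=4.0):
    """Equation (2): body-motion intensity for singing-voice intensity x.

    The default a and b are the example constants given in the text.
    """
    return 1.0 / (1.0 + math.exp(a * x + b))


if __name__ == "__main__":
    # Sample the intensities used in the experiments (0, 0.33, 0.67, 1)
    # plus the midpoint, and print the mapped body-motion intensity.
    for x in (0.0, 0.33, 0.5, 0.67, 1.0):
        print(f"voice {x:.2f} -> body {sigmoid_mapping(x):.3f}")
```

With these constants, the intermediate singing-voice intensities 0.33 and 0.67 map to body-motion intensities of roughly 0.20 and 0.80, that is, nearer to 0 and 1 than the 1:1 mapping of line A would place them.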

Furthermore, the sigmoid function of line B in FIG. 26 may be multiplied by c (0 < c < 1), as shown by line C in the same figure. In this case, the expression intensity of the singing voice can be suppressed further relative to that of the body motion. Line C in FIG. 26 is given by equation (3).
[Equation 3]
$$f(x) = \frac{c}{1 + e^{ax + b}} \qquad (3)$$
As an example, the multiplier c is set to c = 0.4.

Furthermore, to match the expression intensity of the singing voice with that of the body motion (posture), the position at which the expression intensity starts to change may be adjusted. For example, when a singing voice with no expression at all does not exist, or such a singing voice cannot be synthesized, the singing voice already carries expression at a morphing rate of 0. The body motion corresponding to that morphing rate (expression intensity) is therefore started from a predetermined constant value d (d > 0, for example 0.2) (line D in FIG. 26). Line D is given by equation (4).
[Equation 4]
$$f(x) = \frac{c}{1 + e^{ax + b}} + d \qquad (4)$$
Conversely, when a body motion with no expression cannot be produced, for example because the range of motion is restricted, a further constant f may be introduced and a function f(x) such as that shown by line E in FIG. 26, given by equation (5), may be used.
[Equation 5]
$$f(x) = \frac{c}{1 + e^{(a - f)x + b + f}} \qquad (5)$$
As an example, to place body-motion expression intensity 0 at voice expression intensity 0.2, the calculation can be made with, for example, f = approximately 2 (this f can be obtained by back-calculation from 0.2 and the constants above), which moves the starting point of the voice expression intensity up.
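The variants in equations (3) to (5) and the back-calculation of f can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the single function `matching` that combines c, d and f, and the helper `shift_constant`, are hypothetical names introduced here; the text gives equations (4) and (5) separately and does not state that d and f are used together.

```python
import math

A, B = -8.0, 4.0  # example constants a and b from equation (2)


def matching(x, c=1.0, d=0.0, f=0.0, a=A, b=B):
    """c / (1 + e^((a - f) x + b + f)) + d.

    c=1, d=0, f=0 gives equation (2); c<1 gives (3); d>0 gives (4);
    f != 0 with d=0 gives (5). Combining d and f is an assumption.
    """
    return c / (1.0 + math.exp((a - f) * x + b + f)) + d


def shift_constant(x0, a=A):
    """Back-calculate f so the curve at x0 equals the unshifted curve at 0.

    Solves (a - f) * x0 + b + f = b for f, i.e. f = -a * x0 / (1 - x0).
    """
    return -a * x0 / (1.0 - x0)


if __name__ == "__main__":
    f = shift_constant(0.2)
    print("f for a start position of 0.2:", f)
    print("line C at x = 0.5:", matching(0.5, c=0.4))
    print("line D at x = 0.0:", matching(0.0, c=0.4, d=0.2))
    print("line E at x = 0.2:", matching(0.2, c=0.4, f=f))
```

Under these assumptions, solving for the start position 0.2 with a = -8 gives f = 2, which agrees with the value of about 2 mentioned in the text.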

Which of equations (2) to (5) is applied may be fixed in the system 10, or the equation applied, that is, the function, may be changed as time passes.

In this way, the coefficients c, d and f for matching the expression intensity of the body motion to the expression intensity of the singing voice are determined. With c = 1 and d = f = 0, the function is the same as equation (2).

More specifically, when the body motion is of a kind whose expression intensity is perceived linearly, the morphing rate for the singing-voice expression can be determined, so as to match the expression of the singing voice (voice), by applying to the desired intensity value the inverse function f'(x) of the function f(x) set by one of the above equations. In the case of equation (5), for example, the inverse function of equation (6) below applies, where the constant f is the one introduced in equation (5). For reference, this inverse function corresponds to line F in FIG. 26.
[Equation 6]
$$f'(x) = \frac{\ln\left(\frac{c}{x} - 1\right) - (b + f)}{a - f} \qquad (6)$$
When the expression intensity of the body motion is not perceived linearly, that is, when the body motion carries a specific meaning, such as a banzai cheer, suppose the perceived intensity follows a logarithmic function such as log(m*x + n) (with, for example, m = 1.5 and n = 1). Changing the way the body moves causes problems, such as the motion no longer appearing smooth, so the matching has to be adjusted through the expression intensity of the voice. Specifically, letting g(x) be this perceived-intensity function of the body motion and f(x) the matching function that has been set, for example equation (4), a perceptually matched voice expression can be obtained by using f'(g(x)).
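The inverse function of equation (6) and the composition f'(g(x)) can be sketched in Python as follows. Several points are assumptions made for illustration: the function names are hypothetical, "log" in g(x) is taken to be the natural logarithm, the constant of equation (5) is written `f` throughout, and the clamping in `morphing_rate` is added only because equation (6) is defined for values strictly between 0 and c while g(0) = 0.

```python
import math


def matching(x, a=-8.0, b=4.0, c=1.0, f=0.0):
    """Equation (5): c / (1 + e^((a - f) x + b + f))."""
    return c / (1.0 + math.exp((a - f) * x + b + f))


def inverse_matching(y, a=-8.0, b=4.0, c=1.0, f=0.0):
    """Equation (6): inverse of equation (5), defined for 0 < y < c."""
    if not 0.0 < y < c:
        raise ValueError("y must lie strictly between 0 and c")
    return (math.log(c / y - 1.0) - (b + f)) / (a - f)


def perceived_body_intensity(x, m=1.5, n=1.0):
    """Example perceptual curve g(x) = log(m * x + n) (natural log assumed)."""
    return math.log(m * x + n)


def morphing_rate(x, eps=1e-6, **params):
    """f'(g(x)), clamped to [0, 1]; the clamping is an assumption."""
    c = params.get("c", 1.0)
    y = min(max(perceived_body_intensity(x), eps), c - eps)
    return min(max(inverse_matching(y, **params), 0.0), 1.0)


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"body {x:.2f} -> morphing rate {morphing_rate(x):.3f}")
```

A round trip such as `inverse_matching(matching(0.3))` returns 0.3 up to rounding error, which is a quick check that equation (6) inverts equation (5).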

In this way, to match the expression intensity of the body motion and that of the singing voice (voice) visually, the morphing rate is determined according to one of the matching functions f(x) of equations (2) to (6). This matching function may be set in advance in the body motion-singing voice expression table 28 shown in FIG. 6 and used in step S15 of FIG. 13.

In the embodiment described above, the stuffed toy 12 and the glove-type sensor 14 are used as the body motion input means. This embodiment is considered effective when, for example, a performer teaches children what kind of bodily expression accompanies musical or singing-voice expression, or when someone takes on the role of another character, as in a puppet show, and performs body-motion expression and voice expression at the same time.

Assuming the situation shown in FIG. 27: i) A plays the singing voice using the hand puppet interface 12 (14) and experiences the voice expression changing with the gestures that A's own hand movements give the stuffed toy. ii) B not only hears the singing voice change during the performance, but can also confirm that the singing-voice expression changes through interaction such as touching the stuffed toy and changing its shape from the outside. iii) C, like A and B, can listen as an audience member to the singing voice being performed with naturally continuous changes of expression as the motion of the stuffed toy gradually changes. iv) Following on from ii), A receives the sensation of B's touch from inside the stuffed toy and at the same time perceives the change in singing-voice expression.

Although the embodiment described above provides these various advantages, the means for inputting body motion is not limited to the hand puppet or stuffed toy 12.

The glove-type sensor 14 may be used alone, without the stuffed toy 12. In that case, however, effects that the stuffed toy provides, such as a soothing effect, cannot be expected.

In another embodiment of the present invention shown in FIG. 28, cameras 361, 362 and 363 are used as the body motion input means. The cameras 361-363 capture three-dimensional images of the whole body of the subject or user 38 from the front, the side and above. The camera signals from the cameras 361-363 are converted into image or video data by the A/D converter 40 and input to the computer 22. The computer 22 refers to the interpretation table 24A and, mainly using pattern recognition techniques, determines the intensity of expression of the body motion performed by the subject (user) 38 at that time. Based on that intensity, the computer refers to the body motion-singing voice expression table 28A and determines the morphing rate according to FIG. 12. As in the previous embodiment, the matching function for visually matching the expression intensity of the body motion and that of the singing voice (voice) may be set in advance in the body motion-singing voice expression table 28A shown in FIG. 28 and used in step S15 of FIG. 13.

In FIG. 28, voice morphing can thus be executed by body motion using the user's whole body. The expression attaching voice generator 10 of this embodiment can therefore be used, for example, in connection with dance and music.

In still another embodiment of the present invention shown in FIG. 29, a single camera 36A is used as the body motion input means. The camera 36A captures two-dimensional images of the face 38A of the subject or user from the front. The camera signal from the camera 36A is converted into image data by the A/D converter 40 and input to the computer 22. The computer 22 refers to the interpretation table 24B and, mainly using pattern recognition techniques, identifies the expression of the face 38A of the subject (user) at that time as a gesture. That is, in this embodiment the expression of the user's face 38A can be used as the body motion. Based on that body motion (facial expression), the computer refers to the body motion-singing voice expression table 28B and, in step S15 of FIG. 13, determines the morphing rate of FIG. 12 according to the matching function of equations (2) to (6).

In FIG. 29, voice morphing can thus be executed by body motion using the user's face. The expression attaching voice generator 10 of this embodiment can therefore also be used effectively by, for example, a sick person lying in bed.

In a further embodiment of the present invention shown in FIG. 30, a robot 42 is used. Such a robot 42 can express various emotions (anger, sadness and so on) by raising or bending its arms, and also by its face. For such emotional expression, emotion information is supplied to the computer 22 through a connector, for example from an external computer (not shown). Based on this emotion information, the output table 44 provides control signals for the individual actuators of the robot 42. The actuators rotate or otherwise move in response to the control signals, so that the robot 42 as a whole can express the emotion.

In the embodiment of FIG. 30, the emotion information (control signal) supplied from outside as described above is used as the gesture signal, and the gesture is identified from it. The body motion-singing voice expression table 28C is then referred to and, in step S15 of FIG. 13, the morphing rate of FIG. 12 is determined from the emotion information, that is, the body motion, according to the matching function of equations (2) to (6).

In FIG. 30, voice morphing can thus be executed by the emotion of the robot 42 or by the body motion that expresses it. The expression attaching voice generator 10 of the FIG. 30 embodiment can therefore be used in robots intended for communication with humans, such as communication robots.

In the embodiment of FIG. 30, the emotion information need not be input to the computer 22 from outside. The control signals that the computer 22 generates internally to control the robot 42 may be used as they are or in modified form.

FIG. 1 is an illustrative view showing one example of a stuffed toy used for inputting a gesture (posture expression intensity) in one embodiment of the present invention.
FIG. 2 is an illustrative view showing a state in which a hand (glove) is inserted into the stuffed toy.
FIG. 3 is an illustrative view showing one example of the back-of-hand side of the glove-type sensor.
FIG. 4 is an illustrative view showing one example of the palm side of the glove-type sensor.
FIG. 5 is an illustrative view showing the positional relationship between the thumb part of the glove-type sensor and the thumb first bending sensor and thumb second bending sensor.
FIG. 6 is a block diagram showing one embodiment of the present invention.
FIG. 7 is an illustrative view showing part of the interpretation table of the FIG. 6 embodiment.
FIG. 8 is an illustrative view showing the state of the hand puppet (stuffed toy) for "neu" (no expression).
FIG. 9 is an illustrative view showing the state when the hand puppet performs the "bak" (bending backward) motion.
FIG. 10 is an illustrative view showing the state when the hand puppet performs the "drp" (drooping) motion.
FIG. 11 is an illustrative view showing the state when the hand puppet performs the "str" (stretching forward) motion.
FIG. 12 is an illustrative view showing a method of setting the morphing rate when morphing three original voices.
FIG. 13 is a flowchart showing one example of the operation of the FIG. 6 embodiment.
FIG. 14 is a graph showing the mean and standard deviation, aggregated per expression intensity, of the results of Experiment 1-II (perceptual interpolation effect of expression intensity) conducted with the FIG. 6 embodiment for the "da;dak" expression.
FIG. 15 is a graph showing the mean and standard deviation, aggregated per expression intensity, of the results of Experiment 1-II (perceptual interpolation effect of expression intensity) conducted with the FIG. 6 embodiment for the "wh;drp" expression.
FIG. 16 is a graph showing the mean and standard deviation, aggregated per expression intensity, of the results of Experiment 1-II (perceptual interpolation effect of expression intensity) conducted with the FIG. 6 embodiment for the "we;str" expression.
FIG. 17 is a graph showing the experimental conditions for conducting Experiment 2-I (perceptual effect of cross-modality with recombined expression intensities) with the FIG. 6 embodiment.
FIG. 18 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-I conducted with the FIG. 6 embodiment for the "da;dak" expression.
FIG. 19 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-I conducted with the FIG. 6 embodiment for the "wh;drp" expression.
FIG. 20 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-I conducted with the FIG. 6 embodiment for the "we;str" expression.
FIG. 21 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-II (perception of expression intensity) conducted with the FIG. 6 embodiment for the "da;dak" expression.
FIG. 22 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-II (perception of expression intensity) conducted with the FIG. 6 embodiment for the "wh;drp" expression.
FIG. 23 is a graph showing the mean, aggregated per expression intensity, of the results of Experiment 2-II (perception of expression intensity) conducted with the FIG. 6 embodiment for the "we;str" expression.
FIG. 24 is a graph showing the compatibility of combinations of changing expression intensities in Experiment 3 (change of expression intensity and perception of compatibility) conducted with the FIG. 6 embodiment.
FIG. 25 is a graph showing that voice morphing is executed according to an inverse sigmoid function in the FIG. 6 embodiment.
FIG. 26 is a graph showing examples of the function f(x) set in the body motion-singing voice table of FIG. 6 according to the results of the experiments conducted with the FIG. 6 embodiment.
FIG. 27 is an illustrative view for explaining the effects or advantages of using the hand puppet (stuffed toy) shown in FIG. 1 as an interface.
FIG. 28 is a block diagram showing another embodiment of the present invention.
FIG. 29 is a block diagram showing still another embodiment of the present invention.
FIG. 30 is a block diagram showing a further embodiment of the present invention.

Explanation of symbols

10 ... Expression attaching voice generator
12 ... Stuffed toy
14 ... Glove-type sensor
161a ... Thumb first bending sensor
161b ... Thumb second bending sensor
162a ... Index finger first bending sensor
162b ... Index finger second bending sensor
163a ... Middle finger first bending sensor
163b ... Middle finger second bending sensor
164 ... Wrist bending sensor
181, 182 ... Pressure sensor
22 ... Computer
24, 24A, 24B ... Interpretation table
26 ... Singing voice database
28, 28A, 28B, 28C ... Body motion-singing voice expression table (mapping table)
30 ... Speech synthesis unit
361-363, 36A ... Camera
38 ... Subject
42 ... Robot
44 ... Output table

Claims

1. An expression attaching voice generator that generates voice with an expression in accordance with the expression of a body motion input by body motion input means for inputting body motion, comprising:
a voice signal database that stores in advance respective voice signals of at least two voices with different expressions;
setting means for setting an intensity of voice expression that matches the intensity of expression of the body motion;
morphing means for morphing two or more voice signals read from the voice signal database at a morphing rate according to the intensity of voice expression set by the setting means; and
voice output means for outputting voice based on the voice signal resulting from the morphing by the morphing means.

2. The expression attaching voice generator according to claim 1, wherein the morphing means morphs the two or more voice signals according to an inverse sigmoid function, and
the setting means sets the intensity of voice expression according to a sigmoid function or a modified function thereof.