JP5602753B2

JP5602753B2 - A toy showing bonding behavior

Info

Publication number: JP5602753B2
Application number: JP2011538069A
Authority: JP
Inventors: ヨハン・アダム・ドゥ・プレーズ; ルートヴィヒ・カール・シュワルト
Original assignee: Stellenbosch University
Current assignee: Stellenbosch University
Priority date: 2009-03-05
Filing date: 2009-11-27
Publication date: 2014-10-08
Anticipated expiration: 2029-11-27
Also published as: JP2012519501A

Description

The present invention relates to an interactive toy. More specifically, it relates to a doll that exhibits bonding behavior toward humans. Bonding behavior here means behavior that mimics the interaction that occurs naturally between a parent and a child. The invention also covers a method of simulating bonding behavior of a toy toward one or more people.

Toys, especially dolls, have been owned by people around the world for hundreds of years. Children play with dolls, learn interpersonal skills through them, and at times draw comfort from them. Children, especially young children, often form strong bonds with their dolls, and this is part of the child's development. Adults also own dolls for many reasons; a doll can become a collector's item because of its aesthetic qualities or the owner's emotional attachment.

Technological progress over the years has made dolls increasingly sophisticated and, indeed, lifelike. The inventors are aware of dolls that can simulate a limited range of human behavior, such as crying, sleeping, and talking, as well as dolls that simulate bodily functions such as eating and excreting. The inventors are also aware that microphones, acoustic transducers, motion actuators, and other electronic devices have been incorporated into dolls.

For example, the doll disclosed in US Patent Application No. US 2007/0128979 (an interactive high-technology doll) has human-like facial expressions and recognizes certain words spoken by a person. It can also carry on a limited conversation with a living person based on predetermined question-and-answer scenarios.
The doll recognizes spoken words using voice and speech recognition technology controlled by a processor built into the doll. This allows the doll to be trained to identify a particular person's voice and to associate a specific role (such as mother) with that person.
The doll has motion actuators inside its face. Simultaneously with (or independently of) its speech, the eyes, mouth, and cheeks form predetermined facial expressions that simulate human emotions.
The ability to hold a limited conversation is based on basic voice and speech recognition techniques that are widely known in the field. In each scenario the doll asks a pre-recorded question and waits for a specific answer. If it receives the expected answer, it responds cheerfully; if the answer is not the expected one, its response is somewhat grumpy.
However, nothing in that application's specification suggests that the doll has any long-term learning ability. Rather, its behavior appears to be governed by a state machine that responds mainly to the current user's input and a built-in clock.

An object of the present invention is to provide an interactive toy, and more specifically a doll that can simulate bonding behavior toward humans, which is an improvement over the prior art outlined above.

Means for Solving the Problems and Effects of the Invention

The toy provided by the present invention comprises a body that includes at least one input sensor for receiving input from a human user, at least one output device through which the toy interacts with the user, a processor in communication with the input sensor and the output device, and a memory in communication with the processor.
The toy is characterized in that the processor is programmed to classify each received input as positive or negative, to adjust an accumulated input stored in the memory according to that classification, and to send to the output device a control signal that depends on the accumulated input. As a result, the toy exhibits stronger bonding behavior in response to a series of predominantly positive inputs over a period of time, and weaker bonding behavior in response to a series of predominantly negative inputs over a period of time.
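The classify, adjust, and respond loop described in this passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the accumulated input is collapsed to a single number in [0, 1], and `BondState`, `update`, `control_signal`, and the `rate` parameter are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class BondState:
    """Accumulated input, reduced here to one bond level in [0, 1] (an assumption)."""
    level: float = 0.0

    def update(self, degree: float, rate: float = 0.1) -> float:
        """Adjust the bond level by a signed degree (> 0 positive, < 0 negative input)."""
        self.level = min(1.0, max(0.0, self.level + rate * degree))
        return self.level


def control_signal(state: BondState) -> float:
    """Control signal for the output device; in this sketch it is just the bond level."""
    return state.level


state = BondState()
for degree in [1.0, 1.0, 0.5, -1.0]:  # three positive inputs, then one negative
    state.update(degree)
print(round(control_signal(state), 2))
```

A run of mostly positive inputs raises the level and a run of mostly negative inputs lowers it, which is the behavior the passage describes.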

Further features of the invention are as follows.
The input received from the human user corresponds to interaction between the human and the toy in the form of one or more of sound, motion, and image.
The processor classifies sounds associated with shouting and motion associated with physical abuse as negative inputs.
The toy includes at least two input sensors: the first is a microphone that detects voice and its amplitude, and the second is an accelerometer that detects motion of the toy and its acceleration.
The accumulated input represents, at least to some extent, the voice of the toy's preferred user.
The processor is programmed to determine the degree of similarity between a voice input received by the microphone and the accumulated input.
The accumulated input is adjusted to represent the user more strongly when the received input is classified as positive; when the degree of similarity is low or the received input is classified as negative, it is adjusted to represent the preferred user less strongly or is left unchanged.
The processor is programmed to classify voice input with an amplitude above a predetermined maximum amplitude as negative and voice input with an amplitude below it as positive.
The processor is programmed to classify motion input with an acceleration above a predetermined maximum acceleration threshold as negative and motion input with an acceleration below it as positive.
The processor is programmed to determine the degree to which a received input is positive or negative, as the case may be, and to adjust the accumulated input in proportion to that degree.

Still further features of the invention are as follows.
The toy includes timing means in communication with the processor, and the processor is programmed to classify the absence of input for longer than a predetermined period as a negative input and to adjust the accumulated input accordingly so that it represents the preferred user less strongly.
The output device includes one or both of an acoustic transducer and a motion actuator. The processor is programmed to send control signals to the output device more frequently and/or of higher quality when the degree of similarity of the received voice input is high, and less frequently and/or of lower quality when the degree of similarity is low.
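One way to picture the dependence of response frequency and quality on similarity is the sketch below. The mapping, the interval formula, the three quality tiers, and the name `response_plan` are all assumptions made for illustration; the source only states that frequency and/or quality rise with similarity.

```python
def response_plan(similarity: float, base_interval_s: float = 60.0) -> dict:
    """Map a similarity in [0, 1] to a response interval and a quality tier (hypothetical)."""
    s = min(1.0, max(0.0, similarity))
    # Higher similarity gives a shorter wait between responses, i.e. more frequent output.
    interval_s = base_interval_s * (2.0 - s)
    # Higher similarity selects a richer response; the tier names are placeholders.
    if s >= 0.66:
        quality = "rich"
    elif s >= 0.33:
        quality = "plain"
    else:
        quality = "minimal"
    return {"interval_s": interval_s, "quality": quality}


print(response_plan(0.9))
print(response_plan(0.1))
```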

Still further features of the invention are as follows.
The accumulated input includes a set of features extracted from speech associated with generic background speakers. Each feature has an associated variable weight, and the weighted set of features represents the voice of the preferred user.
The variable weights associated with the features are adjusted so that the accumulated input represents the preferred user's voice more strongly or more weakly.
As the accumulated input comes to represent the voice of the current preferred user less strongly, it is adjusted to represent the voice of at least one other user more strongly. When the accumulated input represents the other user's voice more strongly than that of the current preferred user, the other user becomes the new preferred user.
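The weight adjustment and the hand-over to a new preferred user can be illustrated with a toy model. Everything here is an assumption for the sake of the example: users are reduced to fixed activation profiles over three background features, representation strength is a dot product, and `adapt`, `representation`, and `preferred` are hypothetical helpers rather than anything named in the source.

```python
def adapt(weights, activations, degree, rate=0.2):
    """Shift weights toward (degree > 0) or away from (degree < 0) the current input."""
    new = [max(0.0, w + rate * degree * a) for w, a in zip(weights, activations)]
    total = sum(new) or 1.0
    return [w / total for w in new]  # keep the weights normalized


def representation(weights, profile):
    """How strongly the weighted features represent a user (dot product, an assumption)."""
    return sum(w * p for w, p in zip(weights, profile))


def preferred(weights, profiles):
    """The preferred user is whoever the weighted features represent most strongly."""
    return max(profiles, key=lambda name: representation(weights, profiles[name]))


profiles = {"alice": [0.8, 0.2, 0.0], "bob": [0.1, 0.1, 0.8]}
weights = [0.7, 0.2, 0.1]
print(preferred(weights, profiles))   # the model starts out closest to alice
for _ in range(6):                    # bob keeps interacting positively
    weights = adapt(weights, profiles["bob"], degree=1.0)
print(preferred(weights, profiles))
```

Repeated positive input from the second user gradually moves the weights until that user is represented more strongly than the first, at which point the preferred user changes.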

The invention also provides the following method.
The method simulates bonding behavior of a toy toward a human and includes the steps of: storing, in a memory associated with the toy, an accumulated input that represents a preferred user; receiving input from a user through at least one input sensor built into the toy; classifying the received input as positive or negative; adjusting the accumulated input so that it represents the preferred user more strongly in response to positive input and less strongly in response to negative input; and, in response to the input, issuing to an output device of the toy a control signal that depends on the accumulated input.

The method is further characterized as follows.
It includes the steps of: classifying a voice input as negative when its amplitude exceeds a predetermined amplitude; classifying a motion input as negative when its acceleration exceeds a predetermined acceleration range; and classifying the absence of input for longer than a predetermined period as a negative input.
Alternatively, the method includes the step of determining the degree of similarity of the received voice input to the preferred user's voice and issuing to the output device of the toy a control signal proportional to that degree of similarity.

FIG. 1 is a schematic diagram of the internal components of a toy doll that exhibits bonding behavior toward humans, according to a first embodiment of the invention.
FIG. 2 is a schematic diagram of another embodiment of the toy doll of FIG. 1.
FIG. 3 is a flowchart showing the macroscopic behavior of a toy doll according to the invention.

FIG. 1 shows the internal functional components (10) of a toy doll (not shown) according to the first embodiment of the invention. The doll has a body (not shown), which can take any appearance, for example that of an infant, a toddler, an animal, or a toy character.
The components (10) are placed at a convenient location inside the doll. Placing them in the chest cavity of the body, for example, lets the body protect them. A convenient location on the toy's body is used to give access to those components that need periodic replacement or maintenance, such as the power supply or battery pack.

To support the required behavior, the components (10) include a CPU (12), a storage unit (16), input sensors (18), and an output device (24).
The CPU, that is, the digital central processing unit (12), includes timing means (14), which in this embodiment is a digital timer. The storage unit (16) is a non-volatile memory module. The input sensors (18) detect input and in this embodiment are a microphone (20) and an accelerometer (22). The output device (24) communicates with the user.
In this embodiment the output device includes an acoustic transducer (26) and a motion actuator (28) connected to an arm of the toy (not shown). The motion actuator (28) may be connected to any of the toy's limbs in order to control the toy's movement. The CPU (12) is connected to the input sensors (18) and the output device (24) through an input interface (30) and an output interface (32), respectively.
The input interface (30) includes an analog-to-digital (A/D) converter (34), and the output interface (32) includes a digital-to-analog (D/A) converter (36). Machine instructions in the form of software (not shown) are stored in the memory (16) or in an additional memory module (38) and drive the input interface (30), the output interface (32), and the associated A/D and D/A converters. The machine instructions also include instructions that cause the CPU to receive input through the input sensors, process the received input, and send control signals to the output device.

Additional software that governs the toy's behavior is also stored in the memory (16), together with the accumulated input variables. These variables take the form of a digital model (not shown) made up of characteristics or properties extracted from users' voices and/or behavior, including those of the current preferred user and the criteria for distinguishing the preferred user's characteristics from those of other users in general.
The accumulated input represents the current preferred user to a variable degree and is stored in the non-volatile memory module (16).
The software further includes voice and speech recognition functionality and other feature extraction software. The feature extraction software enables the processor to analyze received input and determine how closely it corresponds to the digital model of the current preferred user. This yields a degree of similarity between the received voice input and the preferred user as represented by the accumulated input.

The memory (16) also stores software that enables the CPU to analyze input detected by the input sensors (18), to classify the input as inherently positive or negative, and to assign the received input a degree of positiveness or negativeness.
If the interaction with the current user received through an input is considered positive, the input is used to learn more of the current user's properties, and the accumulated input is updated with these additional properties.
Adding further properties of the current user to the accumulated input makes the accumulated input represent the current user more strongly for as long as the input is classified as positive, so that the toy bonds more and more strongly with the current user. If the current user resembles the preferred user, the accumulated input comes to represent the preferred user ever more strongly, and the toy bonds more strongly with that user.
If, however, the current user does not resemble the preferred user, the toy becomes less bonded to the preferred user and more strongly bonded to the current user. The current user can therefore become the preferred user by continuing to interact positively with the toy.

If the interaction with the toy is considered negative, then to the extent that the current user matches the properties in the accumulated input that represent the preferred user, an unlearning process gradually reverts or degrades the accumulated input so that it represents the preferred user less and represents other users, or a generic background user, more strongly.

The degree of learning or unlearning is proportional to the degree to which the user's interaction is classified as positive or negative, as the case may be. The machine instructions (software) include thresholds for the acceleration of detected motion input and the amplitude of received voice input.
A voice received with an amplitude above the threshold is classified as negative input, corresponding to shouting or noise. Likewise, motion with an acceleration above the threshold is classified as negative input, corresponding to physical abuse, throwing, or dropping.
It is also foreseen that the software may enable the CPU (12) to identify a standard deviation in the pitch pattern of a sound input as singing, and to identify accelerations between predetermined minimum and maximum thresholds as rocking. These are interpreted as positive inputs.
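The singing and rocking cues can be sketched as simple rules. The threshold values, the treatment of unvoiced frames as pitch 0, the "neutral" outcome below the minimum acceleration, and the function names `is_singing` and `classify_motion` are assumptions for illustration; the source gives no numeric values.

```python
from statistics import pstdev


def is_singing(pitch_hz, min_std_hz=20.0):
    """Assumed rule: voiced pitch that varies enough over a segment counts as singing."""
    voiced = [p for p in pitch_hz if p > 0]  # 0 marks an unvoiced frame (assumption)
    return len(voiced) >= 2 and pstdev(voiced) >= min_std_hz


def classify_motion(accel_g, rock_min=0.05, max_g=1.5):
    """Classify a burst of accelerometer samples by its peak magnitude."""
    peak = max(abs(a) for a in accel_g)
    if peak > max_g:
        return "negative"   # abuse, throwing, or dropping
    if peak >= rock_min:
        return "positive"   # gentle motion between the thresholds, e.g. rocking
    return "neutral"        # too small to count as interaction (assumption)


print(is_singing([200, 220, 260, 240, 0, 210]))
print(classify_motion([0.1, 0.3, -0.2]))
print(classify_motion([0.2, 3.0]))
```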

As long as the user's interaction is considered positive and the current user's characteristics resemble those of the preferred user, in other words as long as the current user's voice is highly similar to the preferred user's voice as represented by the accumulated input, positive responses from the toy increase in frequency and/or quality. These positive responses are produced by commands that the CPU (12) sends to the output device (24).
Conversely, if the current user's characteristics do not match those of the preferred user, positive responses from the toy decrease in frequency and/or quality. These responses, too, are produced by commands that the CPU sends to the output device (24).

In addition to inputs such as speech and motion detected by the sensors (18), the software has the CPU (12) monitor the timer (14) to identify when there has been no interaction with the toy for longer than a predetermined time. This amounts to neglecting the toy; it is classified as negative input and affects the accumulated input accordingly, leading to unlearning of the preferred user.
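The neglect check can be sketched with explicit timestamps. The class name `NeglectTimer`, the fixed degree of -1.0, and the choice to restart the period after each report are assumptions made for this example.

```python
class NeglectTimer:
    """Reports neglect when no interaction occurs for longer than timeout_s seconds."""

    def __init__(self, timeout_s: float, start: float = 0.0):
        self.timeout_s = timeout_s
        self.last_event = start

    def interaction(self, now: float) -> None:
        """Record that a sensor detected an interaction at time `now`."""
        self.last_event = now

    def poll(self, now: float) -> float:
        """Return -1.0 (a negative input) if the toy has been neglected, else 0.0."""
        if now - self.last_event > self.timeout_s:
            self.last_event = now  # restart the period so neglect is reported once per timeout
            return -1.0
        return 0.0


timer = NeglectTimer(timeout_s=10.0)
timer.interaction(now=5.0)
print(timer.poll(now=12.0))  # 7 s since the last interaction
print(timer.poll(now=16.0))  # 11 s since the last interaction
```

Passing the time in explicitly keeps the sketch deterministic; on a device the value would come from the timer (14).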

The macro behavior of the toy is more simply explained with reference to the flowchart of FIG. 3. When an input is detected by one of the input sensors (18) (step 40), the CPU (12) classifies it as positive or negative and measures the degree of positiveness or negativeness, as the case may be. The CPU (12) also determines the degree of similarity between the voice associated with a voice input and that of the preferred user. In the figure this step is labelled "quality of match to bonded user".
If the input is classified as positive (step 42), the CPU (12) is instructed to learn or reinforce the current user's properties. It does so by making the accumulated input represent the preferred user more strongly, in proportion to the degree of positiveness of the received input (step 44). Then, in step 46, the CPU (12) sends the output device (24) an instruction that is proportional to the degree of similarity of the current user to the preferred user and to the degree of positiveness of the input.

If the input is found to be negative in step 42, the CPU (12) determines in step 48 whether the current user is also the current preferred user, or whether the input is one of neglect. If the current user is not the current preferred user and the input is not neglect, the CPU (12) again sends the output device (24), in step 46, an instruction proportional to the degree of similarity of the current user to the preferred user and to the degree of negativeness of the input.
If, however, step 48 finds that the current user is the current preferred user or that the input is neglect, the CPU (12) is instructed to unlearn the current user's properties in proportion to the degree of negativeness of the input (step 50). The CPU (12) then sends the output device (24) an instruction in proportion to the degree of similarity of the current user to the preferred user and to the degree of negativeness of the input (step 46).
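The flow of FIG. 3 as described in the text can be condensed into one function. This is a sketch under assumptions: the accumulated input is a single strength value in [0, 1], the preferred-user test is a similarity threshold (`preferred_at`), the response is the product of similarity and signed degree, and `handle_input` is a hypothetical name.

```python
def handle_input(strength, degree, similarity, is_neglect=False,
                 rate=0.1, preferred_at=0.5):
    """One pass through steps 40 to 50 for a single detected input.

    strength:   how strongly the accumulated input represents the preferred user, 0..1
    degree:     signed degree of the input (> 0 positive, < 0 negative)
    similarity: match between the current user and the preferred user, 0..1
    """
    if degree > 0:                                    # step 42: input is positive
        strength += rate * degree                     # step 44: learn or reinforce
    elif is_neglect or similarity >= preferred_at:    # step 48: preferred user or neglect
        strength -= rate * abs(degree)                # step 50: unlearn
    strength = min(1.0, max(0.0, strength))
    response = similarity * degree                    # step 46: signal sent to the output
    return strength, response


print(handle_input(0.5, 1.0, 0.8))    # positive input from a close match
print(handle_input(0.5, -1.0, 0.2))   # negative input from a stranger: no unlearning
print(handle_input(0.5, -1.0, 0.9))   # negative input from the preferred user: unlearning
```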

After completing the instruction sent to the output device in step 46, the CPU (12) waits to receive the next input or for the timer to indicate a lack of interaction.

Another embodiment of the invention is shown in FIG. 2, in which features that are the same as in the embodiment of FIG. 1 carry the same numerals. The embodiment of FIG. 2 also includes a digital central processing unit (CPU) (12), together with a digital timer (14), a storage unit (16) in the form of a non-volatile memory module, input sensors (18) for detecting input, a microphone (20), and an accelerometer (22).
This embodiment additionally includes a digital image recording device (50), which here is a digital camera. It further includes an output device (24) for communicating with the user. Here too the output device includes an acoustic transducer (26) and a motion actuator (28), which are connected to limbs of the toy (not shown).
The CPU (12) is connected to the input sensors (18) and the output device (24) through the input interface (30) and the output interface (32), respectively. The input interface (30) includes an analog-to-digital (A/D) converter (34), and the output interface (32) includes a digital-to-analog (D/A) converter (36). Machine instructions provided as software (not shown) are stored in the memory (16) or in an additional memory module (38) and drive the input interface (30), the output interface (32), and the associated A/D and D/A converters.

In this embodiment the digital camera (50) may be used to capture images of the user periodically, for example whenever interaction from the user is detected. The images may be used together with, or separately from, the voice recordings to recognize the preferred user's face. Sophisticated image recognition software is available that can be used to compare a digital image with images of the preferred user stored in the memory (16).
As described above, and as explained further below for voice recognition, image recognition software can be used to determine the degree of similarity between an image of the preferred user taken by the camera (50) and an image of the current user taken at a later stage. The control signal that the CPU (12) sends to the output device (24) then also depends on the degree of similarity between the image of the current user and that of the preferred user.

The foregoing is an overview of the toy's overall operation. A detailed analysis of the algorithm employed by the software and carried out by the CPU (12) follows.
The algorithm (which is implemented in software or hardware rather than residing in the memory (16)) runs on the CPU (12). It evaluates the interaction with the current user, modifies the internal representation of the preferred user (the accumulated input) accordingly, and determines the nature of the interaction with the user.

Input from the user (speech, in this case) is sampled when detected and made available to the CPU in digital form. The signal is then processed digitally to determine its relevant information content. Various alternatives are possible, but in this embodiment the signal is divided into consecutive 30-millisecond frames that overlap one another by 50%.
Each frame is shaped by a windowing function, and its power level and mel-frequency cepstral coefficients (MFCCs) are determined (various other decompositions, such as RASTA-PLP, can also be used). These are augmented with the pitch frequency at that point in time.
All of this information is combined into a feature vector x(n) that summarizes the speech information associated with the frame, where the index n denotes the frame number for which the vector was determined. With this information available, the signal can be segmented into silence and speech; several implementations for doing so are known.
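The framing and power computation can be sketched in plain Python. The Hamming window is an assumption (the source names only "a windowing function"), the MFCC and pitch steps are omitted, and `split_frames`, `hamming`, and `log_power_db` are hypothetical helper names.

```python
import math


def split_frames(signal, sample_rate, frame_ms=30.0, overlap=0.5):
    """Split a sampled signal into fixed-length frames with the given fractional overlap."""
    size = int(sample_rate * frame_ms / 1000)
    hop = max(1, int(size * (1 - overlap)))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def hamming(n):
    """Hamming window of length n (one common choice of windowing function)."""
    if n < 2:
        return [1.0] * n
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def log_power_db(frame):
    """Mean power of the windowed frame in decibels."""
    window = hamming(len(frame))
    energy = sum((s * w) ** 2 for s, w in zip(frame, window)) / len(frame)
    return 10 * math.log10(energy + 1e-12)  # small offset avoids log(0) on silence


rate = 8000
tone = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]  # 1 s test tone
frames = split_frames(tone, rate)
print(len(frames), len(frames[0]))
print(round(log_power_db(frames[0]), 1))
```

At 8 kHz a 30 ms frame holds 240 samples and a 50% overlap gives a hop of 120 samples; in a fuller pipeline the power value would be one component of x(n).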

Similarly, input from the accelerometer can be collected into another feature vector y(n) that summarizes the motion of the toy.

From x(n), both the signal power (amplitude) and the pitch frequency are known as functions of time. The loudness of the voice is determined directly from the signal power. If the loudness lies between a predetermined minimum threshold and a predetermined maximum threshold, the interaction is considered positive. If no speech at all is present for a predetermined length of time, the toy is considered to be ignored, and the interaction is therefore negative. Likewise, a loud voice that exceeds the maximum threshold is considered shouting and is also negative.
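The threshold rules for loudness can be illustrated with a short sketch. The function name `classify_loudness`, the threshold values in the usage lines, and the extra "neutral" outcome for short silences are assumptions made for the example; the description only fixes the positive and negative cases.

```python
def classify_loudness(power_levels, min_power, max_power, max_silent_frames):
    """Label a run of per-frame power levels as positive, negative, or neutral.

    - Any frame louder than max_power counts as shouting (negative).
    - max_silent_frames consecutive frames below min_power count as the toy
      being ignored (negative).
    - Otherwise the run is positive if any in-range speech was heard.
    """
    silent_run = 0
    heard_speech = False
    for p in power_levels:
        if p > max_power:
            return "negative"          # shouting
        if p < min_power:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return "negative"      # ignored for too long
        else:
            silent_run = 0
            heard_speech = True
    return "positive" if heard_speech else "neutral"

# Hypothetical thresholds, for illustration only.
calm = classify_loudness([0.2, 0.3, 0.25], 0.1, 0.5, 3)
shout = classify_loudness([0.2, 0.9], 0.1, 0.5, 3)
ignored = classify_loudness([0.0, 0.0, 0.0], 0.1, 0.5, 3)
```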

These aspects can be combined into a quality measure Q over a predetermined length of time. With 0 as neutral, it can be expressed as follows:

[Equation not reproduced in the source text.]

To determine the identity of a speaker, statistical models are used to describe both the target speaker and a generic background speaker. The description here refers to one specific implementation, in which speaker characteristics are modeled and then used to determine how well an unknown speech sample matches a particular speaker; it does not exclude other techniques for doing so.

The exact technique or implementation is not critical to this patent, and several candidates are available from the broad fields of speaker recognition and machine learning (pattern recognition). A support vector machine (SVM) or another popular pattern classification approach could be used in place of the one described here.

The generic background speaker is represented by a Gaussian mixture model (GMM), referred to here as the universal background model (UBM). In its simplest form, such a mixture collapses to a single Gaussian density, which greatly reduces the computational requirements. Typically, the UBM is trained collectively on speech from a large number of speakers.

The UBM is then adapted to the speech of a specific target speaker (in this embodiment, the preferred user) through a process such as Maximum a Posteriori (MAP) adaptation, Maximum-Likelihood Linear Regression (MLLR), or Maximum-Likelihood Eigendecomposition (MLED).

The trained UBM parameters form a stable initial model estimate. This estimate is then re-weighted in some manner so that it more closely resembles the characteristics of the preferred user, which yields the preferred-speaker model. This approach is described in more detail below.

With a UBM and a target speaker model available, it is possible to evaluate how closely an unknown speech segment matches the model of the preferred user. This is done by evaluating the log score of the speech segment against both the background speaker model (the UBM) and the preferred-user model (represented by the accumulated input).

The difference between these scores approximates the log-likelihood-ratio (LLR) score, and translates directly into how well the given speech matches the preferred user.

Mathematically, the LLR score of the n-th frame, s(n), is expressed as:

$$s(n) = \log f_T(x(n)) - \log f_U(x(n))$$

Here f denotes either a Gaussian or a GMM probability density function, and the subscripts T and U denote the target speaker and the UBM speaker, respectively.
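As a hedged illustration of the frame-level score, the sketch below evaluates the difference of log densities for the single-Gaussian case. The diagonal covariance, the shared variance vector for target and UBM, and the names `log_gaussian` and `llr_score` are assumptions made to keep the example short.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

def llr_score(x, target_mean, ubm_mean, var):
    """Frame score: log f_T(x) - log f_U(x), with a shared diagonal covariance."""
    return log_gaussian(x, target_mean, var) - log_gaussian(x, ubm_mean, var)

# Illustrative two-dimensional feature vector and model centers.
frame = [1.0, 1.0]
score_match = llr_score(frame, [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
score_clone = llr_score(frame, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

A frame that sits on the target center scores above zero, while a target model identical to the UBM always scores exactly zero.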

Decisions based on a single frame are unreliable. Typically, N frames are therefore collected first, with N chosen to correspond to a period of 10 to 30 seconds. The score s(X) of each such segment X is given by:

[Equation not reproduced in the source text.]

A large score indicates a high likelihood that the speech came from the preferred user (high similarity). A value of zero indicates that the speech is indistinguishable from that of a generic background speaker (low similarity). Here too there are several alternatives.

Test normalization (TNORM) is one notable example; it replaces the single UBM with a number of background speaker models.
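One plausible way to pool frame scores into a segment score is to average them; the sketch below does that. The averaging itself is an assumption (the segment-score equation is not available here), as are the names `segment_score` and `frames_per_segment` and the 15 ms step between frames.

```python
def frames_per_segment(segment_seconds, step_seconds=0.015):
    """Number of frames N covering a segment, assuming one frame per time step."""
    return int(round(segment_seconds / step_seconds))

def segment_score(frame_scores):
    """Mean of the per-frame LLR scores s(n) over one segment (assumed pooling)."""
    if not frame_scores:
        raise ValueError("segment must contain at least one frame score")
    return sum(frame_scores) / len(frame_scores)

n_short = frames_per_segment(10)   # frames in a 10-second segment
n_long = frames_per_segment(30)    # frames in a 30-second segment
pooled = segment_score([1.0, -1.0, 3.0])
```

Averaging keeps the zero point meaningful: a segment whose frames are on balance no closer to the target than to the UBM still pools to about zero.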

A multidimensional Gaussian density consists of a mean (center) vector m and a covariance matrix C. MAP adaptation of a Gaussian center vector leads to a weighted combination of the existing prior center vector and the newly observed target feature vectors, while the covariance matrix C remains unchanged.

This idea is used here so that the system can, in a computationally efficient way, learn the characteristics of recent speakers while gradually unlearning those of earlier speakers.

The adaptation of a single target Gaussian center is described first, and is then extended to Gaussian centers embedded in a GMM. Before the toy is first used, the target center is cloned from the UBM, so at this stage the preferred user is indistinguishable from a generic background speaker. Therefore,

$$m_T(-1) = m_U$$

where T denotes the target, U denotes the UBM, and n denotes the adaptation time step. Note that the target center is a function of time n, whereas the UBM center is constant. The target feature vector, derived from the user's speech, is denoted x(n).

The target center is adapted using the recursion

$$m_T(n) = \lambda\, x(n) + (1 - \lambda)\, m_T(n-1)$$

where λ is a small positive constant and n = 0, 1, 2, .... This difference equation represents a digital low-pass filter with a DC gain of 1. The smaller the value of λ, the more emphasis is placed on the existing center and the less on the newly observed feature vectors.

λ therefore effectively controls how long the system remembers past centers. The effective length of this memory can be determined by noting the time it takes for the impulse response of the filter to decay to about 10% of its original height.
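The recursive update of a single center behaves like a first-order low-pass filter, which the following sketch demonstrates. The function name `adapt_center`, the value λ = 0.01, and the two-dimensional vectors are illustrative assumptions.

```python
def adapt_center(center, x, lam):
    """One adaptation step: blend the old center with the new feature vector."""
    return [(1.0 - lam) * m + lam * xi for m, xi in zip(center, x)]

# The target center starts as a clone of the (constant) UBM center.
ubm_center = [0.0, 0.0]
target = list(ubm_center)

# Feed the same feature vector repeatedly; with a DC gain of 1 the center
# converges to that vector.
speaker_vector = [1.0, 2.0]
for _ in range(1000):
    target = adapt_center(target, speaker_vector, lam=0.01)
```

Because the two blend weights sum to 1, a constant input is reproduced exactly in the limit; a smaller λ only makes the approach slower.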

This is summarized in the following table.

Table 1: Effective memory length for different values of λ. Durations in minutes assume a time step of 15 milliseconds.
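The effective memory length can be worked out from the 10% criterion. For a filter of the form m(n) = (1 - λ) m(n-1) + λ x(n), the impulse response decays as (1 - λ)^n, so it reaches 10% of its initial height after ln(0.1) / ln(1 - λ) steps. The sketch below evaluates this; the function names are assumptions, and it does not reproduce the values of the original table.

```python
import math

def effective_memory_steps(lam, fraction=0.1):
    """Steps until the impulse response (1 - lam)**n decays to `fraction`."""
    return math.log(fraction) / math.log1p(-lam)

def effective_memory_minutes(lam, step_seconds=0.015, fraction=0.1):
    """Same quantity expressed in minutes for a given time step."""
    return effective_memory_steps(lam, fraction) * step_seconds / 60.0

minutes = effective_memory_minutes(1e-5)   # roughly 58 minutes
```

For λ = 10⁻⁵ and a 15 ms step this gives a little under an hour, in line with the figure of about one hour quoted in the description.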

Thus, for λ = 10⁻⁵, about one hour of sustained speech is required to unlearn the previous speaker and bond with a new preferred speaker. This learning rate can be adjusted according to the quality of the interaction by setting λ as follows:

[Equation not reproduced in the source text.]

A more sophisticated system uses a Gaussian mixture model (GMM), which consists of K Gaussian component models instead of the single Gaussian density described above. Given a feature vector x(n), the likelihood under the i-th Gaussian component is f_i(x(n)), and the resulting likelihood under the GMM is the weighted sum

$$f(x(n)) = \sum_{i=1}^{K} w_i\, f_i(x(n))$$

where the w_i are the mixture weights, i = 1, 2, ..., K.

When such a model is updated, the target feature vector x(n) is associated proportionally with the various Gaussian components rather than wholly with a single one. These proportionality constants are known as responsibilities and can be determined as follows:

$$r_i(n) = \frac{w_i\, f_i(x(n))}{\sum_{j=1}^{K} w_j\, f_j(x(n))}$$
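Responsibilities in the standard GMM sense are the normalized weighted component likelihoods. The sketch below computes them in the log domain for numerical stability; the diagonal covariances and the names `log_gaussian` and `responsibilities` are assumptions for the example.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

def responsibilities(x, weights, means, variances):
    """Share of feature vector x attributed to each mixture component."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)                       # subtract the peak to avoid underflow
    scaled = [math.exp(l - peak) for l in logs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Two illustrative one-dimensional components with equal weights.
weights = [0.5, 0.5]
means = [[-1.0], [1.0]]
variances = [[1.0], [1.0]]
r_mid = responsibilities([0.0], weights, means, variances)
r_left = responsibilities([-1.0], weights, means, variances)
```

A vector halfway between two equally weighted components is shared equally, while a vector on one center is attributed mostly to that component.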

GMM adaptation proceeds correspondingly, using the feature vector proportionally to update each of the Gaussian components. Writing r_i(n) for the responsibility of the i-th component for x(n), the original update recursion then changes to:

$$m_{T,i}(n) = \lambda\, r_i(n)\, x(n) + \bigl(1 - \lambda\, r_i(n)\bigr)\, m_{T,i}(n-1)$$

With this method of adaptation, the toy continues to bond with an existing user for as long as that user keeps interacting with it. If another user starts interacting with the toy, however, the memory of the original user gradually fades and is replaced by that of the new user, which is exactly the desired behavior.
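A responsibility-weighted update of the component centers might look like the sketch below. Scaling the step size λ by each component's responsibility is an assumption about the form of the recursion, and `adapt_gmm_centers` is a hypothetical name.

```python
def adapt_gmm_centers(centers, x, resp, lam):
    """Move each component center toward x in proportion to its responsibility."""
    updated = []
    for center, r in zip(centers, resp):
        gain = lam * r                     # effective step size for this component
        updated.append([(1.0 - gain) * m + gain * xi
                        for m, xi in zip(center, x)])
    return updated

# Illustrative case: the first component takes full responsibility for x.
centers = [[0.0], [10.0]]
new_centers = adapt_gmm_centers(centers, [2.0], [1.0, 0.0], lam=0.5)
```

A component with zero responsibility is left untouched, so speech that resembles only part of the model updates only that part.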

If the current preferred user neglects to interact with the toy, we want that user to fade from the toy's memory; in other words, we want the toy to unlearn the characteristics of his or her voice. This is achieved by periodically inserting the following extra feature vectors into the adaptation process:

[Equation not reproduced in the source text.]

These extra feature vectors are taken from the UBM centers. Their corresponding responsibility constants are given by:

[Equation not reproduced in the source text.]

This moves the target model away from the characteristics of the preferred user and toward the generic background speaker. However, the effect of these vectors should be far less pronounced than that of input vectors from the true target speaker.

They should therefore be inserted only after roughly every 20 time frames (or more), so that this unlearning process is about 20 times slower than the learning process. This serves two purposes.

First, the target model is continually stabilized toward the UBM, which provides a degree of robustness against spurious external noise. Second, if the user ignores the toy for a long period, the toy gradually "forgets" that user.
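The interplay between learning and slow forgetting can be sketched for the single-Gaussian case. Here one UBM-derived vector is inserted after every 20 time frames; the use of `None` to mark frames without speech, the value λ = 0.05, and the names `adapt_center` and `run_adaptation` are assumptions for the example.

```python
def adapt_center(center, x, lam):
    """One adaptation step: blend the old center with the new feature vector."""
    return [(1.0 - lam) * m + lam * xi for m, xi in zip(center, x)]

def run_adaptation(target, ubm_center, frames, lam, forget_every=20):
    """Adapt on speech frames and pull back toward the UBM every few frames.

    `frames` holds one entry per time frame; None marks a frame with no speech.
    """
    for n, x in enumerate(frames, start=1):
        if x is not None:
            target = adapt_center(target, x, lam)           # learn the speaker
        if n % forget_every == 0:
            target = adapt_center(target, ubm_center, lam)  # slow unlearning
    return target

ubm = [0.0]
# An attentive user: speech in every frame keeps the model close to the user.
learned = run_adaptation([0.0], ubm, [[1.0]] * 2000, lam=0.05)
# A neglectful user: no speech at all, so the model drifts back to the UBM.
forgotten = run_adaptation([1.0], ubm, [None] * 2000, lam=0.05)
```

Because the pull toward the UBM is applied only once per 20 frames, ongoing speech easily outweighs it, while prolonged silence lets it dominate.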

If the preferred user behaves abusively, we want to expel that user from the toy's memory immediately. The preferred user is recognized by a high identification score s(X), and the presence of abuse is indicated by a large negative value of the interaction quality Q. When both occur together, the unlearning process described above is accelerated by applying that procedure immediately with a greatly increased value, given by:

[Equation not reproduced in the source text.]

This moves the target model quickly back toward the UBM, while still taking into account the uncertainty about whether the speech actually came from the preferred speaker.

As long as the interaction is judged to be (a) positive and (b) a strong match for the preferred user, positive interaction from the toy increases in both frequency and quality. This is expressed through spoken responses from the toy, control of facial expressions where available, and movement of the limbs.

The description here relates to a particular way of detecting a quiet, smooth voice as opposed to shouting, or a soft squeeze as opposed to being thrown or dropped. Other ways of doing so, and other conceivable types of gesture, are not excluded. The detailed technique or implementation is not critical to this patent.

Furthermore, although not described here, a similar process can be devised for identifying the face of a preferred individual based on its general appearance. One approach is to measure how far the preferred face deviates from the generic face provided by the first component of an eigenface representation.

The above description is by way of example only, and numerous modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the illustrated elements, and the methods described here may be altered by substituting, reordering, or adding steps. In addition, all elements described as digital can equally be implemented with analog circuitry, provided that suitable changes are made to the toy's hardware. Accordingly, the above description does not limit the invention.

Claims

A toy provided with a main body,
At least one input sensor (18) for receiving input from a human user;
At least one output device (24) for the toy to interact with the user;
A processor (12) in communication with the input sensor (18) and the output device (24);
A memory (16) in communication with the processor (12),
characterized in that

the processor (12) is programmed to
classify each received input as positive or negative,
adjust the accumulated input stored in the memory (16) according to the classification, and
send control signals dependent on the accumulated input to the output device (24);

the accumulated input is a digital model representing one or more of the voice, image, and behavior of a preferred user of the toy;
the processor adjusts the digital model to represent the preferred user's voice, image, and/or behavior more strongly when an input is classified as positive, and less strongly when an input is classified as negative;
whereby the toy exhibits increased bonding behavior toward the preferred user in response to a series of predominantly positive inputs over a period of time, and reduced bonding behavior toward that user in response to a series of predominantly negative inputs over a period of time.

The toy according to claim 1, wherein the input received from the human user corresponds to interaction between the human and the toy in the form of one or more of sound, motion, and image.

The toy of claim 2, wherein the processor (12) is programmed to classify sounds associated with screams and actions associated with physical abuse as negative inputs.

The toy according to any one of claims 1 to 3, comprising at least first and second input sensors (18), wherein the first input sensor is a microphone (20) that detects sound and its amplitude, and the second input sensor is an accelerometer (22) that detects the motion and acceleration of the toy.

The toy according to any one of claims 1 to 4, wherein the accumulated input is, at least in part, a digital model representing the voice of the toy's preferred user.

The toy of claim 1, wherein the processor (12) is programmed to determine a degree of similarity between the received input and the accumulated input.

The toy of claim 6, wherein the processor is programmed to adjust the digital model so that the degree of preference for the preferred user is reduced or left unchanged when the degree of similarity is low or the received input is classified as negative.

The toy according to claim 2, wherein the processor (12) is programmed to classify a sound input with an amplitude above a predetermined maximum sound amplitude as a negative input, and a sound input with a lower amplitude as a positive input.

The toy according to claim 2, wherein the processor (12) is programmed to classify a motion input with an acceleration above a predetermined maximum acceleration threshold as a negative input, and a motion input with a lower acceleration as a positive input.

The toy according to any one of the preceding claims, wherein the processor (12) is programmed to determine, depending on the circumstances, the degree to which a received input is positive or negative, and to adjust the accumulated input in proportion to that degree.

The toy of claim 1, further comprising a timer (14) in communication with the processor (12), wherein the processor (12) is programmed to classify the absence of any input for a predetermined length of time or longer, as measured by the timer, as a negative input, and to adjust the digital model accordingly so that it is less representative of the voice, image, and/or behavior of the preferred user.

The toy of claim 6, wherein the output device (24) comprises one or both of an acoustic transducer (26) and a motion actuator (28), and the processor (12) is programmed to send more frequent and/or higher-quality control signals to the output device (24) when the similarity of the received input to the accumulated input is high, and less frequent and/or lower-quality control signals when that similarity is low.

The toy of claim 5, wherein the digital model comprises a set of features extracted from speech associated with a generic background speaker, each feature has a variable weight associated with it, and the set of weighted features represents the voice of the preferred user.

The toy according to claim 13, wherein the processor (12) is programmed to adjust the digital model by adjusting the variable weights associated with the features, so that the accumulated input represents the voice of the preferred user more weakly or more strongly.

The toy according to claim 13 or 14, wherein, as the digital model becomes less representative of the current preferred user's voice, it is adjusted to be more strongly representative of the voice of at least one other user, and when the digital model represents the voice of the other user more strongly than that of the current preferred user, the other user becomes the new preferred user.

A method of simulating bonding behavior of a toy toward a human, comprising the steps of:
storing accumulated input, which is a digital model representing one or more of the voice, image, and behavior of a preferred user, in a memory (16) associated with the toy;
receiving input from a user by means of at least one input sensor (18) incorporated in the toy;
classifying the received input as positive or negative;
adjusting the digital model to represent the preferred user's voice, image, and/or behavior more strongly in response to an input classified as positive, and more weakly in response to an input classified as negative; and
in response to the input, issuing a control signal dependent on the accumulated input to an output device (26) of the toy.

The method of claim 16, comprising the steps of:
classifying an input as negative when an audio input with an amplitude greater than a predetermined amplitude is received;
classifying an input as negative when a motion input with an acceleration outside a predetermined range is received; and
classifying as a negative input the absence of any input for longer than a predetermined length of time.

The method of claim 16 or 17, further comprising determining the similarity of a received voice input to the digital model of the preferred user's voice, and issuing a control signal proportional to that similarity to the output device of the toy.