JP2018001404A

JP2018001404A - Method, system and robot body for synchronizing voice and virtual operation

Info

Publication number: JP2018001404A
Application number: JP2017133168A
Authority: JP
Inventors: Nan Qiu (ナンチユウ); Haofen Wang (ハオフエンワン)
Original assignee: Shenzhen Gowild Robotics Co Ltd
Current assignee: Shenzhen Gowild Robotics Co Ltd
Priority date: 2016-07-07
Filing date: 2017-07-06
Publication date: 2018-01-11
Anticipated expiration: 2037-07-06
Also published as: CN106463118A; CN106463118B; WO2018006371A1; JP6567610B2

Abstract

PROBLEM TO BE SOLVED: To enhance the man-machine interaction experience. SOLUTION: A method for synchronizing voice and virtual motion comprises: acquiring a user's multi-mode information; generating interaction content including at least voice information and motion information on the basis of the user's multi-mode information and a life time axis; and adjusting the time length of the voice information and the time length of the motion information so that they are synchronized. Interaction content including at least voice information and motion information can thus be generated from one or more kinds of multi-mode information, such as the user's voice, expressions, and motions. Because the time lengths of the voice information and the motion information are adjusted to be the same, the robot can play back voice and motion in sync, which makes the robot more humanlike and improves the user's interaction experience with the robot. SELECTED DRAWING: Figure 1

Description

The present invention relates to the field of robot interaction technology, and more particularly to a method, a system, and a robot body for synchronizing voice and virtual motion.

Robots are used in more and more settings as tools for interacting with people. For example, an elderly person or a child who feels lonely can talk to or play with a robot. Conventional man-machine interaction technology generally supports only one interaction mode: it can only give the user a mechanical reply, or at best show a limited set of expressions along with the reply. Commercially available child companion robots offer only four or five preset expressions, and such simple expressions do not need to be synchronized with the output voice.

Users, however, expect more and more from the experience of using a robot, so robots now need the ability to interact with people in multiple modes, for example replying by voice while also showing more humanlike expressions and motions. To interact with people through two or more output channels at once, the robot must synchronize those channels: nodding while saying "yes", shaking its head while saying "no", or opening its eyes wide and pouting when angry. Only then can people have an immersive experience when interacting with a robot and feel that the thing in front of them can really converse.

If voice and motion do not match, the user's interaction experience suffers badly. Yet synchronizing the voice with virtual motions such as expressions in a virtual robot's reply is a complex problem that spans several disciplines, including robotics, psychology, and social science. The problem is therefore both pressing and difficult, and so far no system has solved it reasonably well.

An object of the present invention is to improve the man-machine interaction experience by providing a method, a system, and a robot body for synchronizing voice and virtual motion.

The object of the present invention is achieved by the following technical solutions.
A method for synchronizing voice and virtual motion, characterized by comprising:
acquiring a user's multi-mode information;
generating, based on the user's multi-mode information and a life time axis, interaction content including at least voice information and motion information; and
adjusting the time length of the voice information and the time length of the motion information so that they are synchronized.

Preferably, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically includes:
when the difference between the time length of the voice information and the time length of the motion information is less than or equal to a threshold, and the time length of the voice information is shorter than that of the motion information, increasing the playback speed of the motion information so that the time length of the motion information equals the time length of the voice information.

Preferably, when the time length of the voice information is longer than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased, so that the time length of the motion information equals the time length of the voice information.
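The two speed rules above for the small-difference case can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `PlaybackRates` and `sync_by_speed` are hypothetical, durations are assumed to be in seconds, and the sketch always changes the motion speed only, which is one of the options the text allows when the voice is longer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlaybackRates:
    voice_rate: float   # 1.0 means normal speed
    motion_rate: float  # >1.0 speeds the motion up, <1.0 slows it down


def sync_by_speed(voice_s: float, motion_s: float,
                  threshold_s: float) -> Optional[PlaybackRates]:
    """Return rates that make both streams last voice_s seconds.

    Returns None when the gap exceeds the threshold, in which case
    speed changes alone are not used (assumption of this sketch).
    """
    if voice_s <= 0 or motion_s <= 0:
        raise ValueError("durations must be positive")
    if abs(voice_s - motion_s) > threshold_s:
        return None
    # A clip of motion_s seconds played at rate r lasts motion_s / r seconds.
    # Choosing r = motion_s / voice_s makes it last exactly voice_s seconds:
    # r > 1 when the voice is shorter, r < 1 when the voice is longer.
    return PlaybackRates(voice_rate=1.0, motion_rate=motion_s / voice_s)
```

For a 2.0 s utterance and a 2.5 s motion with a 1.0 s threshold, the motion would be played at 1.25 times normal speed. Keeping the voice at normal speed is a design choice of the sketch, on the assumption that altered speech is more noticeable than altered motion.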

Preferably, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically includes:
when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, and the time length of the voice information is longer than that of the motion information, combining at least two sets of motion information in order so that the time length of the motion information equals the time length of the voice information.

Preferably, when the time length of the voice information is shorter than the time length of the motion information, some of the motions in the motion information are selected so that their combined time length equals the time length of the voice information.
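The large-difference case above (combining motion sets when the voice is longer, selecting a subset when it is shorter) might be sketched like this. The function name, the `(name, duration)` clip representation, and the greedy first-fit strategy are all assumptions made for illustration; the source does not say how clips are chosen. A greedy pass can leave a small gap, which could then be closed with a speed adjustment.

```python
from typing import List, Tuple

Clip = Tuple[str, float]  # (hypothetical clip name, duration in seconds)


def total_duration(clips: List[Clip]) -> float:
    """Sum the durations of a clip sequence."""
    return sum(duration for _, duration in clips)


def fit_motion_to_voice(voice_s: float, motion: List[Clip],
                        library: List[Clip]) -> List[Clip]:
    """Pick an ordered clip sequence whose length does not exceed voice_s."""
    if total_duration(motion) < voice_s:
        # Voice is longer: extend the motion with clips from the library.
        candidates = motion + library
    else:
        # Voice is shorter (or equal): keep only some of the motions.
        candidates = motion
    chosen: List[Clip] = []
    remaining = voice_s
    for clip in candidates:
        # Greedy first-fit: take each clip that still fits in the time left.
        if clip[1] <= remaining + 1e-9:
            chosen.append(clip)
            remaining -= clip[1]
    return chosen
```

With a 6.0 s utterance, a 2.0 s "nod", and a library of "wave" (3.0 s), "bow" (2.0 s), and "blink" (1.0 s), the sketch selects nod, wave, and blink, which together last 6.0 s.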

Preferably, the method further includes generating the robot's life time axis, which comprises:
expanding the robot's self-recognition;
acquiring the parameters of the life time axis; and
matching the robot's self-recognition parameters with the parameters of the life time axis, thereby generating the robot's life time axis.

Preferably, the step of expanding the robot's self-recognition specifically includes combining life scenes with the robot's self-recognition to generate a self-recognition curve based on the life time axis.

Preferably, the step of matching the robot's self-recognition parameters with the parameters of the life time axis specifically includes using a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after a scene parameter on the time axis changes, and forming a matching curve.

Preferably, the life time axis is a time axis covering the 24 hours of a day, and the parameters on the life time axis include at least the daily activities the user performs along the time axis and the parameter values representing those activities.
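One way to picture the 24-hour life time axis described above is as a table of time slots, each holding an activity and its parameter values. The sketch below is only an assumed data layout: the class names, the whole-hour granularity, and the parameter names such as `fatigue` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TimeSlot:
    start_hour: int                 # inclusive, 0..23
    end_hour: int                   # exclusive, 1..24
    activity: str                   # daily activity in this slot
    params: Dict[str, float] = field(default_factory=dict)


class LifeTimeAxis:
    """A 24-hour axis made of contiguous, non-overlapping slots."""

    def __init__(self, slots: List[TimeSlot]) -> None:
        ordered = sorted(slots, key=lambda s: s.start_hour)
        expected_start = 0
        for slot in ordered:
            # Each slot must begin where the previous one ended.
            if slot.start_hour != expected_start or slot.end_hour <= slot.start_hour:
                raise ValueError("slots must cover the day without gaps")
            expected_start = slot.end_hour
        if expected_start != 24:
            raise ValueError("slots must cover all 24 hours")
        self.slots = ordered

    def at(self, hour: int) -> TimeSlot:
        """Return the slot that contains the given hour."""
        if not 0 <= hour < 24:
            raise ValueError("hour must be in 0..23")
        for slot in self.slots:
            if slot.start_hour <= hour < slot.end_hour:
                return slot
        raise AssertionError("unreachable: axis covers the whole day")
```

A usage example under the same assumptions: an axis built from slots for sleep (0 to 7), breakfast (7 to 9), daytime activity (9 to 22), and sleep again (22 to 24) would return the breakfast slot for `axis.at(8)`.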

A system for synchronizing voice and virtual motion, characterized by comprising:
an acquisition module that acquires a user's multi-mode information;
an artificial intelligence module that generates, based on the user's multi-mode information and a life time axis, interaction content including at least voice information and motion information; and
a control module that adjusts the time length of the voice information and the time length of the motion information to be the same.

Preferably, the control module is specifically configured so that:
when the difference between the time length of the voice information and the time length of the motion information is less than or equal to a threshold, and the time length of the voice information is shorter than that of the motion information, the playback speed of the motion information is increased so that the time length of the motion information equals the time length of the voice information.

Preferably, when the time length of the voice information is longer than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased, so that the time length of the motion information equals the time length of the voice information.

Preferably, the control module is specifically configured so that:
when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, and the time length of the voice information is longer than that of the motion information, at least two sets of motion information are combined so that the time length of the motion information equals the time length of the voice information.

Preferably, when the time length of the voice information is shorter than the time length of the motion information, some of the motions in the motion information are selected so that their combined time length equals the time length of the voice information.

Preferably, the system further performs the following:
expanding the robot's self-recognition;
acquiring the parameters of the life time axis; and
matching the robot's self-recognition parameters with the parameters of the life time axis to generate the robot's life time axis.

Preferably, the processing module is specifically used to combine life scenes with the robot's self-recognition to generate a self-recognition curve based on the life time axis.

Preferably, the processing module uses a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after a scene parameter on the time axis changes, and forms a matching curve.

The present invention also discloses a robot that includes the system for synchronizing voice and virtual motion according to any of the above.

A system for synchronizing voice and virtual motion includes a microphone, an analog-to-digital converter, a voice identification processor, an image acquisition device, a face recognition processor, a speech synthesizer, a power amplifier, a speaker, an imaging system, an interaction content processor, and a memory.

The microphone, the analog-to-digital converter, the voice identification processor, and the interaction content processor are connected in sequence; the image acquisition device, the face recognition processor, and the interaction content processor are connected in sequence; the interaction content processor is connected to the memory; the interaction content processor, the speech synthesizer, the power amplifier, and the speaker are connected in sequence; and the imaging system is connected to the interaction content processor.
The microphone acquires the user's voice signal when the user talks with the robot; the analog-to-digital converter converts the voice signal into digital voice information; and the voice identification processor converts the digital voice information into text information and inputs it to the intention identification processor.
The image acquisition device acquires an image of the user, and the face recognition processor identifies and extracts the user's expression information from that image and inputs it to the intention identification processor.
The interaction content processor generates interaction content including at least voice information and motion information, based on the user's multi-mode information, which includes at least the text information and the expression information, and on the life time axis stored in the memory, and adjusts the time length of the voice information and the time length of the motion information to be the same.
The imaging system generates a virtual 3D image according to the motion information, and the speaker plays the voice information at the same time.

Preferably, in the interaction content processor, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically includes:
when the difference between the time length of the voice information and the time length of the motion information is less than or equal to a threshold, and the time length of the voice information is shorter than that of the motion information, increasing the playback speed of the motion information so that the time length of the motion information equals the time length of the voice information.

Preferably, in the interaction content processor, when the time length of the voice information is longer than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased, so that the time length of the motion information equals the time length of the voice information.

Preferably, in the interaction content processor, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically includes:
when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, and the time length of the voice information is longer than that of the motion information, combining at least two sets of motion information in order so that the time length of the motion information equals the time length of the voice information.

Preferably, in the interaction content processor, when the time length of the voice information is shorter than the time length of the motion information, some of the motions in the motion information are selected so that their combined time length equals the time length of the voice information.

Preferably, in the system according to claim 12, the system further includes an artificial intelligence cloud processor, which is used to generate the robot's life time axis parameters, specifically by:
expanding the robot's self-recognition;
acquiring the user's life time axis parameters from the memory; and
matching the robot's self-recognition parameters with the parameters on the user's life time axis, thereby generating the robot's life time axis.

Preferably, in the artificial intelligence cloud processor, the step of expanding the robot's self-recognition specifically includes combining life scenes with the robot's self-recognition to generate a self-recognition curve based on the life time axis.

Preferably, in the artificial intelligence cloud processor, the step of matching the robot's self-recognition parameters with the parameters of the life time axis specifically includes using a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after a scene parameter on the time axis changes, and forming a matching curve.

Preferably, in the system according to claim 17, the operation of adjusting the time length of the voice information and the time length of the motion information to be the same includes:
acquiring the time position at which the robot is currently located on the life time axis;
comparing the time length of the voice information with the time length of the motion information, and determining, based on the robot's time position, whether to adjust the time length of the voice information or that of the motion information; and
adjusting the time length of the voice information and the time length of the motion information to be the same based on the result of that determination.

Conventional man-machine interaction technology usually supports only one interaction mode, or can output only a limited set of motions and expressions; commercially available child companion robots, for example, offer only four or five preset ones. Compared with the prior art, the present invention has the following advantage: its method for synchronizing voice and virtual motion includes acquiring a user's multi-mode information, generating interaction content including at least voice information and motion information based on the user's multi-mode information and a life time axis, and adjusting the time length of the voice information and the time length of the motion information so that they are synchronized. In this way, interaction content including at least voice information and motion information can be generated from one or more kinds of multi-mode information such as the user's voice, expressions, and motions. To synchronize the voice information and the motion information, their time lengths are adjusted to be the same, so that the robot's voice and motion match when played back together. The robot can then interact with people not only through speech but also through motion and other forms of expression, which diversifies its means of expression. The motions and expressions the robot can generate are no longer limited to a handful or a dozen or so: motion clips in the motion library can be spliced together freely, so the robot interacts in a more humanlike way and the user's interaction experience improves.

FIG. 1 is a flowchart of the method for synchronizing voice and virtual motion according to Embodiment 1 of the present invention.
FIG. 2 is a diagram of the system for synchronizing voice and virtual motion according to Embodiment 2 of the present invention.
FIG. 3 is a circuit block diagram of the system for synchronizing voice and virtual motion according to Embodiment 3 of the present invention.
FIG. 4 is a preferred circuit block diagram of the system for synchronizing voice and virtual motion according to Embodiment 3 of the present invention.
FIG. 5 is a diagram showing the system of Embodiment 3 combined with a wearable device.
FIG. 6 is a diagram showing the system of Embodiment 3 combined with a mobile terminal.
FIG. 7 is a diagram showing an application scenario in which the system of Embodiment 3 is combined with a robot.

Although the flowcharts describe the operations as sequential, many of them can be performed in parallel, in combination, or simultaneously. The order of the operations may be rearranged. A process may stop when its operations are completed, but it may also have additional steps not shown in the drawings. A process may correspond to a method, a function, a rule, a subroutine, a subprogram, and so on.

Computer devices include user devices and network devices. User devices or clients include, but are not limited to, computers, smartphones, and PDAs. Network devices include, but are not limited to, a single network server, a server group made up of multiple network servers, or a cloud made up of a large number of computers or network servers based on cloud computing. A computer device may implement the present invention by running on its own, or by accessing a network and interacting with other computer devices in that network. The network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, and a VPN.

Terms such as "first" and "second" may be used here to describe units, but the units are not limited by these terms; the terms are used only to distinguish one unit from another. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items. When a unit is said to be "connected" or "coupled" to another unit, it may be directly connected or coupled to that other unit, or intermediate units may be present.

The terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" used here are intended to include the plural as well. It should also be understood that the terms "include" and/or "contain" used here specify the presence of the stated features, integers, steps, operations, units, and/or modules, but do not exclude the presence or addition of one or more other features, integers, steps, operations, units, modules, and/or combinations thereof.
The present invention is described in more detail below with reference to the drawings and preferred embodiments.

Embodiment 1
As shown in FIG. 1, this embodiment discloses a method for synchronizing voice and virtual motion, characterized by comprising:
Step S101: acquiring a user's multi-mode information;
Step S102: generating, based on the user's multi-mode information and a life time axis, interaction content including at least voice information and motion information;
Step S103: adjusting the time length of the voice information and the time length of the motion information to be the same.

The method of the present invention for synchronizing voice and virtual motion includes acquiring a user's multi-mode information, generating interaction content including at least voice information and motion information based on the user's multi-mode information and a life time axis, and adjusting the time length of the voice information and the time length of the motion information so that they are synchronized. In this way, interaction content including at least voice information and motion information can be generated from one or more kinds of multi-mode information such as the user's voice, expressions, and motions. To synchronize the voice information and the motion information, their time lengths are adjusted to be the same, so that the robot's voice and motion match when played back together. The robot can then interact not only through speech but also through motion and other forms of expression, which diversifies its means of expression, makes the robot more humanlike, and improves the user's interaction experience with the robot.

For people, daily life has a certain regularity. To make the robot's interaction with people more humanlike, the robot is given activities across the 24 hours of a day, such as sleeping, exercising, eating, dancing, reading, and putting on makeup. The present invention therefore adds the life time axis on which the robot is located to the generation of the robot's interaction content, so that the robot interacts with people in a more humanlike way and follows a human lifestyle along the life time axis. This improves the humanlike quality of the generated interaction content and the man-machine interaction experience, and makes the robot more intelligent. The interaction content may be one of, or a combination of, expressions, text, voice, and motions. The robot's life time axis is matched and configured in advance; specifically, it is a collection of parameters that is passed to the system to generate interaction content.

The multi-mode information in this embodiment may be one or more of the user's facial expression information, voice information, gesture information, scene information, image information, video information, face information, iris information, light-sensing information, fingerprint information, and the like.

In this embodiment, being based on the life time axis specifically means the following. Starting from the time axis of human daily life, the robot's self-recognition values on its own daily life time axis are matched to human habits, and the robot acts according to this matching. In other words, the robot's behavior over one day is obtained, so that the robot carries out its own behavior, such as generating interaction content and conversing with humans, on the basis of the life time axis. While the robot stays awake, it acts according to the behavior on this time axis, and its self-recognition also changes along the time axis. The life time axis and the variable parameters can change self-recognition attributes such as the mood value and the fatigue value, and can also add new self-recognition information automatically. For example, if no anger value has existed so far, a scene based on the life time axis and the variable factors simulates the corresponding scene of human self-recognition from the earlier information and thereby adds it to the robot's self-recognition. The life time axis contains not only voice information but also information such as motion.

For example, suppose the user says "I'm sleepy" to the robot. The robot understands that the user is sleepy and then combines this with its life time axis. If the current time is 9 a.m., the robot knows that its owner has just gotten up and should be greeted, so it replies with speech such as "Good morning" and may also sing a song and add a matching dance motion. If the current time is 9 p.m., the robot knows that it is the owner's bedtime, so it replies with words such as "Good night" and may add a corresponding resting or sleeping motion. Such behavior is closer to human life than a reply made of speech and facial expression alone, and the motion is more anthropomorphic.
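The 9 a.m. and 9 p.m. cases above can be sketched as a lookup on the current hour. This is only an illustration: the hour ranges, the `Reply` type, the fallback reply, and the function name `reply_to_sleepy` are assumptions made for the sketch and are not specified in this description.

```python
from typing import NamedTuple


class Reply(NamedTuple):
    speech: str   # voice information of the interaction content
    motion: str   # motion information of the interaction content


# Hypothetical life time axis: (start hour, end hour, reply to "I'm sleepy").
# The ranges are illustrative; the description only gives 9 a.m. and 9 p.m.
LIFE_TIME_AXIS = [
    (5, 12, Reply("Good morning", "dance")),
    (20, 24, Reply("Good night", "sleep")),
]

# Assumed fallback for hours that the sketch's time axis does not cover.
DEFAULT_REPLY = Reply("Would you like a rest?", "nod")


def reply_to_sleepy(hour: int) -> Reply:
    """Pick speech and motion for "I'm sleepy" from the current hour (0-23)."""
    if not 0 <= hour < 24:
        raise ValueError("hour must be in 0..23")
    for start, end, reply in LIFE_TIME_AXIS:
        if start <= hour < end:
            return reply
    return DEFAULT_REPLY
```

Calling `reply_to_sleepy(9)` yields the morning greeting with a dance motion, and `reply_to_sleepy(21)` yields the good-night reply with a sleeping motion.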

In this embodiment, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is not greater than a threshold, and the time length of the voice information is less than the time length of the motion information, speeding up playback of the motion information so that the time length of the motion information becomes the same as the time length of the voice information.

When the time length of the voice information is greater than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased, so that the time length of the motion information becomes the same as the time length of the voice information.

Accordingly, when the difference between the time length of the voice information and the time length of the motion information is not greater than the threshold, adjustment may mean compressing or stretching the time length of the voice information and/or of the motion information, or it may mean speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2 or multiplying the playback time of the motion information by 0.8.

For example, suppose the threshold on the difference between the time length of the voice information and the time length of the motion information is 1 minute, and that in the interaction content the robot generates from the user's multi-mode information the voice information lasts 1 minute and the motion information lasts 2 minutes. The playback speed of the motion information may be doubled, so that its adjusted playback time becomes 1 minute and it is synchronized with the voice information. Alternatively, the playback speed of the voice information may be reduced to 0.5 times the original, so that its adjusted playback time becomes 2 minutes and it is synchronized with the motion information. Both may also be adjusted together: for example, the voice information is slowed down while the motion information is sped up so that both last 1 minute 30 seconds, which likewise synchronizes voice and motion.
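The arithmetic of this example can be sketched as follows. The function name `speed_factors` and its `mode` argument are hypothetical, and meeting at the mean of the two durations in the "both" case is an assumption that happens to reproduce the 1 minute 30 seconds of the example; the description does not prescribe how the common target is chosen.

```python
from typing import Tuple


def speed_factors(voice_s: float, motion_s: float, mode: str = "motion") -> Tuple[float, float]:
    """Return (voice_speed, motion_speed) multipliers that make both tracks end together.

    mode "motion": only the motion is rescaled to the voice duration.
    mode "voice":  only the voice is rescaled to the motion duration.
    mode "both":   both are rescaled to the mean duration (an assumption of this sketch).
    """
    if voice_s <= 0 or motion_s <= 0:
        raise ValueError("durations must be positive")
    if mode == "motion":
        target = voice_s
    elif mode == "voice":
        target = motion_s
    elif mode == "both":
        target = (voice_s + motion_s) / 2
    else:
        raise ValueError("unknown mode: " + mode)
    # Playing a track of length d at speed d / target makes it last exactly target seconds.
    return voice_s / target, motion_s / target
```

With a 60-second voice track and a 120-second motion track, `speed_factors(60, 120, "motion")` gives a motion speed of 2.0, and `speed_factors(60, 120, "voice")` gives a voice speed of 0.5, matching the figures in the example.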

In addition, in this embodiment, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, and the time length of the voice information is greater than the time length of the motion information, combining at least two sets of motion information in sequence so that the time length of the motion information becomes the same as the time length of the voice information.

When the time length of the voice information is less than the time length of the motion information, some of the motions in the motion information are selected so that the time length of the selected motions is the same as the time length of the voice information.

Accordingly, when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, adjustment means adding or removing part of the motion information so that the time length of the motion information is synchronized with the time length of the voice information.

For example, suppose the threshold on the difference between the time length of the voice information and the time length of the motion information is 30 seconds, and that in the interaction content the robot generates from the user's multi-mode information the voice information lasts 3 minutes and the motion information lasts 1 minute. Further motion information must then be added to the original motion information: for example, motion information lasting 2 minutes is found, and the two sets of motion information are combined so that their total time length matches that of the voice information. If motion information lasting 2 minutes 30 seconds is found instead of motion information lasting 2 minutes, part of its motions (some of its frames) is selected so that the selected portion lasts 2 minutes, and the result again matches the time length of the voice information.
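A minimal sketch of this combine-and-trim step is given below, under stated assumptions: motion information is modeled as a hypothetical `Clip` with a name and a duration in seconds, clips are concatenated in the order given, and trimming always cuts the tail of the last clip. The description only says that part of the motions (some frames) is selected, so which frames to keep is a choice made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Clip:
    name: str
    seconds: float


def fit_motion(voice_s: float, clips: Sequence[Clip]) -> List[Clip]:
    """Concatenate clips in order, trimming the last one used, so the total equals voice_s."""
    fitted: List[Clip] = []
    remaining = voice_s
    for clip in clips:
        if remaining <= 0:
            break
        used = min(clip.seconds, remaining)   # keep only the part that still fits
        fitted.append(Clip(clip.name, used))
        remaining -= used
    if remaining > 0:
        raise ValueError("not enough motion material to cover the voice")
    return fitted
```

For the 3-minute voice of the example, `fit_motion(180, [Clip("original", 60), Clip("extra", 150)])` keeps the original minute and the first 2 minutes of the 2 minute 30 second clip. The same function also covers the case where the motion is longer than the voice, since a single long clip is simply trimmed.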

In this embodiment, the motion information whose time length is closest to that of the voice information may be selected according to the time length of the voice information, and likewise the voice information whose time length is closest may be selected according to the time length of the motion information.

Selecting by time length in this way makes it easier for the control module to adjust the time lengths of the voice information and the motion information until they agree, and the adjusted playback is more natural and smoother.
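Selection by closest time length can be sketched in a few lines. The function name `closest_by_duration` and the representation of the candidates as a mapping from name to duration in seconds are assumptions of this sketch.

```python
from typing import Mapping


def closest_by_duration(target_s: float, durations: Mapping[str, float]) -> str:
    """Return the name of the candidate whose duration is nearest to target_s."""
    if not durations:
        raise ValueError("no candidates to choose from")
    # On a tie, min() keeps the first candidate in iteration order.
    return min(durations, key=lambda name: abs(durations[name] - target_s))
```

The same function serves both directions described above: pass the voice duration with a library of motions, or the motion duration with a set of candidate voice outputs. Starting from the nearest candidate keeps the later speed change small, which is why the adjusted playback looks more natural.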

In one embodiment, after the step of adjusting the time length of the voice information and the time length of the motion information to be the same, the method further comprises outputting the adjusted voice information and motion information to a virtual image for display.

Output thus takes place only after the two have been adjusted to match, and because the output can be shown on a virtual image, the virtual robot becomes more anthropomorphic and the user experience is improved.

In one embodiment, the method for generating the life time axis parameters of the robot comprises:
expanding the robot's self-recognition;
acquiring life time axis parameters; and
matching the robot's self-recognition parameters with the life time axis parameters to generate the robot's life time axis.

By adding the life time axis to the robot's own self-recognition in this way, the robot leads an anthropomorphic life. For example, the awareness of having lunch is added to the robot.

In one embodiment, the step of expanding the robot's self-recognition specifically comprises combining life scenes with the robot's self-recognition to form a self-recognition curve based on the life time axis.

In this way the life time axis can be concretely added to the robot's own parameters.

In one embodiment, the step of matching the robot's self-recognition parameters with the life time axis parameters specifically comprises using a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after the scene parameters of the time axis change, thereby forming a matching curve. In this way the robot's self-recognition parameters and the life time axis parameters can be concretely matched. The probability algorithm may be a Bayesian probability algorithm.

For example, over the 24 hours of a day the robot is given activities such as sleeping, exercising, eating, dancing, reading a book, and putting on makeup. Each activity affects the robot's own self-recognition. After the parameters on the life time axis have been combined and matched with the robot's self-recognition, the robot's self-recognition includes mood, fatigue value, intimacy, favorability, number of conversations, the robot's three-dimensional perception, age, height, weight, game scene value, game object value, place scene value, place object value, and so on. The robot can then identify the place scene in which it is located, such as a cafe or a bedroom.

The robot performs different activities along the time axis of a day, such as sleeping at night, eating at noon, and exercising during the day, and all of these scenes on the life time axis affect its self-recognition. The changes in these values are handled by a dynamic matching method based on a probability model, which fits the probability that each of these activities occurs on the time axis. Scene identification: identifying the place scene may change the geographic scene value in the self-recognition.
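One way to read the Bayesian step is as a posterior probability that a self-recognition parameter changes once a scene on the time axis has been observed. The sketch below is only an illustration of Bayes' rule under that reading: the function names, the prior, and the per-hour likelihoods are all hypothetical, and the description does not state how its probability model is actually built.

```python
from typing import List, Sequence, Tuple


def posterior_change(prior: float, p_scene_if_change: float, p_scene_if_no_change: float) -> float:
    """P(parameter changes | scene observed), computed with Bayes' rule."""
    evidence = p_scene_if_change * prior + p_scene_if_no_change * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the observed scene has zero probability under both hypotheses")
    return p_scene_if_change * prior / evidence


def matching_curve(prior: float, likelihoods: Sequence[Tuple[float, float]]) -> List[float]:
    """One posterior per time slot; each pair is (P(scene | change), P(scene | no change))."""
    return [posterior_change(prior, a, b) for a, b in likelihoods]
```

With a prior of 0.3 that the fatigue value changes, and an "exercise" scene that is four times as likely when it does (0.8 against 0.2), the posterior is 0.24 / 0.38, roughly 0.63. Evaluating this for each hour of the day gives a curve of the kind the text calls a matching curve.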

Embodiment 2
As shown in FIG. 2, the present invention discloses a system for synchronizing voice and virtual motion, comprising:
an acquisition module 201 for acquiring multi-mode information of a user;
an artificial intelligence module 202 for generating interaction content including at least voice information and motion information, based on the user's multi-mode information and the life time axis generated by a life time axis module 205; and
a control module 203 for adjusting the time length of the voice information and the time length of the motion information to be the same.

In this way, interaction content including at least voice information and motion information can be generated from one or more kinds of multi-mode information such as the user's voice, facial expression, and motion. To synchronize the voice information with the motion information, their time lengths are adjusted to be the same, so that the robot's voice and motion match when they are played back together. The robot can therefore interact not only through speech but also through motion and other forms of expression, which diversifies its means of expression, makes it more anthropomorphic, and improves the user's experience of interacting with the robot.

Human daily life follows a certain regularity. To make the interaction between the robot and humans more anthropomorphic, the robot is given activities over the 24 hours of a day such as sleeping, exercising, eating, dancing, reading a book, and putting on makeup. The present invention therefore adds the robot's life time axis to the generation of the robot's interaction content, so that the robot interacts with humans in a more anthropomorphic way and follows a human lifestyle along the life time axis. This improves the anthropomorphism of the generated interaction content and the human interaction experience, and increases the robot's intelligence. The interaction content may be one of, or a combination of several of, facial expression, text, voice, and motion. The robot's life time axis is matched and configured in advance; specifically, it is a collection of parameters that is passed to the system to generate the interaction content.

In this embodiment, the control module is specifically configured to:
when the difference between the time length of the voice information and the time length of the motion information is not greater than a threshold, and the time length of the voice information is less than the time length of the motion information, speed up playback of the motion information so that the time length of the motion information becomes the same as the time length of the voice information.

For example, suppose the threshold on the difference between the time length of the voice information and the time length of the motion information is 1 minute, and that in the interaction content the robot generates from the user's multi-mode information the voice information lasts 1 minute and the motion information lasts 2 minutes. The playback speed of the motion information may be doubled, so that its adjusted playback time becomes 1 minute and it is synchronized with the voice information. Alternatively, the playback speed of the voice information may be reduced to 0.5 times the original, so that its adjusted playback time becomes 2 minutes and it is synchronized with the motion information. Both may also be adjusted together: for example, the voice information is slowed down while the motion information is sped up so that both last 1 minute 30 seconds, which likewise synchronizes voice and motion.

In addition, in this embodiment, the control module is specifically configured to:
when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, and the time length of the voice information is greater than the time length of the motion information, combine at least two sets of motion information so that the time length of the motion information becomes the same as the time length of the voice information.

For example, suppose the threshold on the difference between the time length of the voice information and the time length of the motion information is 30 seconds, and that in the interaction content the robot generates from the user's multi-mode information the voice information lasts 3 minutes and the motion information lasts 1 minute. Further motion information must then be added to the original motion information: for example, motion information lasting 2 minutes is found, and the two sets of motion information are combined so that their total time length matches that of the voice information. If motion information lasting 2 minutes 30 seconds is found instead of motion information lasting 2 minutes, part of its motions (some of its frames) is selected so that the selected portion lasts 2 minutes, and the result again matches the time length of the voice information.

In this embodiment, the artificial intelligence module is specifically configured to select, according to the time length of the voice information, the motion information whose time length is closest to it, and likewise to select, according to the time length of the motion information, the voice information whose time length is closest.

In one embodiment, the system further comprises an output module 204 for outputting the adjusted voice information and motion information to a virtual image for display.

Output thus takes place only after the two have been adjusted to match, and because the output can be shown on a virtual image, the virtual robot becomes more anthropomorphic and the user experience is improved.

In one embodiment, the system comprises a time-axis-based artificial intelligence cloud processing module configured to:
expand the robot's self-recognition;
acquire life time axis parameters; and
match the robot's self-recognition parameters with the life time axis parameters to generate the robot's life time axis.

In one embodiment, the time-axis-based artificial intelligence cloud processing module is configured to combine life scenes with the robot's self-recognition to form a self-recognition curve based on the life time axis. In this way the life time axis can be concretely added to the robot's own parameters.

In one embodiment, the time-axis-based artificial intelligence cloud processing module is specifically configured to use a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after the scene parameters of the time axis change, thereby forming a matching curve. In this way the robot's self-recognition parameters and the life time axis parameters can be concretely matched. The probability algorithm may be a Bayesian probability algorithm.

The present invention also discloses a robot comprising the system for synchronizing voice and virtual motion according to any of the above.

Embodiment 3
As shown in FIG. 3, this embodiment discloses a system 300 for synchronizing voice and virtual motion, comprising a microphone 301, an analog-to-digital converter 302, a voice recognition processor 303, an image acquisition device 304, a face recognition processor 305, an interaction content processor 306, a speech synthesizer 307, a power amplifier 308, a speaker 309, an imaging system 310, and a memory 311.

The microphone 301, the analog-to-digital converter 302, the voice recognition processor 303, and the interaction content processor 306 are connected in sequence. The image acquisition device 304, the face recognition processor 305, and the interaction content processor 306 are connected in sequence. The interaction content processor 306 is connected to the memory 311. The interaction content processor 306, the speech synthesizer 307, the power amplifier 308, and the speaker 309 are connected in sequence. The imaging system 310 is connected to the interaction content processor 306.

The microphone 301 is used to capture the user's voice signal when the user interacts with the robot. The analog-to-digital converter 302 is used to convert the voice signal into digital voice information. The voice recognition processor 303 is used to convert the digital voice information into text information, which is input to the interaction content processor 306.

The image acquisition device 304 is used to capture an image containing the user, and the face recognition processor 305 is used to identify and extract the user's facial expression information from that image and input it to the interaction content processor 306. The image acquisition device 304 may be a video camera, a still camera, or the like. Besides the user's facial expression information, the user's surroundings, the user's motion information, and so on can also be identified, and this information may likewise be input to the interaction content processor 306 so that the generated interaction content better fits the user's current needs.

The interaction content processor 306 is used to generate interaction content including at least voice information and motion information, based on the user's multi-mode information, which includes at least the text information and the facial expression information, and on the life time axis stored in the memory 311, and to adjust the time length of the voice information and the time length of the motion information to be the same. Here, the voice information of the interaction content is generated first from the user's multi-mode information and the life time axis. Appropriate motion clips are then selected accordingly from the motion library in the memory 311, and suitable transition motions are added to produce complete motion information. The life time axis refers to a time axis covering the 24 hours of a day, and the parameters on the life time axis include at least the daily activities the user performs on the time axis and the values of the parameters representing those activities.

In the interaction content processor 306, the step of adjusting the time length of the voice information and the time length of the motion information to be the same is specifically as follows. When the difference between the time length of the voice information and the time length of the motion information is not greater than the threshold: if the time length of the voice information is less than that of the motion information, playback of the motion information is sped up so that the time length of the motion information becomes the same as that of the voice information; if the time length of the voice information is greater than that of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased, so that the time length of the motion information becomes the same as that of the voice information.

For example, suppose the threshold on the difference between the time length of the voice information and the time length of the motion information is 1 minute, and that in the interaction content the robot generates from the user's multi-mode information the voice information lasts 1 minute and the motion information lasts 2 minutes. The playback speed of the motion information may be doubled, so that its adjusted playback time becomes 1 minute and it is synchronized with the voice information. Alternatively, the playback speed of the voice information may be reduced to 0.5 times the original, so that its adjusted playback time becomes 2 minutes and it is synchronized with the motion information. Both may also be adjusted together: for example, the voice information is slowed down while the motion information is sped up so that both last 1 minute 30 seconds, which likewise synchronizes voice and motion.

In the interaction content processor 306, the step of adjusting the time length of the voice information and the time length of the motion information to be the same is further specified as follows. When the difference between the time length of the voice information and the time length of the motion information is greater than the threshold: if the time length of the voice information is greater than that of the motion information, at least two sets of motion information are combined in sequence so that the time length of the motion information becomes the same as that of the voice information; if the time length of the voice information is less than that of the motion information, some of the motions in the motion information are selected so that the time length of the selected motions is the same as that of the voice information.

Thus, when the difference between the time length of the voice information and the time length of the motion information is greater than the threshold, the adjustment consists of adding or deleting part of the motion information, so that the time length of the motion information is synchronized with the time length of the voice information.
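The add-or-delete adjustment described above can be sketched as follows. This is only an illustration under stated assumptions: the motion names, the `extra_groups` library, and the greedy "append while it fits / keep leading motions" policy are hypothetical, since the text does not say how groups are chosen or which motions are selected.

```python
from typing import List, Tuple

Motion = Tuple[str, float]  # (hypothetical motion name, duration in seconds)


def total(motions: List[Motion]) -> float:
    """Total duration of a motion sequence, in seconds."""
    return sum(duration for _, duration in motions)


def fit_motions(motions: List[Motion],
                extra_groups: List[List[Motion]],
                voice_s: float) -> List[Motion]:
    """Lengthen or shorten a motion sequence so it approaches voice_s seconds."""
    result = list(motions)
    if total(result) < voice_s:
        # Voice is longer: append further motion groups, in order, while they fit.
        for group in extra_groups:
            if total(result) + total(group) > voice_s:
                break
            result.extend(group)
    else:
        # Voice is shorter (or equal): keep only the leading motions that fit.
        kept: List[Motion] = []
        for motion in result:
            if total(kept) + motion[1] > voice_s:
                break
            kept.append(motion)
        result = kept
    return result


longer = fit_motions([("wave", 20), ("nod", 10)],
                     [[("bow", 15)], [("clap", 15)], [("spin", 30)]],
                     voice_s=60)
shorter = fit_motions([("wave", 20), ("nod", 10), ("spin", 30), ("bow", 15)],
                      [], voice_s=35)
```

Whole motions rarely sum to exactly the voice duration, so a small residual gap may remain (5 seconds in the second call); one option, not stated in the text, would be to close it with a playback-speed change.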

The imaging system 310 generates a virtual 3D image based on the motion information, and the speaker 309 simultaneously plays the voice signal. The imaging system 310 may be an ordinary display or a holographic projection device, which increases the three-dimensional feel and realism of the displayed robot and enhances the user experience.

The memory 311 can also be used to store data that the interaction content processor 306 uses during operation. Optionally, the interaction content processor 306 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a CPLD (Complex Programmable Logic Device).

The system 300 for synchronizing voice and virtual motion of this embodiment acquires the user's multi-mode information, generates interaction content including at least voice information and motion information based on the user's multi-mode information and the life time axis, and adjusts the time length of the voice information and the time length of the motion information to be the same. In this way, interaction content including at least voice information and motion information can be generated from one or more kinds of multi-mode information such as the user's voice, facial expression, and motion. Because the two time lengths are adjusted to be the same, the robot's voice and motion match when they are played back together. The robot can therefore interact not only through speech but also through motion and other forms of expression, which diversifies its means of expression, makes it more anthropomorphic, and improves the user's experience of interacting with it.

As shown in FIG. 4, the system 300 for synchronizing voice and virtual motion disclosed in this embodiment further includes a plurality of sensors 313 that acquire several physiological signals of the user. The signal preprocessor 314 preprocesses the physiological signals to obtain physiological parameters, which are sent to the interaction content processor 306. Correspondingly, the interaction content processor 306 generates the interaction content based on the text information, the facial expression information, and the physiological parameters, sends the interaction content to the imaging system 310, and sends the voice information in the interaction content to the speaker 309.

The sensors 313 in the system 300 for synchronizing voice and virtual motion include, but are not limited to, a light sensor, an iris recognition sensor, a fingerprint acquisition sensor, a temperature sensor, and a heart rate sensor. They enrich the multi-mode information with one or more kinds of physiological signals, such as the user's light-sensing information, iris information, fingerprint information, body temperature information, and heart rate information.

Hereinafter, the information acquired and output by the microphone 301, the image acquisition device 304, and the sensors 313 is collectively referred to as multi-mode information.

As shown in FIG. 5, some of the sensors 313 may be integrated into the system 300 for synchronizing voice and virtual motion, and some of the sensors 313 may be integrated into a wearable device 400. For example, a temperature sensor and a heart rate sensor may be integrated into a smart wristband, and the acquired information is sent through a wireless communication device to the interaction content processor 306 in the system 300. FIG. 5 shows only the connection between the wireless communication device and the interaction content processor 306 in the system 300; for the other connections in the system 300, refer to FIG. 3 and FIG. 4.

The system 300 further includes an artificial intelligence cloud processor used to generate the parameters of the robot's life time axis. Specifically, this includes expanding the robot's self-recognition, acquiring the user's life time axis parameters from the memory, and matching the robot's self-recognition parameters with the parameters on the user's life time axis, thereby generating the robot's life time axis. Here, the step of expanding the robot's self-recognition specifically includes combining life scenes with the robot's self-recognition to generate a self-recognition curve based on the life time axis. The step of matching the robot's self-recognition parameters with the parameters on the life time axis specifically includes using a probability algorithm to calculate, after a scene parameter on the time axis has changed, the probability that each parameter of the robot on the life time axis changes, and forming a matching curve. In this way the robot's self-recognition parameters can be concretely matched with the parameters of the life time axis. The probability algorithm may be a Bayesian probability algorithm.
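The text names a Bayesian probability algorithm but gives no formula, so the following is only one plausible reading, sketched under explicit assumptions: each point on the 24-hour life time axis has a prior probability that a robot parameter changes, and Bayes' rule updates that prior once a scene change is observed. The function names, the per-hour priors, and the likelihood values are all hypothetical.

```python
from typing import List


def posterior_change(prior_change: float,
                     p_scene_given_change: float,
                     p_scene_given_no_change: float) -> float:
    """P(parameter changes | scene changed) by Bayes' rule (assumed model)."""
    for p in (prior_change, p_scene_given_change, p_scene_given_no_change):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    evidence = (p_scene_given_change * prior_change
                + p_scene_given_no_change * (1.0 - prior_change))
    if evidence == 0.0:
        raise ValueError("scene change has zero probability under this model")
    return p_scene_given_change * prior_change / evidence


def matching_curve(priors_by_hour: List[float],
                   p_scene_given_change: float,
                   p_scene_given_no_change: float) -> List[float]:
    """One posterior per point on the life time axis (hypothetical curve)."""
    return [posterior_change(p, p_scene_given_change, p_scene_given_no_change)
            for p in priors_by_hour]


# Illustrative numbers only: a flat prior of 0.3 over 24 hours.
curve = matching_curve([0.3] * 24,
                       p_scene_given_change=0.8,
                       p_scene_given_no_change=0.2)
```

With these illustrative inputs each point of the curve equals 0.24 / 0.38; real priors and likelihoods would have to come from the recorded life time axis data, which this sketch does not model.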

The system of this embodiment can also use the robot's life time axis to guide the synchronization of the robot's voice and motion. Specifically, the step of adjusting the time length of the voice information and the time length of the motion information to be the same includes: acquiring the time position at which the robot is currently located on the life time axis; comparing the difference between the time length of the voice information and the time length of the motion information, and determining, based on the robot's time position, whether to adjust the time length of the voice information or that of the motion information; and adjusting the two time lengths to be the same according to the result of the determination. For example, if the man-machine interaction occurs while the robot is resting or has just woken up, and the motion lasts longer than the speech, the speech is lengthened. Because the robot is in a sluggish state, relatively slow motion and speech better match how humans express themselves in daily life.

When adjusting the time lengths of the voice information and the motion information, the system should consider not only the robot's current time position on the time axis but also the robot's current mood (long hours of operation, corresponding to human fatigue) and the robot's current state (remaining battery charge, corresponding to human hunger).
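The decision described in the two preceding paragraphs can be sketched as a small rule. Only one case comes from the text (a resting or just-woken robot whose motion is longer than its speech lengthens the speech); the rest window, the uptime and battery cut-offs, and the behavior in every other branch are assumptions made for illustration, as are the names `RobotState`, `is_sluggish`, and `choose_adjustment`.

```python
from dataclasses import dataclass


@dataclass
class RobotState:
    hour: int        # position on the 24-hour life time axis
    uptime_h: float  # hours of continuous operation (stand-in for fatigue)
    battery: float   # remaining charge in [0, 1] (stand-in for hunger)


def is_sluggish(state: RobotState) -> bool:
    """Assumed rule: resting hours, long uptime, or low battery mean 'slow'."""
    resting = state.hour >= 22 or state.hour < 7  # assumed rest/wake-up window
    return resting or state.uptime_h > 8 or state.battery < 0.2


def choose_adjustment(state: RobotState, voice_s: float, motion_s: float) -> str:
    """Pick which track to adjust so both end up with the same duration."""
    if voice_s == motion_s:
        return "none"
    if is_sluggish(state):
        # Sluggish robot: stretch the shorter track so everything plays slowly.
        return "lengthen_voice" if motion_s > voice_s else "lengthen_motion"
    # Assumed default for an alert robot: shorten the longer track instead.
    return "shorten_motion" if motion_s > voice_s else "shorten_voice"


just_woke = choose_adjustment(RobotState(hour=6, uptime_h=1, battery=0.9), 60, 120)
midday = choose_adjustment(RobotState(hour=14, uptime_h=2, battery=0.9), 60, 120)
```

Here `just_woke` is `"lengthen_voice"`, matching the example in the text, while `midday` follows the assumed default branch.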

The system 300 disclosed in this embodiment further includes a wireless communication device 314. As shown in FIG. 6, the wireless communication device 314 is connected to the interaction content processor 306, and the interaction content processor 306 is also used to send the interaction content to a mobile terminal 500. The mobile terminal 500 generates a virtual 3D image based on the motion information and simultaneously plays the voice information through its speaker. FIG. 6 shows only the connection between the wireless communication device and the interaction content processor 306 in the system 300; for the other connections in the system 300, refer to FIG. 3 and FIG. 4.

The system 300 for synchronizing a robot's voice and virtual motion disclosed in this embodiment enriches, in several respects, the ways in which the robot interacts with the user, so that the robot can interact with people in a more human-like manner. The system is expected to make the generated interaction content more anthropomorphic, improve the man-machine interaction experience, and increase the robot's intelligence.

As shown in FIG. 7, the system 300 for synchronizing voice and virtual motion may also be integrated inside a robot 600. The user's multi-mode information can be acquired by the voice acquisition device 612, the video camera 611, various sensors (not shown in FIG. 4), a GPS navigation device (not shown in FIG. 4), and the like provided on the robot 600, and sent to the interaction content processor 306. For example, when the user takes the robot to a certain place, the GPS navigation device obtains the user's location information, which is combined with the life time axis to obtain variable parameters. The self-recognition of the robot body is then expanded, and the self-recognition parameters are matched with the application scene parameters among the variable parameters, producing an anthropomorphic effect.

The interaction content processor 306 reads the program stored in the memory 311 and executes the following processes: acquiring the user's multi-mode information; generating interaction content including at least voice information and motion information based on the user's multi-mode information and the life time axis; synchronizing the time length of the voice information with the time length of the motion information; and outputting the voice information and motion information whose time lengths have been adjusted to be the same. The voice information output from the interaction content processor 306 is played by the audio system 613 of the robot 600. The host controller of the robot 600 converts the motion information output from the interaction content processor 306 into control signals for the robot's joints and controls the movement of each joint 614, so that the robot 600 performs motions synchronized with the voice. For example, the joints inside the head of the robot 600 control the head's side-to-side swing, back-and-forth swing, and nodding. The specific method of controlling the robot's movement is prior art and is not described in detail here.
The data processed by the interaction content processor 306 is transmitted over a wireless medium via the wireless communication device 314; the wireless communication device 314 also receives data and forwards it to the interaction content processor 306. Through the wireless communication device 314 the robot 600 can access the Internet, acquire and upload various user data over the Internet, and connect to the user's mobile terminal, which allows the user to interact with the robot and configure its settings.

The system for synchronizing voice and virtual motion can also be implemented as a software program running on an electronic device terminal. Taking a smartphone as an example, the information acquisition device reuses the smartphone's existing voice acquisition device, video camera, sensors, GPS navigation device, and the like to acquire the user's multi-mode information and sends it to the smartphone's built-in processor. The processor then reads the program stored in the memory and executes: a process of generating interaction content including at least voice information and motion information based on the user's multi-mode information and variable parameters; a process of adjusting the time length of the voice information and the time length of the motion information to be the same; and a process of outputting the voice information and motion information whose time lengths have been adjusted to be the same. The smartphone's screen displays the motion of the virtual robot while the speaker plays the voice. The smartphone's wireless communication module connects to external devices and networks to complete the data exchange.

The system for synchronizing voice and virtual motion of this embodiment can generate interaction content including at least voice information and motion information from one or more kinds of multi-mode information such as the user's voice, facial expression, and motion. To synchronize the voice information and the motion information, the time length of the voice information and the time length of the motion information are adjusted to be the same, so that the robot's voice and motion match when they are played back together. When interacting with people, the robot can use not only speech but also motion and other forms of expression, which diversifies its means of expression, allows it to interact in a more human-like manner, and improves the user's interaction experience.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, but the specific embodiments of the present invention are not limited to this description. A person skilled in the art may make various changes and improvements to the above embodiments without departing from the spirit of the present invention, and all such changes and improvements fall within the scope of protection of the present invention.

Claims

A method of synchronizing voice and virtual motion, comprising:
acquiring multi-mode information of a user;
generating, based on the user's multi-mode information and a life time axis, interaction content including at least voice information and motion information; and
adjusting the time length of the voice information and the time length of the motion information so that they are synchronized.

The method according to claim 1, wherein the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is less than or equal to a threshold and the time length of the voice information is smaller than the time length of the motion information, increasing the playback speed of the motion information so that the time length of the motion information equals the time length of the voice information.

The method according to claim 2, wherein, when the time length of the voice information is greater than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased so that the time length of the motion information equals the time length of the voice information.

The method according to claim 1, wherein the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is greater than a threshold and the time length of the voice information is greater than the time length of the motion information, combining at least two groups of motion information in sequence so that the time length of the motion information equals the time length of the voice information.

The method according to claim 4, wherein, when the time length of the voice information is smaller than the time length of the motion information, some of the motions in the motion information are selected so that the time length of the selected motions equals the time length of the voice information.

The method according to claim 1, further comprising generating the robot's life time axis, which comprises:
expanding the robot's self-recognition;
acquiring parameters of the life time axis; and
matching the robot's self-recognition parameters with the parameters of the life time axis, thereby generating the robot's life time axis.

The method according to claim 6, wherein the step of expanding the robot's self-recognition specifically comprises:
combining life scenes with the robot's self-recognition, thereby generating a self-recognition curve based on the life time axis.

The method according to claim 6, wherein the step of matching the robot's self-recognition parameters with the parameters of the life time axis specifically comprises:
using a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after a scene parameter on the time axis has changed, and forming a matching curve.

The method according to claim 1, wherein the life time axis is a time axis covering the 24 hours of a day, and the parameters on the life time axis include at least the daily activities that the user performs on the time axis and the values of the parameters representing those activities.

A system for synchronizing voice and virtual motion, comprising:
an acquisition module that acquires multi-mode information of a user;
an artificial intelligence module that generates interaction content including at least voice information and motion information based on the user's multi-mode information and a life time axis; and
a control module that adjusts the time length of the voice information and the time length of the motion information to be the same.

A robot comprising the system for synchronizing voice and virtual motion according to claim 10.

A system for synchronizing voice and virtual motion, comprising a microphone, an analog-digital converter, a voice identification processor, an image acquisition device, a face recognition processor, a speech synthesizer, a power amplifier, a speaker, an imaging system, an interaction content processor, and a memory, wherein:
the microphone, the analog-digital converter, the voice identification processor, and the interaction content processor are connected in sequence; the image acquisition device, the face recognition processor, and the interaction content processor are connected in sequence; the interaction content processor is connected to the memory; the interaction content processor, the speech synthesizer, the power amplifier, and the speaker are connected in sequence; and the imaging system is connected to the interaction content processor;
the microphone is used to acquire the user's voice signal when the user and the robot interact, the analog-digital converter is used to convert the voice signal into digital voice information, and the voice identification processor is used to convert the digital voice information into text information and input it to the intent identification processor;
the image acquisition device is used to acquire an image of the user, and the face recognition processor is used to identify the user's facial expression information from the image and input it to the interaction content processor;
the interaction content processor is used to generate interaction content including at least voice information and motion information based on the user's multi-mode information, which includes at least the text information and the facial expression information, and on the life time axis stored in the memory, and to adjust the time length of the voice information and the time length of the motion information to be the same; and
the imaging system generates a virtual 3D image according to the motion information, and the speaker simultaneously plays the voice information.

The system according to claim 12, wherein, in the interaction content processor, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is less than or equal to a threshold and the time length of the voice information is smaller than the time length of the motion information, increasing the playback speed of the motion information so that the time length of the motion information equals the time length of the voice information.

The system according to claim 13, wherein, in the interaction content processor, when the time length of the voice information is greater than the time length of the motion information, the playback speed of the voice information is increased and/or the playback speed of the motion information is decreased so that the time length of the motion information equals the time length of the voice information.

The system according to claim 12, wherein, in the interaction content processor, the step of adjusting the time length of the voice information and the time length of the motion information to be the same specifically comprises:
when the difference between the time length of the voice information and the time length of the motion information is greater than a threshold and the time length of the voice information is greater than the time length of the motion information, combining at least two groups of motion information in sequence so that the time length of the motion information equals the time length of the voice information.

The system according to claim 15, wherein, in the interaction content processor, when the time length of the voice information is smaller than the time length of the motion information, some of the motions in the motion information are selected so that the time length of the selected motions equals the time length of the voice information.

The system according to claim 12, further comprising an artificial intelligence cloud processor used to generate the parameters of the robot's life time axis, which specifically comprises:
expanding the robot's self-recognition;
acquiring the user's life time axis parameters from the memory; and
matching the robot's self-recognition parameters with the parameters on the user's life time axis, thereby generating the robot's life time axis.

The system according to claim 17, wherein, in the artificial intelligence cloud processor, the step of expanding the robot's self-recognition specifically comprises:
combining life scenes with the robot's self-recognition, thereby generating a self-recognition curve based on the life time axis.

The system according to claim 17, wherein, in the artificial intelligence cloud processor, the step of matching the robot's self-recognition parameters with the parameters on the life time axis specifically comprises:
using a probability algorithm to calculate the probability that each parameter of the robot on the life time axis changes after a scene parameter on the time axis has changed, thereby forming a matching curve.

The system according to claim 17, wherein the operation of adjusting the time length of the voice information and the time length of the motion information to be the same comprises:
acquiring the time position at which the robot is currently located on the life time axis;
comparing the difference between the time length of the voice information and the time length of the motion information, and determining, according to the time position at which the robot is located, whether to adjust the time length of the voice information or the time length of the motion information; and
adjusting the time length of the voice information and the time length of the motion information to be the same based on the result of the determination.