JP2022531072A

JP2022531072A - Interactive object driving method, apparatus, device, and storage medium

Info

Publication number: JP2022531072A
Application number: JP2021556973A
Authority: JP
Inventors: 子隆 ▲張▼; 文岩 ▲呉▼; 潜溢 ▲呉▼; ▲親▼▲親▼ ▲許▼
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2020-03-31
Filing date: 2020-11-18
Publication date: 2022-07-06
Anticipated expiration: 2040-11-18
Also published as: CN111459452B; TW202138970A; WO2021196645A1; CN111459452A; KR20210129713A; TWI760015B; JP7227395B2

Abstract

The present invention discloses a driving method, apparatus, device, and storage medium for an interactive object, where the interactive object is displayed on a display device. The method includes: obtaining driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, obtaining a control parameter value of the interactive object based on the driving data; and controlling a posture of the interactive object based on the control parameter value. [Selected drawing] FIG. 2

Description

The present invention relates to the field of computer technology, and specifically to a driving method, apparatus, device, and storage medium for an interactive object.

<Cross-reference to related applications>
The present invention claims priority to the Chinese patent application with application number 2020102461120, filed on March 31, 2020, the entire contents of which are incorporated herein by reference.

Human-computer interaction mostly takes input through keystrokes, touch, and voice, and responds by displaying images, text, or virtual characters on a display screen. At present, virtual characters are mostly enhancements built on voice assistants and merely output the device's voice.

Embodiments of the present invention provide a technical solution for driving an interactive object.

According to one aspect of the present invention, a method for driving an interactive object displayed on a display device is provided. The method includes: obtaining driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, obtaining a control parameter value of the interactive object based on the driving data; and controlling a posture of the interactive object based on the control parameter value.

In combination with any embodiment provided by the present invention, the method further includes controlling, based on the driving data, the display device to output voice and/or display text.

In combination with any embodiment provided by the present invention, determining the driving mode corresponding to the driving data includes: obtaining, based on the type of the driving data, a voice data sequence corresponding to the driving data, where the voice data sequence includes a plurality of voice data units; and, in response to detecting that a voice data unit contains target data, determining the driving mode of the driving data to be a first driving mode, where the target data corresponds to a preset control parameter value of the interactive object. Obtaining the control parameter value of the interactive object based on the driving data in response to the driving mode includes: in response to the first driving mode, using the preset control parameter value corresponding to the target data as the control parameter value of the interactive object.

In combination with any embodiment provided by the present invention, the target data includes a keyword or key character, and the keyword or key character corresponds to a preset control parameter value of a preset action of the interactive object; or the target data includes a syllable, and the syllable corresponds to a preset control parameter value of a preset mouth-shape action of the interactive object.

In combination with any embodiment provided by the present invention, determining the driving mode corresponding to the driving data includes: obtaining, based on the type of the driving data, a voice data sequence corresponding to the driving data, where the voice data sequence includes a plurality of voice data units; and, if no target data is detected in the voice data units, determining the driving mode of the driving data to be a second driving mode, where the target data corresponds to a preset control parameter value of the interactive object. Obtaining the control parameter value of the interactive object based on the driving data in response to the driving mode includes: in response to the second driving mode, obtaining feature information of at least one voice data unit in the voice data sequence; and obtaining a control parameter value of the interactive object corresponding to the feature information.

In combination with any embodiment provided by the present invention, the voice data sequence includes a phoneme sequence, and obtaining the feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain a first code sequence corresponding to the phoneme sequence; obtaining, based on the first code sequence, a feature code corresponding to at least one phoneme; and obtaining feature information of the at least one phoneme based on the feature code.
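As an illustration only, the phoneme-encoding step might be sketched as follows. The passage above does not fix the encoding scheme, so this sketch assumes each phoneme comes with a duration in time steps and builds one binary sequence per phoneme (1 while that phoneme is being spoken). The names `encode_phonemes` and `feature_code_at` are hypothetical.

```python
from typing import Dict, List, Tuple


def encode_phonemes(timed: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Build a binary code sequence per phoneme (an assumed encoding).

    `timed` holds (phoneme, duration in time steps) pairs in spoken order.
    """
    total = sum(duration for _, duration in timed)
    codes = {phoneme: [0] * total for phoneme, _ in timed}
    t = 0
    for phoneme, duration in timed:
        for i in range(t, t + duration):
            codes[phoneme][i] = 1  # phoneme is active at time step i
        t += duration
    return codes


def feature_code_at(codes: Dict[str, List[int]], t: int) -> List[int]:
    """Read the feature code at time step t, ordered by phoneme name."""
    return [codes[phoneme][t] for phoneme in sorted(codes)]


codes = encode_phonemes([("m", 2), ("a", 3)])
print(codes["m"], codes["a"])
```

Here the per-phoneme sequences together play the role of the first code sequence, and `feature_code_at` reads out the code for one time step.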

In combination with any embodiment provided by the present invention, the voice data sequence includes a voice frame sequence, and obtaining the feature information of at least one voice data unit in the voice data sequence includes: obtaining a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector for each voice frame in the voice frame sequence; obtaining, based on the first acoustic feature sequence, an acoustic feature vector corresponding to at least one voice frame; and obtaining feature information corresponding to the at least one voice frame based on the acoustic feature vector.
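The per-frame feature step can be illustrated with a minimal sketch. The passage does not say which acoustic features are used, so frame energy and zero-crossing rate stand in here as assumed placeholder features; `frame_features` and `first_acoustic_feature_sequence` are hypothetical names.

```python
from typing import List


def frame_features(frame: List[float]) -> List[float]:
    """Return a stand-in acoustic feature vector: [mean energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1) if len(frame) > 1 else 0.0
    return [energy, zcr]


def first_acoustic_feature_sequence(frames: List[List[float]]) -> List[List[float]]:
    """One feature vector per voice frame, in frame order."""
    return [frame_features(frame) for frame in frames]


sequence = first_acoustic_feature_sequence([[1.0, -1.0, 1.0, -1.0], [0.5, 0.5]])
print(sequence)
```

A real system would more likely use an established speech feature; the point of the sketch is only the shape of the data: one vector per frame, indexable by frame.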

In combination with any embodiment provided by the present invention, the control parameters of the interactive object include a facial posture parameter, the facial posture parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control the motion state of at least one facial muscle. Obtaining the control parameter value of the interactive object based on the driving data includes obtaining the facial muscle control coefficient of the interactive object based on the driving data. Controlling the posture of the interactive object based on the control parameter value includes driving the interactive object, based on the obtained facial muscle control coefficient, to make facial movements that match the driving data.

In combination with any embodiment provided by the present invention, the method further includes: obtaining driving data of a body posture associated with the facial posture parameter; and driving the interactive object to perform limb movements based on the driving data of the body posture associated with the facial posture parameter value.

In combination with any embodiment provided by the present invention, the control parameter value of the interactive object includes a control vector of at least one local region of the interactive object. Obtaining the control parameter value of the interactive object based on the driving data includes obtaining the control vector of the at least one local region of the interactive object based on the driving data. Controlling the posture of the interactive object based on the control parameter value includes controlling facial movements and/or limb movements of the interactive object based on the obtained control vector of the at least one local region.

In combination with any embodiment provided by the present invention, obtaining the control parameter value of the interactive object corresponding to the feature information includes inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the feature information.
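To make the data flow concrete, a minimal sketch of a recurrent forward pass is given below. It assumes a plain Elman-style cell with a tanh activation and hand-set weights; the passage does not describe the network architecture, and a real system would load trained weights. `rnn_forward` and its parameters are hypothetical names.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_forward(features: List[List[float]], w_in: Matrix,
                w_rec: Matrix, w_out: Matrix) -> List[List[float]]:
    """Map a sequence of feature vectors to a sequence of control parameter values."""
    hidden = [0.0] * len(w_rec)  # initial hidden state
    outputs = []
    for x in features:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]    # state carries over time steps
        outputs.append(matvec(w_out, hidden))   # control parameter values at this step
    return outputs


# Toy weights chosen by hand for illustration; not trained values.
params = rnn_forward([[1.0], [0.0]], w_in=[[1.0]], w_rec=[[1.0]], w_out=[[1.0]])
print(params)
```

Because the hidden state is carried forward, the output for the second input depends on the first, which is why a recurrent model suits sequential voice features.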

According to one aspect of the present invention, an apparatus for driving an interactive object displayed on a display device is provided. The apparatus includes: a first obtaining unit configured to obtain driving data of the interactive object and determine a driving mode of the driving data; a second obtaining unit configured to obtain, in response to the driving mode, a control parameter value of the interactive object based on the driving data; and a driving unit configured to control a posture of the interactive object based on the control parameter value.

According to one aspect of the present invention, an electronic device is provided. The electronic device includes a memory and a processor. The memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, carries out the method for driving an interactive object according to any embodiment provided by the present invention.

According to one aspect of the present invention, a computer-readable storage medium storing a computer program is provided. When the computer program is executed by a processor, the method for driving an interactive object according to any embodiment provided by the present invention is carried out.

According to the driving method, apparatus, device, and computer-readable storage medium for an interactive object of one or more embodiments of the present invention, the control parameter value of the interactive object is obtained based on the driving mode of the driving data of the interactive object, and the posture of the interactive object is controlled accordingly. The control parameter value is obtained by a different method for each driving mode, so that the interactive object displays a posture that matches the content of the driving data and/or the corresponding voice. This gives the target object the feeling of communicating with the interactive object and improves the target object's experience of interacting with it.

To explain the technical solutions of one or more embodiments of this specification or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments described in one or more embodiments of this specification, and a person skilled in the art can obtain other drawings from them without creative effort.
A schematic diagram of a display device in the method for driving an interactive object provided by at least one embodiment of the present invention.
A flowchart of the method for driving an interactive object provided by at least one embodiment of the present invention.
A schematic diagram of the process of performing feature encoding on a phoneme sequence provided by at least one embodiment of the present invention.
A schematic diagram of the process of obtaining control parameter values based on a phoneme sequence provided by at least one embodiment of the present invention.
A schematic diagram of the process of obtaining control parameter values based on a voice frame sequence provided by at least one embodiment of the present invention.
A schematic diagram of the structure of the apparatus for driving an interactive object provided by at least one embodiment of the present invention.
A schematic diagram of the structure of the electronic device provided by at least one embodiment of the present invention.

Exemplary embodiments are described in detail below, with examples shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise stated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as set out in the appended claims.

The term "and/or" in this specification merely describes an association between related objects and indicates that three relationships are possible. For example, "A and/or B" covers three cases: A alone, both A and B, and B alone. The term "at least one" in this specification means any one of a plurality of items or any combination of at least two of them. For example, "including at least one of A, B, and C" means including any one or more elements selected from the set consisting of A, B, and C.

At least one embodiment of the present invention provides a method for driving an interactive object. The driving method may be performed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or an in-vehicle terminal. The server may be a local server, a cloud server, or the like. The method may also be implemented by a processor calling computer-readable instructions stored in a memory.

In embodiments of the present invention, the interactive object may be any virtual image capable of interacting with a target object. In one embodiment, the interactive object may be a virtual character, or another virtual image capable of providing interactive functions, such as a virtual animal, a virtual item, or a cartoon image. The interactive object may be displayed in 2D or 3D, and the present invention is not limited in this respect. The target object may be a user, a robot, or another smart device. The interactive object may interact with the target object actively or passively. In one example, the target object issues a request by making a gesture or a limb movement, thereby actively triggering the interactive object to interact. In another example, the interactive object takes the initiative by greeting the target object or prompting it to perform an action, so that the target object interacts with the interactive object passively.

The interactive object may be displayed by a terminal device. The terminal device may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like, and the present invention does not limit the specific form of the terminal device.

FIG. 1 shows a display device provided by at least one embodiment of the present invention. As shown in FIG. 1, the display device has a transparent display screen and can present a virtual scene and an interactive object with a three-dimensional effect by displaying a stereoscopic image on that screen. For example, the interactive object displayed on the transparent display screen in FIG. 1 includes a virtual cartoon character. In some embodiments, the terminal device described in the present invention may be the display device with the transparent display screen described above. The display device is provided with a memory and a processor. The memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, carries out the method for driving an interactive object provided by the present invention, thereby driving the interactive object displayed on the transparent display screen to communicate with or respond to the target object.

In some embodiments, in response to voice driving data for driving the interactive object to output voice, the interactive object can utter a specified voice to the target object. The terminal device can generate the voice driving data based on the actions, facial expressions, identity, preferences, and so on of a target object near the terminal device, and thereby drive the interactive object to communicate or respond by uttering the specified voice, providing an anthropomorphic service to the target object. Note that the voice driving data may also be generated in other ways, for example generated by a server and sent to the terminal device.

While the interactive object interacts with the target object, driving the interactive object to utter the specified voice based on the voice driving data may not also drive it to make facial movements synchronized with that voice. The interactive object then appears stiff and unnatural when speaking, which can impair the target object's experience of interacting with it. In view of this, at least one embodiment of the present invention proposes a method for driving an interactive object that improves the target object's experience of interacting with the interactive object.

FIG. 2 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present invention, where the interactive object is displayed on a display device. As shown in FIG. 2, the method includes steps 201 to 203.

In step 201, driving data of the interactive object is obtained, and a driving mode of the driving data is determined.

In embodiments of the present invention, the voice driving data may include audio data (voice data), text, and the like. The voice driving data may be driving data generated by a server or a terminal device based on the actions, facial expressions, identity, preferences, and so on of the target object interacting with the interactive object, or it may be voice driving data that the terminal device retrieves directly from its internal memory. The present invention does not limit how the voice driving data is obtained.

The driving mode of the driving data can be determined based on the type of the driving data and the information contained in the driving data.

In one example, a voice data sequence corresponding to the driving data can be obtained based on the type of the driving data, where the voice data sequence includes a plurality of voice data units. A voice data unit may be a character or a word, or a phoneme or a syllable. For text-type driving data, a character sequence, a word sequence, or the like corresponding to the driving data can be obtained. For audio-type driving data, a phoneme sequence, a syllable sequence, a voice frame sequence, or the like corresponding to the driving data can be obtained. In one embodiment, audio data and text data can be converted into each other. For example, audio data can be converted into text data and then divided into voice data units, or text data can be converted into audio data and then divided into voice data units. The present invention is not limited in this respect.
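The type-dependent segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `DriveData` class, the `"text"` and `"audio"` type tags, whitespace word splitting, and fixed-length non-overlapping frames are all hypothetical choices, not details taken from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class DriveData:
    kind: str                              # "text" or "audio" (assumed type tags)
    payload: Union[str, Sequence[float]]   # text string or raw audio samples


def to_voice_data_sequence(data: DriveData, frame_len: int = 4) -> List:
    """Split driving data into voice data units according to its type."""
    if data.kind == "text":
        # Text-type data: a word sequence (whitespace splitting is an assumption).
        return str(data.payload).split()
    if data.kind == "audio":
        # Audio-type data: a voice frame sequence of fixed-length frames.
        samples = list(data.payload)
        return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    raise ValueError(f"unsupported driving data type: {data.kind}")


print(to_voice_data_sequence(DriveData("text", "please wave now")))
```

Phoneme or syllable segmentation would need a speech front end and is left out of the sketch.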

If target data is detected in a voice data unit, the driving mode of the driving data can be determined to be the first driving mode, where the target data corresponds to a preset control parameter value of the interactive object.

The target data may be a configured keyword or key character, where the keyword or key character corresponds to a preset control parameter value of a preset action of the interactive object.

In embodiments of the present invention, each item of target data is matched with a preset action in advance. Each preset action is carried out under the control of corresponding control parameter values, so each item of target data is matched with the control parameter values of a preset action. Taking the keyword "wave" as an example, if a voice data unit includes "wave" in text form and/or "wave" in voice form, it can be determined that the driving data includes target data.
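The matching between target data and preset control parameter values can be pictured as a lookup table. The sketch below is illustrative only: the parameter names (`right_arm_raise` and so on), their values, and the case-insensitive match are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical table: keyword -> preset control parameter values of its action.
PRESET_CONTROL_PARAMS: Dict[str, Dict[str, float]] = {
    "wave": {"right_arm_raise": 0.8, "right_hand_swing": 1.0},
    "nod": {"head_pitch": 0.3},
}


def detect_target_data(units: List[str]) -> List[str]:
    """Return the voice data units that are configured keywords."""
    return [unit for unit in units if unit.lower() in PRESET_CONTROL_PARAMS]


def preset_params_for(unit: str) -> Dict[str, float]:
    """Look up the preset control parameter values matched to a keyword."""
    return PRESET_CONTROL_PARAMS[unit.lower()]


hits = detect_target_data(["please", "Wave", "now"])
print(hits, [preset_params_for(hit) for hit in hits])
```

Keeping the table as data rather than code means new keywords and actions can be added without touching the detection logic.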

For example, the target data includes a syllable, and the syllable corresponds to a preset control parameter value of a preset mouth-shape action of the interactive object.

The syllable corresponding to the target data belongs to one of several syllable types defined in advance, and each syllable type is matched with a different preset mouth shape. Here, a syllable is a speech unit formed by combining at least one phoneme. Syllables may include syllables of pinyin languages and syllables of non-pinyin languages (for example, Chinese). A syllable type groups syllables whose pronunciation actions are identical or substantially identical, and different syllable types correspond to different actions of the interactive object. In one embodiment, different syllable types correspond to different preset mouth shapes of the interactive object when it speaks, that is, to different pronunciation actions. In this case, each syllable type is matched with the control parameter values of its own preset mouth shape. For example, pinyin syllables such as "ma", "man", and "mang" have substantially the same pronunciation action, so they can be treated as one type, and all of them correspond to the control parameter values of the "mouth open" shape of the interactive object when it speaks.
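The two-level mapping described above (syllable to syllable type, then type to mouth-shape control parameter values) can be sketched as below. The "ma", "man", "mang" grouping comes from the text; the type names, the extra "wu" entry, and every parameter name and value are hypothetical.

```python
from typing import Dict, Optional

# Syllable -> syllable type. "ma"/"man"/"mang" share a type as in the text;
# the "wu" entry is an invented extra example.
SYLLABLE_TYPES: Dict[str, str] = {
    "ma": "open", "man": "open", "mang": "open",
    "wu": "rounded",
}

# Syllable type -> preset mouth-shape control parameter values (illustrative).
MOUTH_SHAPE_PARAMS: Dict[str, Dict[str, float]] = {
    "open": {"jaw_open": 0.7, "lip_round": 0.0},
    "rounded": {"jaw_open": 0.2, "lip_round": 0.9},
}


def mouth_shape_for(syllable: str) -> Optional[Dict[str, float]]:
    """Return preset mouth-shape parameters, or None if the syllable has no type."""
    syllable_type = SYLLABLE_TYPES.get(syllable)
    return None if syllable_type is None else MOUTH_SHAPE_PARAMS[syllable_type]


print(mouth_shape_for("mang"))
```

Grouping by type keeps the parameter table small: many syllables share a handful of mouth shapes.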

If no target data is detected in the voice data units, the driving mode of the driving data can be determined to be the second driving mode, where the target data corresponds to a preset control parameter value of the interactive object.

A person skilled in the art should understand that the first and second driving modes described above are merely examples, and embodiments of the present invention do not limit the specific driving modes.
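The mode decision in step 201 reduces to a membership test, as the following minimal sketch shows. The `DriveMode` enum and the function name are hypothetical, and the set of target data is supplied by the caller.

```python
from enum import Enum
from typing import Iterable, Set


class DriveMode(Enum):
    FIRST = 1    # target data found: use preset control parameter values
    SECOND = 2   # no target data: derive values from feature information


def determine_drive_mode(units: Iterable[str], target_data: Set[str]) -> DriveMode:
    """Pick the first mode if any voice data unit is target data, else the second."""
    if any(unit in target_data for unit in units):
        return DriveMode.FIRST
    return DriveMode.SECOND


print(determine_drive_mode(["please", "wave"], {"wave", "nod"}))
print(determine_drive_mode(["hello", "there"], {"wave", "nod"}))
```

An enum leaves room for further modes, in line with the note that the two modes are only examples.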

In step 202, in response to the driving mode, a control parameter value of the interactive object is obtained based on the driving data.

The control parameter value of the interactive object can be obtained in a manner suited to each driving mode of the driving data.

In one example, in response to the first driving mode determined in step 201, the preset control parameter value corresponding to the target data can be used as the control parameter value of the interactive object. For example, in the first driving mode, the preset control parameter value corresponding to the target data contained in the voice data sequence (for example, "wave") can be used as the control parameter value of the interactive object.

In one example, in response to the second drive mode determined in step 201, feature information of at least one voice data unit in the voice data sequence can be acquired, and the control parameter values of the interactive object corresponding to that feature information can be acquired. That is, when no target data is detected in the voice data sequence, the corresponding control parameter values can be acquired based on the feature information of the voice data units. The feature information may include feature information of a voice data unit obtained by performing feature encoding on the voice data sequence, feature information of a voice data unit obtained from the acoustic feature information of the voice data sequence, and the like.
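
The two drive modes can be sketched as a simple dispatch. This is an illustration under stated assumptions, not the claimed implementation: `PRESET_PARAMS`, its entry, and the caller-supplied `feature_to_params` function (standing in for whatever maps feature information to control parameter values) are all hypothetical.

```python
FIRST_MODE, SECOND_MODE = "first", "second"

# Hypothetical preset table: target data -> predetermined control parameter values.
PRESET_PARAMS = {"wave": {"right_arm_raise": 1.0}}


def determine_mode(units):
    """First mode if any voice data unit is known target data, else second mode."""
    if any(unit in PRESET_PARAMS for unit in units):
        return FIRST_MODE
    return SECOND_MODE


def control_params(units, feature_to_params):
    """Return control parameter values for a sequence of voice data units.

    feature_to_params is a caller-supplied stand-in (an assumption) for the
    feature-based path used in the second drive mode.
    """
    if determine_mode(units) == FIRST_MODE:
        # First mode: use the preset values of the detected target data.
        return [PRESET_PARAMS[u] for u in units if u in PRESET_PARAMS]
    # Second mode: derive values from each unit's feature information.
    return [feature_to_params(u) for u in units]
```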

In step 203, the pose of the interactive object is controlled based on the control parameter values.

In some embodiments, the control parameters of the interactive object include facial pose parameters, and the facial pose parameters include facial muscle control coefficients, each of which is used to control the motion state of at least one facial muscle. In one embodiment, the facial muscle control coefficients of the interactive object can be acquired based on the drive data, and the interactive object can be driven, based on the acquired coefficients, to perform facial actions that match the drive data.

In some embodiments, the control parameter values of the interactive object include a control vector of at least one local region of the interactive object. In one embodiment, the control vector of at least one local region of the interactive object can be acquired based on the drive data, and the facial actions and/or limb actions of the interactive object can be controlled based on the acquired control vector of the at least one local region.

The pose of the interactive object is controlled by acquiring the control parameter values of the interactive object according to the drive mode of its drive data. Because the control parameter values are acquired by different methods for different drive modes, the interactive object presents a pose that matches the content of the drive data and/or the corresponding speech. This gives the target object the feeling of communicating with the interactive object and improves the target object's interactive experience with the interactive object.

In some embodiments, the display device can further be controlled, based on the drive data, to output speech and/or display text. The pose of the interactive object can be controlled based on the control parameter values at the same time as the speech is output and/or the text is displayed.

In the embodiments of the present invention, the control parameter values are matched to the drive data. Therefore, when the output of speech and/or display of text based on the drive data is synchronized with the control of the interactive object's pose based on the control parameter values, the pose performed by the interactive object is also synchronized with the output speech and/or the displayed text, which gives the target object the feeling of communicating with the interactive object.

In some embodiments, the voice data sequence includes a phoneme sequence. In response to the drive data including audio data, the audio data can be divided into multiple audio frames, and the audio frames can be combined according to their states to form phonemes. The phonemes formed from the audio data make up the phoneme sequence. Here, a phoneme is the smallest speech unit divided according to the natural attributes of speech, and one pronunciation action of a real person can form one phoneme. In response to the drive data being text, the phonemes corresponding to the morphemes contained in the text can be obtained, which yields the corresponding phoneme sequence.

In some embodiments, the feature information of at least one voice data unit in the voice data sequence can be acquired by the following method: performing feature encoding on the phoneme sequence to obtain a first code sequence corresponding to the phoneme sequence; acquiring, based on the first code sequence, a feature code corresponding to at least one phoneme; and obtaining the feature information of the at least one phoneme based on the feature code.

FIG. 3 is a schematic diagram showing the process of performing feature encoding on a phoneme sequence. As shown in FIG. 3, the phoneme sequence 310 includes the phonemes j, i1, j, ie4 (for simplicity, only some phonemes are shown), and code sequences 321, 322, and 323 are obtained for the phonemes j, i1, and ie4, respectively. In each code sequence, the code value at each time point where the phoneme is present is set to a first value (for example, 1), and the code value at each time point where the phoneme is absent is set to a second value (for example, 0). Taking code sequence 321 as an example, its value is the first value 1 at the time points where the phoneme j is present in the phoneme sequence 310, and the second value 0 at the time points where the phoneme j is absent. Together, the code sequences 321, 322, and 323 make up the complete code sequence 320.

The feature information of the code sequences 321, 322, and 323 can be obtained based on the code values of the code sequences 321, 322, and 323 corresponding to the phonemes j, i1, and ie4, and on the duration of the corresponding phoneme in each of the three code sequences, that is, the duration of j in code sequence 321, the duration of i1 in code sequence 322, and the duration of ie4 in code sequence 323.

For example, a Gaussian filter can be used to perform a Gaussian convolution over the temporally continuous values of the phonemes j, i1, and ie4 in the code sequences 321, 322, and 323, respectively, to obtain the feature information of the code sequences. In other words, performing a Gaussian convolution over the temporally continuous values of each phoneme smooths the stage in which the code values in each code sequence change from the second value to the first value or from the first value to the second value. Performing the Gaussian convolution on each of the code sequences 321, 322, and 323 yields the feature values of each code sequence. Here, the feature values constitute the parameters of the feature information, and the feature information 330 corresponding to the phoneme sequence 310 is obtained from the set of feature information of the individual code sequences. Those skilled in the art should understand that other operations can be performed on each code sequence to obtain its feature information, and the present invention is not limited in this respect.
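
The encoding and smoothing described for FIG. 3 can be illustrated as follows. This is a minimal sketch under assumptions: the per-time-step timeline, the kernel radius, and the sigma value are hypothetical, and the sequence is zero-padded at both ends, which the description does not specify.

```python
import math


def encode(phoneme_timeline, phoneme):
    """Code sequence: 1.0 at time steps where `phoneme` is present, else 0.0."""
    return [1.0 if p == phoneme else 0.0 for p in phoneme_timeline]


def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(code, radius=2, sigma=1.0):
    """Gaussian convolution of a code sequence (zero padding at the edges)."""
    kernel = gaussian_kernel(radius, sigma)
    smoothed = []
    for t in range(len(code)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - radius
            if 0 <= idx < len(code):
                acc += weight * code[idx]
        smoothed.append(acc)
    return smoothed


# Hypothetical timeline: one phoneme label per fixed-length time step.
timeline = ["j", "j", "i1", "i1", "i1", "j", "j", "ie4", "ie4", "ie4"]
codes = {p: encode(timeline, p) for p in ("j", "i1", "ie4")}
features = {p: smooth(c) for p, c in codes.items()}
```

After smoothing, the values near each phoneme boundary lie strictly between 0 and 1 instead of jumping between the two code values.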

In the embodiments of the present invention, obtaining the feature information of the code sequences based on the duration of each phoneme in the phoneme sequence smooths the transitions in the code sequences. For example, in addition to 0 and 1, the values of a code sequence may take intermediate values such as 0.2 or 0.3. Pose parameter values acquired from these intermediate values make the pose changes of the interactive character smoother and more natural, in particular the changes in its facial expression, which improves the target object's interactive experience.

In some embodiments, the facial pose parameters may include facial muscle control coefficients.

From an anatomical point of view, the movement of the human face is the result of the coordinated deformation of various facial muscles. Therefore, the facial muscles of the interactive object can be divided to obtain a facial muscle model, and the movement of each resulting muscle (region) can be controlled by its corresponding facial muscle control coefficient. That is, contraction/expansion control is applied to each muscle so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the position of the muscle on the face and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the control coefficient ranges from 0 to 1, different values in this range correspond to different contraction/expansion states of the upper lip muscle, and changing the value opens and closes the mouth vertically. For the left mouth-corner muscle, the control coefficient ranges from 0 to 1, different values in this range correspond to different contraction/expansion states of the left mouth-corner muscle, and changing the value changes the mouth horizontally.

While speech is output based on the phoneme sequence, the interactive object is driven to make facial expressions based on the facial muscle control coefficients corresponding to the phoneme sequence. As a result, when the display device outputs the speech, the interactive object simultaneously makes the expression of uttering that speech, which gives the target object the feeling that the interactive object is speaking and improves the target object's interactive experience.

In some embodiments, the facial actions of the interactive object can be associated with body poses. That is, the facial pose parameter values corresponding to a facial action can be associated with a body pose, and the body pose may include limb actions, gesture actions, walking poses, and the like.

In the process of driving the interactive object, the drive data of the body pose associated with the facial pose parameter values is acquired, and while speech is output based on the phoneme sequence, the interactive object is driven to perform limb actions based on that drive data. That is, while the interactive object is driven to perform facial actions based on its voice drive data, the drive data of the associated body pose is also acquired based on the facial pose parameter values corresponding to those facial actions. When the speech is output, the interactive object is thus driven to perform the corresponding facial actions and limb actions in synchrony, which makes its speaking state more vivid and natural and improves the target object's interactive experience.

Because the speech output needs to remain continuous, in one embodiment a time window is moved along the phoneme sequence, and the phonemes within the time window are output at each move. Here, a predetermined duration is set as the step size of each move of the time window. For example, the length of the time window can be set to 1 second and the predetermined duration to 0.1 seconds. While the phonemes within the time window are output, the pose parameter values corresponding to the phoneme, or to the feature information of the phoneme, at a predetermined position in the time window are acquired and used to control the pose of the interactive object. The predetermined position is a position at a predetermined duration from the start of the time window. For example, when the length of the time window is set to 1 s, the predetermined position may be 0.5 s from the start of the time window. Each time the time window is moved, the phonemes within it are output while the pose of the interactive object is controlled with the pose parameter values corresponding to the predetermined position in the window. This synchronizes the pose of the interactive object with the output speech and gives the target object the feeling of talking with the interactive object.
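
The moving time window can be sketched as a generator that, for each step, reports the phonemes inside the window and the phoneme at the predetermined position. This is an illustration only: representing phonemes as `(start, end, label)` tuples in seconds is an assumption, and the lookup from the probed phoneme to pose parameter values is left out.

```python
def window_steps(phonemes, window=1.0, step=0.1, offset=0.5):
    """Slide a time window along a phoneme sequence.

    phonemes: list of (start_time, end_time, label), times in seconds
              (a hypothetical representation).
    Yields (window_start, labels_in_window, label_at_offset), where
    label_at_offset is the phoneme at `offset` seconds after the window start.
    """
    end_of_speech = max(end for _, end, _ in phonemes)
    n = 0
    # Index by step count to avoid accumulating floating-point error.
    while n * step + window <= end_of_speech + 1e-9:
        start = n * step
        in_window = [label for s, e, label in phonemes
                     if s < start + window and e > start]
        probe = start + offset
        at_probe = next((label for s, e, label in phonemes if s <= probe < e),
                        None)
        yield start, in_window, at_probe
        n += 1


phonemes = [(0.0, 0.4, "j"), (0.4, 0.9, "i1"), (0.9, 1.2, "j"), (1.2, 1.6, "ie4")]
steps = list(window_steps(phonemes))
```

With a 1 s window, a 0.1 s step, and a 0.5 s offset, the first window covers j, i1, and the second j, and the pose for that step would be taken from i1.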

Changing the predetermined duration changes the time interval (frequency) at which pose parameter values are acquired, and therefore the frequency at which the interactive object changes its pose. The predetermined duration can be set according to the actual interaction scene so that the pose changes of the interactive object look more natural.

In some embodiments, the pose of the interactive object can be controlled by obtaining a control vector of at least one local region of the interactive object.

The local regions are obtained by dividing the whole of the interactive object (including the face and/or the body). Control of one or more local regions of the face may correspond to a series of facial expressions or actions of the interactive object. For example, control of the eye region may correspond to facial actions such as opening the eyes, closing the eyes, winking, and changing the viewing angle. As another example, control of the mouth region may correspond to facial actions such as closing the mouth and opening it to different degrees. Control of one or more local regions of the body may correspond to a series of limb actions of the interactive object. For example, control of the leg and foot region may correspond to actions such as walking, jumping, and kicking.

The control parameters of a local region of the interactive object include the pose control vector of that local region. The pose control vector of each local region is used to drive the actions of that region of the interactive object. Different pose control vector values correspond to different actions or action amplitudes. For example, for the pose control vector of the mouth region, one set of values can make the interactive object open its mouth slightly, and another set of values can make it open its mouth wide. Driving the interactive object with different pose control vector values makes the corresponding local region perform different actions or actions of different amplitudes.

The local regions can be selected according to the actions of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to act at the same time, the pose control vector values of all local regions can be acquired; when only the expression of the interactive object needs to be controlled, the pose control vector values of the local regions corresponding to the face can be acquired.

In some embodiments, the feature code corresponding to at least one phoneme can be acquired by sliding a window over the first code sequence. Here, the first code sequence may be the code sequence after the Gaussian convolution.

A time window of a predetermined length is slid over the code sequence with a predetermined step size, the feature code within the time window is taken as the feature code of the corresponding at least one phoneme, and after the sliding is complete, a second code sequence can be obtained from the resulting feature codes. As shown in FIG. 4, sliding a time window of a predetermined length over the first code sequence 320, or over the smoothed first code sequence 430, yields feature code 1, feature code 2, feature code 3, and so on. After the first code sequence has been traversed, feature code 1, feature code 2, feature code 3, ..., feature code M are obtained, which gives the second code sequence 440. Here, M is a positive integer whose value is determined by the length of the first code sequence, the length of the time window, and the step size with which the time window is slid.
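
The window sliding over the first code sequence, and the resulting count M, can be sketched as below. The representation of the code sequence as a list of per-time-step vectors is an assumption, and so is the choice to drop a trailing partial window; the description does not say how the end of the sequence is handled.

```python
def feature_codes(code_seq, window_len, step):
    """Slide a window over the first code sequence.

    code_seq: list of per-time-step code vectors (hypothetical layout).
    Returns the list of windows (feature codes 1..M) that together form
    the second code sequence. A trailing partial window is dropped.
    """
    return [code_seq[i:i + window_len]
            for i in range(0, len(code_seq) - window_len + 1, step)]


def num_windows(seq_len, window_len, step):
    """M as a function of sequence length, window length, and step size."""
    if seq_len < window_len:
        return 0
    return (seq_len - window_len) // step + 1
```

Under these assumptions, a sequence of 10 time steps with a window of 4 and a step of 2 gives M = (10 - 4) // 2 + 1 = 4 feature codes.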

Based on feature code 1, feature code 2, feature code 3, ..., feature code M, the corresponding pose control vector 1, pose control vector 2, pose control vector 3, ..., pose control vector M can be obtained, which gives the sequence 450 of pose control vectors.

The sequence 450 of pose control vectors and the second code sequence 440 are aligned in time. Because each feature code in the second code sequence is obtained from at least one phoneme in the phoneme sequence, each control vector in the sequence 450 of pose control vectors is likewise obtained from at least one phoneme in the phoneme sequence. When the phoneme sequence corresponding to the text data is played while the interactive object is driven to act based on the sequence of pose control vectors, the interactive object can utter the speech corresponding to the text content while performing actions synchronized with that speech. This gives the target object the feeling of talking with the interactive object and improves the target object's interactive experience.

Assuming that feature codes start to be output from a predetermined timing in the first time window, the pose control vector values before that timing can be set to default values. That is, when the phoneme sequence first starts playing, the interactive object performs a default action, and after the predetermined timing it starts to be driven by the sequence of pose control vectors obtained from the first code sequence. Taking FIG. 4 as an example, feature code 1 starts to be output at time t0, and the default pose control vector applies before time t0.
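
The default-before-t0 behaviour can be sketched as a lookup by playback time. This is an assumption-laden illustration: times are integer milliseconds to keep the index arithmetic exact, each pose control vector is assumed to apply for one step, and holding the last vector after the sequence ends is a choice made here, not something the description states.

```python
def pose_at(t_ms, t0_ms, step_ms, pose_vectors, default_pose):
    """Return the pose control vector that applies at playback time t_ms.

    Before t0_ms the default pose is used. From t0_ms on, vector k applies
    during [t0_ms + k * step_ms, t0_ms + (k + 1) * step_ms). After the last
    vector, the last one is held (an assumption).
    """
    if t_ms < t0_ms or not pose_vectors:
        return default_pose
    index = min((t_ms - t0_ms) // step_ms, len(pose_vectors) - 1)
    return pose_vectors[index]
```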

The length of the time window is related to the amount of information contained in the feature code. When the time window contains more information, the recurrent neural network processing outputs more uniform results. If the time window is too long, the expression of the interactive object when it speaks cannot correspond to some of the characters. If the time window is too short, the expression of the interactive object when it speaks looks stiff. Therefore, the length of the time window is determined according to the minimum duration of the phonemes corresponding to the text data, so that the actions the interactive object is driven to perform are more strongly related to the speech.

The step size with which the time window is slid is related to the time interval (frequency) at which pose control vectors are acquired, that is, to the frequency at which the interactive object is driven to act. Setting the length and step size of the time window according to the actual interaction scene makes the expressions and actions of the interactive object more strongly related to the speech, and more vivid and natural.

In some embodiments, when the time interval between phonemes in the phoneme sequence is greater than a predetermined threshold, the interactive object is driven to act based on a predetermined pose control vector of the local region. That is, when the interactive character pauses for a relatively long time while speaking, the interactive object is driven to perform a predetermined action. For example, when the output speech has a longer pause, the interactive object can smile or sway its body slightly. This avoids having the interactive object stand upright without any expression during longer pauses, makes its speaking process more natural and smooth, and improves the target object's interactive experience with the interactive object.
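
Detecting the pauses that trigger a predetermined action can be sketched as a scan over adjacent phonemes. The `(start, end, label)` representation and the threshold values are assumptions used only for illustration.

```python
def pauses(phonemes, threshold):
    """Find gaps between consecutive phonemes that exceed `threshold`.

    phonemes: list of (start_time, end_time, label) sorted by start time
              (a hypothetical representation, times in seconds).
    Returns a list of (gap_start, gap_end) intervals during which a
    predetermined pose control vector could be applied.
    """
    return [(prev[1], cur[0])
            for prev, cur in zip(phonemes, phonemes[1:])
            if cur[0] - prev[1] > threshold]
```

For each returned interval, the driver would switch to the predetermined pose control vector (for example, a smile) instead of the vectors derived from the phonemes.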

In some embodiments, the voice data sequence includes a voice frame sequence, and acquiring the feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; acquiring, based on the first acoustic feature sequence, the acoustic feature vector corresponding to at least one voice frame; and obtaining, based on the acoustic feature vector, the feature information corresponding to the at least one voice frame.

In the embodiments of the present invention, the control parameters of at least one local region of the interactive object may be determined based on the acoustic features of the voice frame sequence, or based on other features of the voice frame sequence.

First, the acoustic feature sequence corresponding to the voice frame sequence is acquired. To distinguish it from the acoustic feature sequences mentioned later, the acoustic feature sequence corresponding to the voice frame sequence is called the first acoustic feature sequence.

In the embodiments of the present invention, the acoustic features may be features related to speech emotion, such as fundamental frequency features, common-peak (formant) features, and Mel-frequency cepstral coefficients (MFCC).

The first acoustic feature sequence is obtained by processing the whole voice frame sequence. Taking MFCC features as an example, windowing, fast Fourier transform, filtering, logarithmic processing, and discrete cosine transform are applied to each voice frame in the voice frame sequence to obtain the MFCC coefficients corresponding to each voice frame.

Because the first acoustic feature sequence is obtained by processing the whole voice frame sequence, it reflects the overall acoustic features of the voice data sequence.

In the embodiments of the present invention, the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence. Taking MFCC as an example, the first acoustic feature sequence includes the MFCC coefficients of each voice frame. The first acoustic feature sequence obtained from the voice frame sequence is shown in FIG. 5.

Next, the acoustic feature corresponding to at least one speech frame is obtained based on the first acoustic feature sequence.

When the first acoustic feature sequence includes an acoustic feature vector for each speech frame in the speech frame sequence, the same number of feature vectors as the at least one speech frame can be used as the acoustic feature of those speech frames. These feature vectors can form one feature matrix, and that feature matrix is the acoustic feature of the at least one speech frame.

Taking FIG. 5 as an example, N feature vectors in the first acoustic feature sequence form the acoustic feature of the corresponding N speech frames, where N is a positive integer. The first acoustic feature matrix may include a plurality of acoustic features, and the speech frames corresponding to the respective acoustic features may partially overlap.

Finally, the control vector of at least one partial region of the interactive object corresponding to the acoustic feature is obtained.

For the obtained acoustic feature corresponding to at least one speech frame, the control vector of at least one partial region can be obtained. The partial regions can be selected according to the movements of the interactive object that need to be controlled. For example, when the face and limbs of the interactive object need to be controlled to move at the same time, the control vectors of all partial regions can be obtained; when only the facial expression of the interactive object needs to be controlled, the control vector of the partial region corresponding to the face can be obtained.

While the speech data sequence is played, the interactive object is driven to perform movements based on the control vector corresponding to each acoustic feature obtained from the first acoustic feature sequence. In this way, while the terminal device outputs the speech, the interactive object performs movements that match the output speech, where the movements include facial movements, expressions, limb movements, and the like. This gives the target object the impression that the interactive object is speaking. Because the control vectors are related to the acoustic features of the output speech, driving the interactive object based on the control vectors adds an emotional element to its expressions and limb movements, makes its speaking process more natural and vivid, and improves the interactive experience of the target object.

In some embodiments, the acoustic feature corresponding to the at least one speech frame can be obtained by sliding a window over the first acoustic feature sequence.

A time window of a predetermined length is slid over the first acoustic feature sequence with a predetermined step size, and the acoustic feature vectors inside the time window are taken as the acoustic feature of the same number of corresponding speech frames. This yields an acoustic feature that corresponds jointly to those speech frames. After the window has been slid over the whole sequence, a second acoustic feature sequence can be obtained from the resulting plurality of acoustic features.

Taking the interactive object driving method shown in FIG. 5 as an example, the speech frame sequence contains 100 speech frames per second, the length of the time window is 1 s, and the step size is 0.04 s. Each feature vector in the first acoustic feature sequence corresponds to one speech frame, so the first acoustic feature sequence likewise contains 100 feature vectors per second. While the window is slid over the first acoustic feature sequence, each time the 100 feature vectors inside the time window are obtained, those 100 feature vectors are taken as the acoustic feature of the corresponding 100 speech frames. Moving the time window over the first acoustic feature sequence in steps of 0.04 s yields acoustic feature 1, corresponding to the 1st to 100th speech frames, then acoustic feature 2, corresponding to the 4th to 104th speech frames, and so on. After the first acoustic feature sequence has been processed in full, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, and these form the second acoustic feature sequence. Here M is a positive integer whose value is determined by the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window, and the step size.
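A minimal sketch of this window sliding, using the figures from the example (100 feature vectors per second, a 1 s window, a 0.04 s step). The function name and the list-of-lists representation are assumptions made for illustration. With these numbers the window holds 100 vectors and advances by 4 vectors per step, so for T feature vectors the sketch produces M = floor((T - 100) / 4) + 1 acoustic features. The code counts from 0, so its second window covers indices 4 to 103.

```python
def slide_windows(first_sequence, frames_per_second=100,
                  window_seconds=1.0, step_seconds=0.04):
    """Return the second acoustic feature sequence.

    Each element is the block of feature vectors inside one position of
    the time window, i.e. one "acoustic feature" (a feature matrix).
    """
    window = round(window_seconds * frames_per_second)  # 100 vectors
    step = round(step_seconds * frames_per_second)      # 4 vectors
    if window <= 0 or step <= 0:
        raise ValueError("window and step must cover at least one frame")

    second_sequence = []
    start = 0
    # Only full windows are kept; a trailing partial window is dropped
    # (an assumption, the text does not say how the tail is handled).
    while start + window <= len(first_sequence):
        second_sequence.append(first_sequence[start:start + window])
        start += step
    return second_sequence


# Usage: 3 seconds of one-dimensional dummy feature vectors.
first_sequence = [[float(i)] for i in range(300)]
second_sequence = slide_windows(first_sequence)
num_features = len(second_sequence)  # this is M
```

A longer window gives each acoustic feature more context, while a smaller step yields more acoustic features (and therefore more control vectors) per second.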

Based on acoustic feature 1, acoustic feature 2, ..., acoustic feature M, the corresponding control vector 1, control vector 2, ..., control vector M can be obtained respectively, which yields a sequence of control vectors.

As shown in FIG. 5, the sequence of control vectors is aligned in time with the second acoustic feature sequence, and acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence are each obtained from N feature vectors in the first acoustic feature sequence. Therefore, while the speech frames are played, the interactive object can be driven to perform movements based on the sequence of control vectors.

Assuming that the output of acoustic features starts at a predetermined time within the first time window, the control vector before that predetermined time can be set to a default value. That is, when playback of the speech frame sequence has just begun, the interactive object performs a default movement, and after the predetermined time the interactive object starts to be driven by the sequence of control vectors obtained from the first acoustic feature sequence.

Taking FIG. 5 as an example, output of acoustic feature 1 starts at time t0, and acoustic features are output at intervals of 0.04 s, the time corresponding to the step size: output of acoustic feature 2 starts at time t1, output of acoustic feature 3 starts at time t2, ..., and acoustic feature M is output at time t(M-1). Correspondingly, feature vector (i+1) corresponds to the period from ti to t(i+1), where i is an integer less than (M-1). Before time t0, the control vector is the default control vector.
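The timing relationship above can be sketched as a lookup from playback time to the control vector in effect. This is an illustrative sketch only: the function name, the use of integer milliseconds (to avoid floating-point rounding at the interval boundaries), and the choice to hold the last control vector after the sequence ends are all assumptions.

```python
def control_vector_at(time_ms, control_vectors, t0_ms,
                      step_ms=40, default=None):
    """Return the control vector in effect at playback time time_ms.

    control_vectors[i] is control vector (i + 1), which starts at
    t_i = t0_ms + i * step_ms and is held until t_(i+1).
    """
    if time_ms < t0_ms:
        # Before t0 the interactive object uses the default control vector.
        return default
    index = (time_ms - t0_ms) // step_ms
    # Assumption: after the last start time, keep the final vector.
    index = min(index, len(control_vectors) - 1)
    return control_vectors[index]


# Usage with placeholder control vectors and t0 at 1000 ms.
vectors = ["c1", "c2", "c3"]
before_start = control_vector_at(500, vectors, t0_ms=1000, default="default")
at_start = control_vector_at(1000, vectors, t0_ms=1000, default="default")
second_slot = control_vector_at(1040, vectors, t0_ms=1000, default="default")
```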

In embodiments of the present invention, while the speech data sequence is played, the interactive object is driven to perform movements based on the sequence of control vectors. As a result, the movements of the interactive object are synchronized with the output speech, the target object gets the impression that the interactive object is speaking, and the target object's interactive experience with the interactive object is improved.

The length of the time window is related to the amount of information contained in the acoustic feature. The longer the time window, the more information it contains, and the stronger the association between the driven movements of the interactive object and the speech. The step size of the window sliding is related to the time interval (frequency) at which control vectors are obtained, that is, to the frequency at which the interactive object is driven to perform movements. The length of the time window and the step size can be set according to the actual interaction scenario, so that the expressions and movements of the interactive object are more strongly associated with the speech and appear more vivid and natural.

In some embodiments, the acoustic feature includes L-dimensional Mel-frequency cepstral coefficients (MFCC), where L is a positive integer. MFCC represents the distribution of the energy of the speech signal over frequency ranges. A plurality of speech frame data in the speech frame sequence is transformed into the frequency domain, and a mel filter containing L sub-bands is applied to obtain the L-dimensional MFCC. By obtaining control vectors based on the MFCC of the speech data sequence and driving the interactive object to perform facial and limb movements based on those control vectors, an emotional element is added to the expressions and limb movements of the interactive object, its speaking process becomes more natural and vivid, and the target object's interactive experience with the interactive object is improved.
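For reference, one common way to write the computation behind L-dimensional MFCC is given below. The text does not state these formulas, so the mel-scale constants, the symbols X(k) for the frame's spectrum, H_l(k) for the l-th triangular mel filter, E_l for the energy in sub-band l, and c_n for the n-th coefficient are all assumptions taken from widely used conventions.

```latex
m(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \qquad
E_l = \sum_{k} H_l(k)\, \lvert X(k) \rvert^{2}, \quad l = 1, \dots, L, \qquad
c_n = \sum_{l=1}^{L} \log(E_l)\, \cos\!\left(\frac{\pi n \left(l - \tfrac{1}{2}\right)}{L}\right), \quad n = 0, \dots, L-1 .
```

Under this convention, a filter bank with L sub-bands gives L log-energies per frame, and the cosine transform maps them to at most L coefficients.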

In some embodiments, the feature information of the speech data unit is input to a pre-trained recurrent neural network to obtain the control parameter values of the interactive object corresponding to that feature information. The recurrent neural network is a temporal recurrent neural network: it can learn the history of the input feature information and output control parameters based on the sequence of speech units. For example, the control parameters may be facial pose control parameters or the control vector of at least one partial region.

In embodiments of the present invention, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the speech data unit. The network fuses the relevant historical feature information with the current feature information, so that historical control parameters influence the change of the current control parameters. This makes the expression changes and limb movements of the interactive character smoother and more natural.
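The effect of fusing history with the current input can be illustrated with a very small recurrent network. The sketch below is not the network of the invention: the class name, the simple Elman-style recurrence, the layer sizes, and the hand-picked weights are all hypothetical, and no training is performed. It only shows that the same feature vector yields a different control parameter value depending on what preceded it.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TinyRecurrentNet:
    """h_t = tanh(W_in x_t + W_rec h_(t-1)),  y_t = W_out h_t."""

    def __init__(self, w_in, w_rec, w_out):
        self.w_in = w_in
        self.w_rec = w_rec
        self.w_out = w_out

    def run(self, features):
        hidden = [0.0] * len(self.w_rec)
        outputs = []
        for x in features:
            pre = [a + b for a, b in zip(matvec(self.w_in, x),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]      # carries the history
            outputs.append(matvec(self.w_out, hidden))
        return outputs


# Hypothetical weights: 2 input features, 2 hidden units, 1 control value.
net = TinyRecurrentNet(
    w_in=[[0.5, -0.2], [0.1, 0.4]],
    w_rec=[[0.3, 0.0], [0.0, 0.3]],
    w_out=[[1.0, -1.0]],
)

current = [0.0, 1.0]
without_history = net.run([current])
with_history = net.run([[1.0, 0.0], current])
```

The last output of `with_history` differs from the single output of `without_history` even though the final input is identical, because the hidden state still holds information about the earlier input.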

In some embodiments, the recurrent neural network can be trained by the following method.

First, feature information samples are obtained. For example, the feature information samples can be obtained by the following method.

A video segment of a character speaking is obtained, and the corresponding speech segment of the character is extracted from the video segment. For example, a video segment of a real person speaking can be obtained. The video segment can be sampled to obtain a plurality of first image frames containing the character, and the speech segment can be sampled to obtain a plurality of speech frames.

The feature information corresponding to a speech frame can be obtained based on the speech data units contained in the speech frame that corresponds to the first image frame.

The first image frame can be converted into a second image frame containing the interactive object, and the control parameter values of the interactive object corresponding to the second image frame can be obtained.

The feature information corresponding to the first image frame can then be labeled with the control parameter values to obtain a feature information sample.
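The pairing of speech-frame feature information with control parameter labels can be sketched as follows. The text does not say how an image frame is matched to its speech frame, so nearest-timestamp matching is an assumption here, as are the function name and the tuple layout. The control parameter values are assumed to have been extracted already from the second image frames.

```python
from bisect import bisect_left


def build_samples(audio_features, image_labels):
    """Pair each labeled image frame with the nearest speech frame in time.

    audio_features: list of (time_ms, feature_info), sorted by time_ms.
    image_labels:   list of (time_ms, control_parameter_values).
    Returns a list of (feature_info, control_parameter_values) samples.
    """
    if not audio_features:
        raise ValueError("at least one speech frame is required")
    times = [t for t, _ in audio_features]
    samples = []
    for t_img, params in image_labels:
        pos = bisect_left(times, t_img)
        # Compare the neighbours on both sides of the insertion point.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(times)]
        nearest = min(candidates, key=lambda i: abs(times[i] - t_img))
        samples.append((audio_features[nearest][1], params))
    return samples


# Usage: speech frames every 10 ms, three labeled image frames.
audio_features = [(0, "a0"), (10, "a1"), (20, "a2"), (30, "a3"), (40, "a4")]
image_labels = [(0, {"jaw": 0.1}), (14, {"jaw": 0.5}), (40, {"jaw": 0.2})]
samples = build_samples(audio_features, image_labels)
```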

In some embodiments, the feature information includes the feature code of a phoneme, and the control parameters include facial muscle control coefficients. Using the method for obtaining feature information samples described above, the feature code of the phoneme corresponding to the first image frame is labeled with the obtained facial muscle control coefficients, which yields a feature information sample corresponding to the phoneme's feature code.

In some embodiments, the feature information includes the feature code of a phoneme, and the control parameters include at least one partial control vector of the interactive object. Using the method for obtaining feature information samples described above, the feature code of the phoneme corresponding to the first image frame is labeled with the obtained at least one partial control vector, which yields a feature information sample corresponding to the phoneme's feature code.

In some embodiments, the feature information includes the acoustic feature of a speech frame, and the control parameters include at least one partial control vector of the interactive object. Using the method for obtaining feature information samples described above, the acoustic feature of the speech frame corresponding to the first image frame is labeled with the obtained at least one partial control vector, which yields a feature information sample corresponding to the acoustic feature of the speech frame.

Those skilled in the art should understand that the feature information samples are not limited to those described above, and that corresponding feature information samples can be obtained for the various features of each type of speech data unit.

After the feature information samples are obtained, an initial recurrent neural network is trained on them. When the change in the network loss satisfies a convergence condition, training ends and the recurrent neural network is obtained. The network loss includes the difference between the control parameter values predicted by the recurrent neural network and the labeled control parameter values.
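The loss and the stopping rule can be sketched as below. To stay short, the sketch trains a plain linear model rather than a recurrent neural network, so it illustrates only two things from the text: a loss measuring the difference between predicted and labeled control parameter values (mean squared error is an assumption), and stopping once the change in loss between epochs falls below a tolerance (the tolerance, learning rate, and function names are also assumptions).

```python
def mse(predicted, labeled):
    """Mean squared difference between predictions and labels."""
    return sum((p - q) ** 2 for p, q in zip(predicted, labeled)) / len(labeled)


def train_until_converged(samples, lr=0.1, tol=1e-9, max_epochs=10000):
    """samples: list of (feature_vector, labeled_control_value).

    Returns (weights, final_loss, epochs_run).
    """
    dim = len(samples[0][0])
    weights = [0.0] * dim
    labels = [y for _, y in samples]
    prev_loss = None
    for epoch in range(max_epochs):
        preds = [sum(w * x for w, x in zip(weights, f)) for f, _ in samples]
        loss = mse(preds, labels)
        # Convergence condition: the loss has stopped changing.
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            return weights, loss, epoch
        prev_loss = loss
        # Gradient of the mean squared error with respect to the weights.
        grads = [0.0] * dim
        for (f, y), p in zip(samples, preds):
            for j, x in enumerate(f):
                grads[j] += 2.0 * (p - y) * x / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, prev_loss, max_epochs


# Usage: labels generated by the weights [2, -1], which training should recover.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
weights, final_loss, epochs_run = train_until_converged(samples)
```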

In embodiments of the present invention, the video segment of a character is split into a plurality of corresponding first image frames and a plurality of speech frames, and the first image frames containing the real person are converted into second image frames containing the interactive object in order to obtain the control parameter values corresponding to the feature information of at least one speech frame. This improves the correspondence between the feature information and the control parameter values, yields high-quality feature information samples, and brings the pose of the interactive object closer to the real pose of the corresponding character.

FIG. 6 is a schematic diagram showing the structure of an apparatus for driving an interactive object according to at least one embodiment of the present invention. As shown in FIG. 6, the apparatus comprises: a first acquisition unit 601 configured to obtain drive data of the interactive object and determine the drive mode of the drive data; a second acquisition unit 602 configured to obtain, in response to the drive mode, control parameter values of the interactive object based on the drive data; and a drive unit 603 configured to control the pose of the interactive object based on the control parameter values.

In some embodiments, the apparatus further comprises an output unit configured to control the display device to output speech and/or display text based on the drive data.

In some embodiments, when determining the drive mode corresponding to the drive data, the first acquisition unit is specifically configured to: obtain, according to the type of the drive data, a speech data sequence corresponding to the drive data, where the speech data sequence includes a plurality of speech data units; and, when target data is detected in the speech data units, determine the drive mode of the drive data to be a first drive mode, where the target data corresponds to predetermined control parameter values of the interactive object. Obtaining the control parameter values of the interactive object based on the drive data in response to the drive mode includes: in response to the first drive mode, using the predetermined control parameter values corresponding to the target data as the control parameter values of the interactive object.

In some embodiments, the target data includes a keyword or a key character, and the keyword or key character corresponds to predetermined control parameter values of a predetermined movement of the interactive object; or the target data includes a syllable, and the syllable corresponds to predetermined control parameter values of a predetermined mouth-shape movement of the interactive object.

In some embodiments, when recognizing the drive mode of the drive data, the first acquisition unit is specifically configured to: obtain, according to the type of the drive data, a speech data sequence corresponding to the drive data, where the speech data sequence includes a plurality of speech data units; and, when no target data is detected in the speech data units, determine the drive mode of the drive data to be a second drive mode, where the target data corresponds to predetermined control parameter values of the interactive object. Obtaining the control parameter values of the interactive object based on the drive data in response to the drive mode includes: in response to the second drive mode, obtaining the feature information of at least one speech data unit in the speech data sequence, and obtaining the control parameter values of the interactive object corresponding to the feature information.
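The choice between the two drive modes can be sketched as a small dispatch routine. Everything concrete in it is hypothetical: the function names, the dictionary of preset control parameter values keyed by target data, and the `feature_fn` and `predict_fn` callables that stand in for feature extraction and for the mapping from feature information to control parameter values.

```python
def determine_mode(units, preset_params):
    """Return ("first", matched_units) if any unit is target data,
    otherwise ("second", [])."""
    hits = [u for u in units if u in preset_params]
    return ("first", hits) if hits else ("second", [])


def control_values(units, preset_params, feature_fn, predict_fn):
    """Obtain control parameter values according to the drive mode."""
    mode, hits = determine_mode(units, preset_params)
    if mode == "first":
        # First drive mode: use the preset values tied to the target data.
        return mode, [preset_params[u] for u in hits]
    # Second drive mode: derive values from each unit's feature information.
    return mode, [predict_fn(feature_fn(u)) for u in units]


# Hypothetical preset: the keyword "wave" maps to a raised-arm parameter.
preset_params = {"wave": {"arm_raise": 1.0}}


def feature_fn(unit):
    # Stand-in feature: the length of the speech data unit.
    return len(unit)


def predict_fn(feature):
    # Stand-in for the trained network.
    return {"mouth_open": feature / 10}


first_result = control_values(["please", "wave"], preset_params,
                              feature_fn, predict_fn)
second_result = control_values(["please", "sit"], preset_params,
                               feature_fn, predict_fn)
```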

In some embodiments, the speech data sequence includes a phoneme sequence. When obtaining the feature information of at least one speech data unit in the speech data sequence, the second acquisition unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain a first code sequence corresponding to the phoneme sequence; obtain, based on the first code sequence, a feature code corresponding to at least one phoneme; and obtain the feature information of the at least one phoneme based on the feature code.

In some embodiments, the speech data sequence includes a speech frame sequence. When obtaining the feature information of at least one speech data unit in the speech data sequence, the second acquisition unit is specifically configured to: obtain a first acoustic feature sequence corresponding to the speech frame sequence, where the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence; obtain, based on the first acoustic feature sequence, an acoustic feature vector corresponding to at least one speech frame; and obtain the feature information corresponding to the at least one speech frame based on the acoustic feature vector.

In some embodiments, the control parameters of the interactive object include facial pose parameters, the facial pose parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control the motion state of at least one facial muscle. When obtaining the control parameter values of the interactive object based on the drive data, the second acquisition unit is specifically configured to obtain the facial muscle control coefficients of the interactive object based on the drive data. The drive unit is specifically configured to drive the interactive object, based on the obtained facial muscle control coefficients, to perform facial movements that match the drive data. The apparatus further comprises a limb drive unit configured to obtain drive data of a body pose associated with the facial pose parameters and to drive the interactive object to perform limb movements based on the drive data of the body pose associated with the facial pose parameter values.

In some embodiments, the control parameters of the interactive object include the control vector of at least one partial region of the interactive object. When obtaining the control parameter values of the interactive object based on the drive data, the second acquisition unit is specifically configured to obtain the control vector of at least one partial region of the interactive object based on the drive data. The drive unit is specifically configured to control the facial movements and/or limb movements of the interactive object based on the obtained control vector of the at least one partial region.

According to one aspect of the present invention, an electronic device is provided. The device comprises a memory and a processor, the memory stores computer instructions executable on the processor, and the processor implements the method for driving an interactive object according to any embodiment provided by the present invention when executing the computer instructions.

According to one aspect of the present invention, a computer-readable storage medium storing a computer program is provided. When the program is executed by a processor, the method for driving an interactive object according to any embodiment provided by the present invention is implemented.

At least one embodiment of this specification further provides an electronic device. As shown in FIG. 7, the device comprises a memory and a processor, the memory stores computer instructions executable on the processor, and the processor implements the method for driving an interactive object according to any embodiment of the present invention when executing the computer instructions.

At least one embodiment of this specification further provides a computer-readable storage medium storing a computer program. When the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present invention is implemented.

Those skilled in the art should understand that one or more embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. In addition, one or more embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The embodiments of the present invention are described in a progressive manner. Identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively briefly because it is basically similar to the method embodiment; for the relevant parts, refer to the corresponding description of the method embodiment.

Specific embodiments of the present invention have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Embodiments of the subject matter and functional operations described in the present invention may be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in the present invention and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, for example a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The processes and logic flows described in the present invention may be performed by one or more programmable computers executing one or more computer programs, which carry out the corresponding functions by operating on input data and generating output. The processes and logic flows may also be performed by special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus may also be implemented as special-purpose logic circuitry.

Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such mass storage devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, for example a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

Although the present invention contains many specific implementation details, these should not be construed as limiting the scope of the invention or of what is claimed, but rather as descriptions of features of particular embodiments of the invention. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of the various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together into a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims may be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

The above are merely some embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of one or more embodiments of the present invention.

Claims

A method for driving an interactive object displayed on a display device, the method comprising:
acquiring driving data of the interactive object and determining a driving mode of the driving data;
acquiring, in response to the driving mode, a control parameter value of the interactive object based on the driving data; and
controlling a pose of the interactive object based on the control parameter value.

The method for driving an interactive object according to claim 1, further comprising controlling the display device to output voice and/or display text based on the driving data.

The method for driving an interactive object according to claim 1 or 2, wherein determining the driving mode corresponding to the driving data comprises:
acquiring, based on a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence including a plurality of voice data units; and
determining, in response to detecting that a voice data unit includes target data, that the driving mode of the driving data is a first driving mode, the target data corresponding to a predetermined control parameter value of the interactive object,
and wherein acquiring, in response to the driving mode, the control parameter value of the interactive object based on the driving data comprises:
using, in response to the first driving mode, the predetermined control parameter value corresponding to the target data as the control parameter value of the interactive object.

The method for driving an interactive object according to claim 3, wherein the target data includes a keyword or a key character, the keyword or the key character corresponding to a predetermined control parameter value of a predetermined action of the interactive object; or
the target data includes a syllable, the syllable corresponding to a predetermined control parameter value of a predetermined mouth-shape action of the interactive object.

The method for driving an interactive object according to any one of claims 1 to 4, wherein determining the driving mode of the driving data comprises:
acquiring, based on a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence including a plurality of voice data units; and
determining, in response to not detecting target data in the voice data units, that the driving mode of the driving data is a second driving mode, the target data corresponding to a predetermined control parameter value of the interactive object,
and wherein acquiring, in response to the driving mode, the control parameter value of the interactive object based on the driving data comprises:
acquiring, in response to the second driving mode, feature information of at least one voice data unit in the voice data sequence; and
acquiring the control parameter value of the interactive object corresponding to the feature information.
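
The mode selection recited in the preceding claims can be illustrated with a minimal sketch. It is not the applicant's implementation: the lookup table `PRESET_PARAMS`, the mode labels, and the callables `feature_fn` and `model_fn` are all hypothetical names introduced here, and voice data units are modelled as plain strings.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical table: target data (e.g. keywords) -> predetermined control parameter values.
PRESET_PARAMS: Dict[str, List[float]] = {
    "wave": [0.0, 1.0, 0.5],
    "nod": [1.0, 0.0, 0.2],
}

FIRST_MODE = "first"    # target data detected: use preset values
SECOND_MODE = "second"  # no target data: derive values from features


def determine_drive_mode(units: Sequence[str]) -> Tuple[str, List[str]]:
    """Return the driving mode and the voice data units that matched target data."""
    hits = [u for u in units if u in PRESET_PARAMS]
    return (FIRST_MODE if hits else SECOND_MODE), hits


def get_control_params(
    units: Sequence[str],
    feature_fn: Callable[[str], List[float]],
    model_fn: Callable[[List[float]], List[float]],
) -> Tuple[str, List[List[float]]]:
    """Pick control parameter values according to the detected driving mode."""
    mode, hits = determine_drive_mode(units)
    if mode == FIRST_MODE:
        # First mode: look up the predetermined values for each detected target.
        return mode, [PRESET_PARAMS[u] for u in hits]
    # Second mode: extract features per unit and map them to parameter values.
    return mode, [model_fn(feature_fn(u)) for u in units]
```

In this sketch `model_fn` stands in for whatever mapping from feature information to control parameter values is used; any callable with that shape can be passed in.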

The method for driving an interactive object according to claim 5, wherein the voice data sequence includes a phoneme sequence, and
acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
performing feature encoding on the phoneme sequence to obtain a first code sequence corresponding to the phoneme sequence;
acquiring a feature code corresponding to at least one phoneme based on the first code sequence; and
obtaining the feature information of the at least one phoneme based on the feature code.
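
One possible reading of the feature encoding step is sketched below. The claim does not specify the encoding, so this example assumes a simple binary scheme: each distinct phoneme gets a code sequence with 1 at the positions where it occurs and 0 elsewhere, and the feature code at a position is the column of values across all phonemes. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def encode_phonemes(phonemes: Sequence[str]) -> Dict[str, List[int]]:
    """Build one binary code sequence per distinct phoneme (assumed encoding)."""
    inventory = sorted(set(phonemes))
    return {p: [1 if q == p else 0 for q in phonemes] for p in inventory}


def feature_code_at(codes: Dict[str, List[int]], t: int) -> List[int]:
    """Collect the code values of every phoneme at position t, in sorted phoneme order."""
    return [codes[p][t] for p in sorted(codes)]
```

Further processing of the feature codes (for example smoothing over time) could be layered on top, but is left out because the claim text does not describe it.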

The method for driving an interactive object according to claim 5, wherein the voice data sequence includes a voice frame sequence, and
acquiring the feature information of at least one voice data unit in the voice data sequence comprises:
acquiring a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
acquiring an acoustic feature vector corresponding to at least one voice frame based on the first acoustic feature sequence; and
obtaining the feature information corresponding to the at least one voice frame based on the acoustic feature vector.
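
As a rough illustration of an acoustic feature sequence with one vector per voice frame, the sketch below splits a sampled signal into frames and computes two simple features per frame. The choice of features (mean energy and zero-crossing rate) is an assumption made for brevity; the claim does not name the acoustic features, and a real system might use something richer. All function names are hypothetical.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split the samples into frames of frame_len samples, advancing by hop samples."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, hop)]


def acoustic_feature_vector(frame: Sequence[float]) -> List[float]:
    """Return [mean energy, zero-crossing rate] for one frame (needs at least 2 samples)."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def acoustic_feature_sequence(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """One acoustic feature vector per voice frame, in frame order."""
    return [acoustic_feature_vector(f) for f in frame_signal(samples, frame_len, hop)]
```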

The method for driving an interactive object according to any one of claims 1 to 7, wherein the control parameter of the interactive object includes a facial pose parameter, the facial pose parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control a motion state of at least one facial muscle,
acquiring the control parameter value of the interactive object based on the driving data comprises:
acquiring the facial muscle control coefficient of the interactive object based on the driving data, and
controlling the pose of the interactive object based on the control parameter value comprises:
driving the interactive object, based on the acquired facial muscle control coefficient, to perform a facial movement matching the driving data.

The method for driving an interactive object according to claim 8, further comprising:
acquiring driving data of a body pose associated with the facial pose parameter; and
driving the interactive object to perform a limb movement based on the driving data of the body pose associated with the facial pose parameter value.

The control parameter of the interactive object includes a control vector of at least one partial region of the interactive object.
Acquiring the control parameter value of the interactive object based on the driving data
includes acquiring the control vector of the at least one partial region of the interactive object based on the driving data.
Controlling the pose of the interactive object based on the control parameter value
6. How to drive an interactive object.

The method for driving an interactive object according to claim 5, wherein acquiring the control parameter value of the interactive object corresponding to the feature information comprises:
inputting the feature information into a pre-trained recurrent neural network to obtain the control parameter value of the interactive object corresponding to the feature information.
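
To make the recurrent mapping from feature information to control parameter values concrete, here is a minimal forward pass of a plain (Elman-style) recurrent network in pure Python. The claim only says "pre-trained recurrent neural network", so the architecture, the weight names `w_in`, `w_rec`, `w_out`, `b_out`, and the tanh activation are assumptions for illustration; in practice the weights would come from training rather than being written by hand.

```python
import math
from typing import List, Sequence


def rnn_forward(
    features: Sequence[Sequence[float]],
    w_in: Sequence[Sequence[float]],   # hidden_size x input_size
    w_rec: Sequence[Sequence[float]],  # hidden_size x hidden_size
    w_out: Sequence[Sequence[float]],  # output_size x hidden_size
    b_out: Sequence[float],            # output_size
) -> List[List[float]]:
    """Map a sequence of feature vectors to a sequence of control parameter vectors."""
    hidden = [0.0] * len(w_rec)
    outputs: List[List[float]] = []
    for x in features:
        # The new hidden state depends on the current input and the previous hidden state.
        hidden = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            )
            for j in range(len(hidden))
        ]
        # Linear read-out of the control parameter values for this time step.
        outputs.append(
            [sum(w_out[m][j] * hidden[j] for j in range(len(hidden))) + b_out[m]
             for m in range(len(w_out))]
        )
    return outputs
```

Because the hidden state is carried from one step to the next, each output can depend on earlier feature vectors as well as the current one, which is the property that makes a recurrent model a natural fit for sequential voice features.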

An apparatus for driving an interactive object displayed on a display device, the apparatus comprising:
a first acquisition unit configured to acquire driving data of the interactive object and determine a driving mode of the driving data;
a second acquisition unit configured to acquire, in response to the driving mode, a control parameter value of the interactive object based on the driving data; and
a driving unit configured to control a pose of the interactive object based on the control parameter value.

The apparatus for driving an interactive object according to claim 12, further comprising an output unit configured to control the display device to output voice and/or display text based on the driving data.

The apparatus for driving an interactive object according to claim 12, wherein, when determining the driving mode corresponding to the driving data, the first acquisition unit is configured to:
acquire, based on a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence including a plurality of voice data units; and
determine, in response to detecting that a voice data unit includes target data, that the driving mode of the driving data is a first driving mode, the target data corresponding to a predetermined control parameter value of the interactive object,
wherein acquiring, in response to the driving mode, the control parameter value of the interactive object based on the driving data comprises:
using, in response to the first driving mode, the predetermined control parameter value corresponding to the target data as the control parameter value of the interactive object,
and wherein the target data includes a keyword or a key character, the keyword or the key character corresponding to a predetermined control parameter value of a predetermined action of the interactive object; or
the target data includes a syllable, the syllable corresponding to a predetermined control parameter value of a predetermined mouth-shape action of the interactive object.

The apparatus for driving an interactive object according to any one of claims 12 to 14, wherein, when determining the driving mode of the driving data, the first acquisition unit is configured to:
acquire, based on a type of the driving data, a voice data sequence corresponding to the driving data, the voice data sequence including a plurality of voice data units; and
determine, in response to not detecting target data in the voice data units, that the driving mode of the driving data is a second driving mode, the target data corresponding to a predetermined control parameter value of the interactive object,
and wherein acquiring, in response to the driving mode, the control parameter value of the interactive object based on the driving data comprises:
acquiring, in response to the second driving mode, feature information of at least one voice data unit in the voice data sequence; and
acquiring the control parameter value of the interactive object corresponding to the feature information.

The apparatus for driving an interactive object according to claim 15, wherein the voice data sequence includes a phoneme sequence, and,
when acquiring the feature information of at least one voice data unit in the voice data sequence,
the second acquisition unit is configured to:
perform feature encoding on the phoneme sequence to obtain a first code sequence corresponding to the phoneme sequence;
acquire a feature code corresponding to at least one phoneme based on the first code sequence; and
obtain the feature information of the at least one phoneme based on the feature code;
or
the voice data sequence includes a voice frame sequence, and,
when acquiring the feature information of at least one voice data unit in the voice data sequence,
the second acquisition unit is configured to:
acquire a first acoustic feature sequence corresponding to the voice frame sequence, the first acoustic feature sequence including an acoustic feature vector corresponding to each voice frame in the voice frame sequence;
acquire an acoustic feature vector corresponding to at least one voice frame based on the first acoustic feature sequence; and
obtain the feature information corresponding to the at least one voice frame based on the acoustic feature vector.

The apparatus for driving an interactive object according to any one of claims 12 to 16, wherein the control parameter of the interactive object includes a facial pose parameter, the facial pose parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control a motion state of at least one facial muscle,
when acquiring the control parameter value of the interactive object based on the driving data,
the second acquisition unit is configured to
acquire the facial muscle control coefficient of the interactive object based on the driving data,
the driving unit is configured to
drive the interactive object, based on the acquired facial muscle control coefficient, to perform a facial movement matching the driving data, and
the apparatus further comprises a limb driving unit configured to acquire driving data of a body pose associated with the facial pose parameter and to drive the interactive object to perform a limb movement based on the driving data of the body pose associated with the facial pose parameter value.

The apparatus for driving an interactive object according to any one of claims 12 to 16, wherein the control parameter of the interactive object includes a control vector of at least one partial region of the interactive object,
when acquiring the control parameter value of the interactive object based on the driving data,
the second acquisition unit is configured to
acquire the control vector of the at least one partial region of the interactive object based on the driving data, and
the driving unit is configured to
control a facial movement and/or a limb movement of the interactive object based on the acquired control vector of the at least one partial region.

An electronic device comprising a memory and a processor, wherein
the memory stores computer instructions executable on the processor, and
the processor implements the method according to any one of claims 1 to 11 when executing the computer instructions.

A computer-readable storage medium storing a computer program, wherein
the method according to any one of claims 1 to 11 is implemented when the computer program is executed by a processor.