JP2022531057A

JP2022531057A - Method, apparatus, device, and recording medium for driving an interactive object

Info

Publication number: JP2022531057A
Application number: JP2021549867A
Authority: JP
Inventors: 文岩 ▲呉▼; 潜溢 ▲呉▼; 晨 ▲錢▼; 林森宋
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2020-03-31
Filing date: 2020-11-18
Publication date: 2022-07-06
Also published as: WO2021196643A1; SG11202109464YA; CN111459450A; TW202138993A; TWI766499B; KR20210124312A

Abstract

The present invention discloses a method, apparatus, device, and recording medium for driving an interactive object displayed on a display device. The method includes: acquiring a phoneme sequence corresponding to voice driving data of the interactive object; acquiring pose parameter values of the interactive object that match the phoneme sequence; and controlling the pose of the interactive object displayed on the display device based on the pose parameter values. [Selected drawing] FIG. 2

Description

<Cross-reference to related applications>
The present invention claims priority to the Chinese patent application with application number 2020102457619, filed on March 31, 2020, the entire contents of which are incorporated herein by reference.
The present invention relates to the field of computer technology, and specifically to a method, apparatus, device, and recording medium for driving an interactive object.

Human-computer interaction mostly takes input through keystrokes, touch, and voice, and responds by displaying images, text, or virtual characters on a display screen. At present, virtual characters are mostly improvements built on voice assistants.

Embodiments of the present invention provide a technical solution for driving an interactive object.

According to one aspect of the present invention, a method for driving an interactive object displayed on a display device is provided. The method includes: acquiring a phoneme sequence corresponding to voice driving data of the interactive object; acquiring pose parameter values of the interactive object that match the phoneme sequence; and controlling the pose of the interactive object displayed on the display device based on the pose parameter values. In combination with any embodiment provided by the present invention, the method further includes controlling the display device to output speech and/or text based on the phoneme sequence.

In combination with any embodiment provided by the present invention, acquiring the pose parameter values of the interactive object that match the phoneme sequence includes: performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and acquiring the pose parameter values of the interactive object that correspond to the feature information of the phoneme sequence.

In combination with any embodiment provided by the present invention, performing feature encoding on the phoneme sequence to obtain the feature information of the phoneme sequence includes: generating, for each of the multiple kinds of phonemes contained in the phoneme sequence, a code sequence for that phoneme; obtaining feature information of the code sequence of each phoneme based on the code values of the code sequence corresponding to that phoneme and the durations corresponding to the multiple kinds of phonemes in the phoneme sequence; and obtaining the feature information of the phoneme sequence based on the feature information of the code sequences corresponding to the multiple kinds of phonemes.

In combination with any embodiment provided by the present invention, generating a code sequence for each of the multiple kinds of phonemes contained in the phoneme sequence includes: detecting, for each time point, whether the phoneme corresponds to that time point; and obtaining the code sequence corresponding to the phoneme by setting the code value at each time point to which the phoneme corresponds to a first value and the code value at each time point to which it does not correspond to a second value.
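The per-phoneme coding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the phoneme sequence is assumed to arrive as (phoneme, duration) pairs on a discrete time grid, and the first and second values are assumed to be 1 and 0, which the text does not fix. The function name and input format are hypothetical.

```python
from typing import Dict, List, Tuple


def build_code_sequences(
    phonemes: List[Tuple[str, int]],  # (phoneme, duration in time steps); assumed format
    first_value: float = 1.0,         # assumed code value where the phoneme is present
    second_value: float = 0.0,        # assumed code value where the phoneme is absent
) -> Dict[str, List[float]]:
    """Return one code sequence per distinct phoneme over the whole timeline."""
    total_steps = sum(duration for _, duration in phonemes)
    # Start every phoneme's sequence filled with the "absent" value.
    codes = {name: [second_value] * total_steps for name, _ in phonemes}
    t = 0
    for name, duration in phonemes:
        # Mark the time points covered by this occurrence of the phoneme.
        for step in range(t, t + duration):
            codes[name][step] = first_value
        t += duration
    return codes


codes = build_code_sequences([("j", 2), ("i", 3), ("j", 1)])
```

Each phoneme kind gets a sequence as long as the whole utterance, so a phoneme that occurs twice (here "j") has two separate runs of the first value in a single sequence.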

In combination with any embodiment provided by the present invention, obtaining the feature information of the code sequences corresponding to the multiple kinds of phonemes, based on the code values of those code sequences and the durations corresponding to the multiple kinds of phonemes, includes: for each of the multiple kinds of phonemes, applying a Gaussian filter to the code sequence corresponding to the phoneme, that is, performing a Gaussian convolution on the phoneme's consecutive values over time, to obtain the feature information of the code sequence corresponding to that phoneme.
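A minimal sketch of such a Gaussian convolution over one code sequence, in plain Python. The kernel width (`sigma`), the kernel radius, and the zero padding at the sequence edges are assumptions chosen for illustration; the text does not specify them.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(code_sequence: List[float], sigma: float = 1.0, radius: int = 2) -> List[float]:
    """Convolve a code sequence with a Gaussian kernel (zero padding at the edges)."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(code_sequence)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius          # index of the neighbouring time point
            if 0 <= j < n:              # points outside the sequence count as 0
                acc += w * code_sequence[j]
        smoothed.append(acc)
    return smoothed


features = gaussian_smooth([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

The smoothing turns the abrupt 0/1 transitions of the code sequence into gradual ramps, so the feature value rises before a phoneme starts and decays after it ends.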

In combination with any embodiment provided by the present invention, the pose parameters include facial pose parameters, the facial pose parameters include facial muscle control coefficients, and each facial muscle control coefficient is used to control the motion state of at least one facial muscle. Controlling the pose of the interactive object displayed on the display device based on the pose parameter values includes: driving the interactive object, based on the facial muscle control coefficient values that match the phoneme sequence, to make facial movements that match each phoneme in the phoneme sequence.

In combination with any embodiment provided by the present invention, the method further includes acquiring driving data of a body pose associated with the facial pose parameter values. Controlling the pose of the interactive object displayed on the display device based on the pose parameter values then includes: driving the interactive object to make limb movements based on the driving data of the body pose associated with the facial pose parameter values.

In combination with any embodiment provided by the present invention, acquiring the pose parameter values of the interactive object that correspond to the feature information of the phoneme sequence includes: sampling the feature information of the phoneme sequence at a predetermined time interval to obtain sampled feature information corresponding to a first sampling time; and inputting the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter values of the interactive object that correspond to the sampled feature information.
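The sampling step can be sketched as below. This is a hedged illustration only: it assumes the feature information is stored as one equal-length feature sequence per phoneme, that the sampling interval is expressed in time steps, and that the sample at each sampling time is simply the vector of all per-phoneme values at that time. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def sample_features(
    features: Dict[str, List[float]],  # per-phoneme feature sequences of equal length (assumed)
    interval: int,                     # sampling interval in time steps (assumed unit)
) -> List[Tuple[int, List[float]]]:
    """Return (sampling time, feature vector) pairs taken every `interval` steps."""
    names = sorted(features)  # fixed phoneme order so every vector has the same layout
    length = len(features[names[0]]) if names else 0
    return [(t, [features[name][t] for name in names]) for t in range(0, length, interval)]


samples = sample_features(
    {"i": [0.0, 0.2, 0.9, 0.4], "j": [1.0, 0.7, 0.1, 0.0]},
    interval=2,
)
```

Each sampled vector would then be the input to the pre-trained neural network for its sampling time.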

In combination with any embodiment provided by the present invention, the neural network includes a long short-term memory (LSTM) network and a fully connected network. Inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter values of the interactive object that correspond to the sampled feature information includes: inputting the sampled feature information corresponding to the first sampling time into the LSTM network, which outputs associated feature information based on the sampled feature information preceding the first sampling time; and inputting the associated feature information into the fully connected network and determining the pose parameter values corresponding to the associated feature information based on the classification result of the fully connected network, where each class in the classification result corresponds to one set of pose parameter values.
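The last step, turning a classification result into one set of pose parameter values, can be sketched as a table lookup. The sketch below deliberately omits the LSTM and fully connected layers and starts from their output scores; the table contents, the number of classes, and the use of the highest score as the classification result are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical table: each class index maps to one set of pose parameter values.
POSE_TABLE: Dict[int, List[float]] = {
    0: [0.0, 0.0, 0.0],  # e.g. mouth closed
    1: [0.8, 0.1, 0.0],  # e.g. mouth open
    2: [0.2, 0.6, 0.3],  # e.g. lips rounded
}


def pose_from_scores(class_scores: List[float]) -> List[float]:
    """Pick the highest-scoring class and return its set of pose parameter values."""
    best_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
    return POSE_TABLE[best_class]


pose = pose_from_scores([0.1, 2.3, -0.4])
```

Treating the output as a class index keeps the set of reachable poses finite, which is one plausible reading of "each class corresponds to one set of pose parameter values".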

In combination with any embodiment provided by the present invention, the neural network is obtained by training with phoneme sequence samples, and the method further includes: acquiring a video segment of a character speaking; acquiring, from the video segment, a plurality of first image frames containing the character and a plurality of audio frames respectively corresponding to the first image frames; converting each first image frame into a second image frame containing the interactive object and acquiring the pose parameter values corresponding to the second image frame; labeling the audio frame corresponding to each first image frame with the pose parameter values corresponding to the second image frame; and obtaining the phoneme sequence samples based on the audio frames labeled with pose parameter values.

In combination with any embodiment provided by the present invention, the method further includes: performing feature encoding on the phoneme sequence sample to obtain feature information corresponding to a second sampling time, and labeling the feature information with the corresponding pose parameter values to obtain feature information samples; and training an initial neural network based on the feature information samples, the trained neural network being obtained once the change in the network loss satisfies a convergence condition. Here, the network loss includes the difference between the pose parameter values predicted by the initial neural network and the labeled pose parameter values.

In combination with any embodiment provided by the present invention, the network loss includes the L2 norm of the difference between the pose parameter values predicted by the initial neural network and the labeled pose parameter values, and further includes the L1 norm of the pose parameter values predicted by the initial neural network.
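A small numeric sketch of a loss with these two terms is shown below. How the two terms are combined is not stated in the text; the plain sum with a weighting factor `l1_weight` is an assumption, as are the function and parameter names.

```python
import math
from typing import List


def network_loss(predicted: List[float], labeled: List[float], l1_weight: float = 1.0) -> float:
    """L2 norm of (predicted - labeled) plus a weighted L1 norm of predicted.

    The weighting factor is an assumption; the source only names the two terms.
    """
    l2 = math.sqrt(sum((p - y) ** 2 for p, y in zip(predicted, labeled)))
    l1 = sum(abs(p) for p in predicted)
    return l2 + l1_weight * l1


loss = network_loss([3.0, 0.0], [0.0, 4.0], l1_weight=0.5)
```

In this example the L2 term is 5 and the L1 term is 3, so the loss is 5 + 0.5 * 3 = 6.5. An L1 term on the prediction generally pushes small coefficients toward zero, which favors sparse pose parameter values.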

According to one aspect of the present invention, an apparatus for driving an interactive object displayed on a display device is provided. The apparatus includes: a phoneme sequence acquisition unit for acquiring a phoneme sequence corresponding to voice driving data of the interactive object; a parameter acquisition unit for acquiring pose parameter values of the interactive object that match the phoneme sequence; and a driving unit for controlling the pose of the interactive object displayed on the display device based on the pose parameter values.

According to one aspect of the present invention, an electronic device is provided. The device includes a memory and a processor; the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, implements the method for driving an interactive object described in any embodiment provided by the present invention.

According to one aspect of the present invention, a computer-readable recording medium storing a computer program is provided; when the program is executed by a processor, the method for driving an interactive object described in any embodiment provided by the present invention is implemented.

According to the method, apparatus, device, and computer-readable recording medium for driving an interactive object of one or more embodiments of the present invention, a phoneme sequence corresponding to the voice driving data of an interactive object displayed on a display device is acquired, pose parameter values of the interactive object that match the phoneme sequence are acquired, and the pose of the interactive object displayed on the display device is controlled based on those pose parameter values. The interactive object thus makes poses that match its communication with, or response to, a target object, so that the target object feels as if it is communicating with the interactive object, which improves the target object's experience of interacting with the interactive object.

To explain the technical solutions of one or more embodiments of this specification, or of the prior art, more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments described in one or more embodiments of this specification, and a person skilled in the art can obtain other drawings from them without creative effort.
A schematic diagram of a display device in the method for driving an interactive object provided by at least one embodiment of the present invention.
A flowchart of the method for driving an interactive object provided by at least one embodiment of the present invention.
A schematic diagram of the process of performing feature encoding on a phoneme sequence, provided by at least one embodiment of the present invention.
A schematic diagram of a phoneme sequence sample provided by at least one embodiment of the present invention.
A schematic diagram of the structure of an apparatus for driving an interactive object provided by at least one embodiment of the present invention.
A schematic diagram of the structure of an electronic device provided by at least one embodiment of the present invention.

Exemplary embodiments are described in detail below, with examples shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as recited in the appended claims.

The term "and/or" in this specification merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" covers three cases: A alone, both A and B, and B alone. The term "at least one" in this specification means any one of multiple items or any combination of at least two of them; for example, "including at least one of A, B, and C" means including any one or more elements selected from the set consisting of A, B, and C.

At least one embodiment of the present invention provides a method for driving an interactive object, which may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, tablet computer, game console, desktop computer, advertising machine, all-in-one machine, or in-vehicle terminal. The server may be a local server, a cloud server, or the like. The method may be implemented by a processor invoking computer-readable instructions stored in a memory.

In embodiments of the present invention, the interactive object may be any virtual figure capable of interacting with a target object. In one embodiment, the interactive object may be a virtual character, or another virtual figure capable of providing interactive functions, such as a virtual animal, a virtual item, or a cartoon figure. The interactive object may be displayed in 2D or 3D; the present invention does not limit this. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object issues a request by making a gesture or limb movement, actively triggering the interactive object to interact. In another example, the interactive object actively greets the target object or prompts it to perform an action, so that the target object interacts with the interactive object passively.

The interactive object may be displayed by a terminal device, which may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like; the present invention does not limit the specific form of the terminal device.

FIG. 1 shows a display device provided by at least one embodiment of the present invention. As shown in FIG. 1, the display device has a transparent display screen and can present a virtual scene and an interactive object with a stereoscopic effect by displaying a stereoscopic image on the transparent display screen. For example, the interactive object displayed on the transparent display screen in FIG. 1 includes a virtual cartoon character. In some embodiments, the terminal device described in the present invention may be the display device with the transparent display screen described above. The display device is provided with a memory and a processor; the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, implements the method for driving an interactive object provided by the present invention, thereby driving the interactive object displayed on the transparent display screen to communicate with or respond to the target object.

In some embodiments, in response to voice driving data for driving the interactive object to output speech, the interactive object can utter specified speech to the target object. The terminal device can generate voice driving data based on the movements, expressions, identity, preferences, and so on of a target object near the terminal device, and thereby drive the interactive object to communicate or respond by uttering the specified speech, providing an anthropomorphic service to the target object. Note that the voice driving data may also be generated in other ways; for example, it may be generated by a server and sent to the terminal device.

While the interactive object is interacting with the target object, if the interactive object is driven to utter the specified speech based on the voice driving data but cannot be driven to make facial movements synchronized with that speech, the interactive object appears stiff and unnatural when speaking, which may degrade the target object's experience of interacting with it. In view of this, at least one embodiment of the present invention proposes a method for driving an interactive object that improves the target object's experience of interacting with the interactive object.

FIG. 2 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present invention. As shown in FIG. 2, the method includes steps 201 to 203.

In step 201, a phoneme sequence corresponding to the voice driving data of the interactive object is acquired.

The voice driving data may include audio data (speech data), text, and the like. When the voice driving data is audio data, the audio data can be used directly to drive the interactive object to output speech; that is, the terminal device outputs speech directly from the audio data. When the voice driving data is text, the corresponding phonemes need to be generated from the morphemes contained in the text, and the interactive object is driven to output speech based on the generated phonemes. The voice driving data may also take other forms; the present invention does not limit this.

In embodiments of the present invention, the voice driving data may be driving data generated by a server or terminal device based on the movements, expressions, identity, preferences, and so on of the target object interacting with the interactive object, or it may be voice driving data that the terminal device retrieves from its internal memory. The present invention does not limit how the voice driving data is acquired.

When the voice driving data is audio data, the audio data can be divided into a plurality of audio frames, and the audio frames can be combined according to their states to form phonemes; the phonemes formed from the audio data then make up the phoneme sequence. Here, a phoneme is the smallest unit of speech divided according to the natural attributes of speech, and one articulatory action of a real person can form one phoneme.
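The frame-combining step can be illustrated with a simplified Python sketch. It assumes that an upstream recognizer (not shown, and not specified in the text) has already assigned a phoneme label to every audio frame, and that each frame lasts 10 ms; both are assumptions. Consecutive frames with the same label are merged into one phoneme with a duration.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_phonemes(frame_labels: List[str], frame_ms: int = 10) -> List[Tuple[str, int]]:
    """Merge runs of identical frame labels into (phoneme, duration in ms) pairs."""
    return [(label, len(list(run)) * frame_ms) for label, run in groupby(frame_labels)]


phoneme_sequence = frames_to_phonemes(["n", "n", "i", "i", "i", "h", "ao", "ao"])
```

The resulting pairs keep both the order of the phonemes and how long each one lasts, which is the information the later feature encoding relies on.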

When the voice driving data is text, the phonemes corresponding to the morphemes contained in the text can be obtained, yielding the corresponding phoneme sequence.
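For text input, a dictionary lookup is one simple way to picture the mapping from morphemes to phonemes. The lexicon below is a tiny hypothetical stand-in (using ARPAbet-style symbols) for whatever pronunciation resource a real system would use, and silently skipping unknown morphemes is a simplification made only for this sketch.

```python
from typing import Dict, List

# Hypothetical pronunciation lexicon; a real system would use a full dictionary
# or a grapheme-to-phoneme model.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(morphemes: List[str]) -> List[str]:
    """Concatenate the phonemes of each morpheme, skipping unknown morphemes."""
    phonemes: List[str] = []
    for morpheme in morphemes:
        phonemes.extend(LEXICON.get(morpheme, []))
    return phonemes


sequence = text_to_phonemes(["hello", "world"])
```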

A person skilled in the art should understand that the phoneme sequence corresponding to the voice driving data may also be obtained in other ways; the present invention does not limit this.

In step 202, pose parameter values of the interactive object that match the phoneme sequence are acquired.

In embodiments of the present invention, the pose parameter values of the interactive object that match the phoneme sequence may be obtained based on the acoustic features of the phoneme sequence. Alternatively, feature encoding may be performed on the phoneme sequence and the pose parameter values corresponding to the resulting feature codes determined, thereby determining the pose parameter values corresponding to the phoneme sequence.

Pose parameters are used to control the pose of the interactive object, and different pose parameter values can drive the interactive object to make the corresponding poses. The pose parameters include facial pose parameters and, in some embodiments, may further include limb pose parameters. Facial pose parameters are used to control the facial pose of the interactive object, including expression, mouth shape, movements of the facial features, and head pose. Limb pose parameters are used to control the limb pose of the interactive object, that is, to drive the interactive object to make limb movements. In embodiments of the present invention, a correspondence between specific features of the phoneme sequence and pose parameter values of the interactive object can be established in advance, so that the corresponding pose parameter values can be obtained from the phoneme sequence. The specific method of acquiring the pose parameter values of the interactive object that match the phoneme sequence is described in detail later. The specific form of the pose parameters can be determined according to the type of the interactive object model.
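One way to picture the split between facial and limb pose parameters is a small container type. The sketch below is purely illustrative: the field names, the muscle and joint names, and the assumed [0, 1] range for facial coefficients are not taken from the text, which leaves the concrete form to the interactive object model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PoseParameters:
    # Facial coefficients keyed by a hypothetical muscle name, assumed to lie in [0, 1].
    face: Dict[str, float] = field(default_factory=dict)
    # Optional limb pose values keyed by a hypothetical joint name.
    limbs: Optional[Dict[str, float]] = None

    def clamped_face(self) -> Dict[str, float]:
        """Return the facial coefficients clamped to the assumed [0, 1] range."""
        return {name: min(1.0, max(0.0, value)) for name, value in self.face.items()}


pose = PoseParameters(face={"jaw_open": 1.3, "lip_corner_pull": -0.2})
```

Keeping the limb values optional mirrors the text, where facial pose parameters are always present and limb pose parameters appear only in some embodiments.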

In step 203, the pose of the interactive object displayed on the display device is controlled based on the pose parameter values.

The pose parameter values match the phoneme sequence corresponding to the voice driving data of the interactive object. Controlling the pose of the interactive object based on these values therefore makes the pose consistent with the communication or response that the interactive object directs at the target object. For example, while the interactive object communicates with or responds to the target object by voice, the pose it makes is synchronized with the output voice, giving the target object the impression that the interactive object is speaking.

In embodiments of the present invention, a phoneme sequence corresponding to the voice driving data of the interactive object displayed on the display device is obtained, the pose parameter values of the interactive object that match the phoneme sequence are obtained, and the pose of the interactive object displayed on the display device is controlled based on those pose parameter values. The interactive object thus makes a matching pose when communicating with or responding to the target object, so that the target object feels that it is communicating with the interactive object, which improves the interactive experience of the target object.

In some embodiments, the method is applied to a server, which may be a local server, a cloud server, or the like. The server processes the voice driving data of the interactive object to generate the pose parameter values of the interactive object, and renders them with a three-dimensional rendering engine to obtain an animation of the interactive object. The server may send the animation to a terminal for display so as to communicate with or respond to the target object, or it may send the animation to the cloud so that the terminal obtains the animation from the cloud and communicates with or responds to the target object. Alternatively, after generating the pose parameter values of the interactive object, the server may send the pose parameter values to the terminal, so that the terminal completes the rendering, animation generation, and display.

In some embodiments, the method is applied to a terminal. The terminal processes the voice driving data of the interactive object to generate the pose parameter values of the interactive object, and renders them with a three-dimensional rendering engine to obtain an animation of the interactive object. The terminal may display the animation so as to communicate with or respond to the target object.

In some embodiments, the output voice and/or displayed text of the display device may be controlled based on the phoneme sequence. The pose of the interactive object displayed on the display device may be controlled based on the pose parameter values at the same time as the output voice and/or displayed text is controlled based on the phoneme sequence.

In embodiments of the present invention, the pose parameter values match the phoneme sequence. Consequently, when the voice output and/or text displayed based on the phoneme sequence is synchronized with the pose controlled based on the pose parameter values, the pose made by the interactive object is synchronized with the output voice and/or displayed text, giving the target object the impression that it is talking with the interactive object.

Because voice output must remain continuous, in one embodiment a time window is moved along the phoneme sequence, and the phonemes inside the time window are output at each move. A predetermined duration is used as the step size of each move. For example, the length of the time window may be set to 1 second and the predetermined duration to 0.1 second. While the phonemes in the time window are output, the pose parameter values corresponding to the phoneme, or to the feature information of the phoneme, at a predetermined position in the time window are obtained and used to control the pose of the interactive object. The predetermined position lies a set duration after the start of the time window. For example, when the length of the time window is 1 s, the predetermined position may be 0.5 s after the start of the time window. Each time the time window moves, the phonemes in the window are output while the pose of the interactive object is controlled with the pose parameter values corresponding to the predetermined position of the window, so that the pose of the interactive object is synchronized with the output voice and the target object has the impression that it is talking with the interactive object.

Changing the predetermined duration changes the time interval (frequency) at which pose parameter values are obtained, and therefore changes how often the interactive object makes a pose. The predetermined duration may be set according to the actual interaction scenario so that the changes in the pose of the interactive object appear more natural.
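The sliding time window described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the `(symbol, start, end)` tuple layout, the function name `slide_window`, and the overlap rule used to decide which phonemes are "inside" the window are all assumptions. The default values (1 s window, 0.1 s step, 0.5 s position) are the ones given as examples in the text.

```python
from typing import Iterator, List, Tuple

# (symbol, start, end) in seconds; a hypothetical representation of a timed phoneme.
Phoneme = Tuple[str, float, float]


def slide_window(
    phonemes: List[Phoneme],
    window_len: float = 1.0,  # length of the time window
    step: float = 0.1,        # predetermined duration used as the step size
    offset: float = 0.5,      # predetermined position, measured from the window start
) -> Iterator[Tuple[List[str], float]]:
    """Yield (phonemes overlapping the window, time of the predetermined position)."""
    if not phonemes:
        return
    total = max(end for _, _, end in phonemes)
    n_steps = int(round(total / step))
    for k in range(n_steps):
        start = k * step  # index-based to avoid accumulating rounding error
        end = start + window_len
        inside = [sym for sym, s, e in phonemes if s < end and e > start]
        # The pose parameter values would be looked up at time `start + offset`.
        yield inside, start + offset
```

A caller would output the yielded phonemes and, at the same moment, apply the pose parameter values found for the yielded time. A smaller `step` gives more frequent pose updates.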

In some embodiments, feature encoding may be performed on the phoneme sequence to obtain feature information of the phoneme sequence, and the pose parameter values of the interactive object may be determined based on the feature information.

According to embodiments of the present invention, feature encoding is performed on the phoneme sequence corresponding to the voice driving data of the interactive object, and the corresponding pose parameter values are obtained from the resulting feature information. While voice is output based on the phoneme sequence, the pose of the interactive object is controlled based on the pose parameter values corresponding to the feature information. In particular, the interactive object is driven to make facial movements based on the facial pose parameter values corresponding to the feature information, so that its expression is synchronized with the voice it utters. This gives the target object the impression that it is talking with the interactive object and improves the interactive experience of the target object.

In some embodiments, feature encoding may be performed on the phoneme sequence to obtain its feature information as follows.

First, for the multiple kinds of phonemes contained in the phoneme sequence, a code sequence is generated for each kind of phoneme.

In one example, it is detected whether a first phoneme is present at each time point, where the first phoneme is any one of the multiple phonemes. The code value at each time point where the first phoneme is present is set to a first value, and the code value at each time point where it is absent is set to a second value. Once a code value has been assigned to every time point, the code sequence corresponding to the first phoneme is obtained. For example, the code value may be set to 1 at time points where the first phoneme is present and to 0 at time points where it is absent. In other words, for each of the multiple phonemes contained in the phoneme sequence, it is detected whether that phoneme is present at each time point, the code values are assigned accordingly, and the code sequence corresponding to that phoneme is obtained. Those skilled in the art should understand that these code values are only an example. Other values may be used, and the present invention is not limited in this respect.
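The per-phoneme code sequences can be illustrated with a short sketch. It assumes that time is discretized into frames of fixed length and that each phoneme carries a start and end time. The `frame` parameter, the tuple layout, and the function name are illustrative assumptions, since the text does not fix a time resolution.

```python
from typing import Dict, List, Tuple

# (symbol, start, end) in seconds; a hypothetical representation of a timed phoneme.
Phoneme = Tuple[str, float, float]


def code_sequences(
    phonemes: List[Phoneme],
    frame: float = 0.1,         # assumed length of one time point, in seconds
    first_value: float = 1.0,   # code value where the phoneme is present
    second_value: float = 0.0,  # code value where the phoneme is absent
) -> Dict[str, List[float]]:
    """Return one code sequence per distinct phoneme symbol."""
    total = max(end for _, _, end in phonemes)
    n = int(round(total / frame))
    # Start every sequence at the "absent" value, then mark occupied frames.
    codes = {sym: [second_value] * n for sym, _, _ in phonemes}
    for sym, start, end in phonemes:
        for t in range(int(round(start / frame)), int(round(end / frame))):
            codes[sym][t] = first_value
    return codes
```

For the sequence j, i, j, ie4 this yields three code sequences, one each for j, i, and ie4, with the j sequence set to the first value during both occurrences of j.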

Next, the feature information of the code sequence corresponding to each phoneme is obtained based on the code values of that code sequence and the duration of each phoneme in the phoneme sequence.

In one example, for the code sequence corresponding to a first phoneme, a Gaussian filter is used to perform a Gaussian convolution on the continuous values of the first phoneme over time, yielding the feature information of that code sequence. Here, the first phoneme is any one of the multiple phonemes.

Finally, the feature information of the phoneme sequence is obtained from the set of feature information of the individual code sequences.

FIG. 3 is a schematic diagram of the process of performing feature encoding on a phoneme sequence. As shown in FIG. 3, the phoneme sequence 310 contains the phonemes j, i, j, ie4 (only some phonemes are shown for simplicity), and the code sequences 321, 322, and 323 are obtained for the phonemes j, i, and ie4, respectively. In each code sequence, the code value at time points where the phoneme is present is set to the first value (for example, 1), and the code value at time points where the phoneme is absent is set to the second value (for example, 0). Taking code sequence 321 as an example, its value is the first value at time points where phoneme j is present in the phoneme sequence 310 and the second value at time points where phoneme j is absent. Together, the code sequences 321, 322, and 323 form the complete code sequence 320.

The feature information of the code sequences 321, 322, and 323 can be obtained based on the code values of these code sequences, which correspond to the phonemes j, i, and ie4, and on the durations of the corresponding phonemes in the three code sequences, that is, the duration of j in code sequence 321, the duration of i1 in code sequence 322, and the duration of ie4 in code sequence 323.

For example, a Gaussian filter may be used to perform a Gaussian convolution on the continuous values over time of the phonemes j, i, and ie4 in the code sequences 321, 322, and 323, respectively, to obtain the feature information of the code sequences. Performing the Gaussian convolution on the continuous values of a phoneme over time smooths the transitions in each code sequence where the code value changes from the second value to the first value or from the first value to the second value. The Gaussian convolution is applied to each of the code sequences 321, 322, and 323 to obtain the feature values of each code sequence. The feature values are the parameters that make up the feature information, and the feature information 330 corresponding to the phoneme sequence 310 is obtained from the set of feature information of the individual code sequences. Those skilled in the art should understand that other operations may be performed on each code sequence to obtain its feature information, and the present invention is not limited in this respect.

In embodiments of the present invention, obtaining the feature information of the code sequences based on the duration of each phoneme in the phoneme sequence smooths the transitions in the code sequences. For example, in addition to 0 and 1, the values of a code sequence may take intermediate values such as 0.2 or 0.3. Pose parameter values obtained from these intermediate values make the changes in the pose of the interactive character, and in particular the changes in its expression, smoother and more natural, which improves the interactive experience of the target object.
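The smoothing effect can be reproduced with a small, dependency-free sketch of a one-dimensional Gaussian convolution. The kernel width `sigma`, the truncation `radius`, and the zero padding at the sequence boundaries are assumptions chosen for illustration. The text does not specify the filter parameters.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(code: List[float], sigma: float = 1.0, radius: int = 3) -> List[float]:
    """Convolve a 0/1 code sequence with a Gaussian kernel (zero padding)."""
    kernel = gaussian_kernel(sigma, radius)
    out = []
    for t in range(len(code)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - radius
            if 0 <= idx < len(code):  # samples outside the sequence count as 0
                acc += w * code[idx]
        out.append(acc)
    return out
```

Applied to a code sequence such as `[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]`, the output rises and falls gradually around the phoneme boundaries instead of jumping between 0 and 1, which is the source of the intermediate values mentioned above.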

In some embodiments, the facial pose parameters may include facial muscle control coefficients.

From an anatomical point of view, the movement of a human face is the result of the coordinated deformation of various facial muscles. The facial muscles of the interactive object can therefore be divided to obtain a facial muscle model, and the movement of each muscle (region) obtained by the division can be controlled by a corresponding facial muscle control coefficient. That is, contraction and expansion of each muscle is controlled so that the face of the interactive character can make various expressions. For each muscle in the facial muscle model, the movement states corresponding to different muscle control coefficients can be set according to the position of the muscle on the face and the movement characteristics of the muscle itself. For the upper lip muscle, for example, the control coefficient ranges from 0 to 1, different values in this range correspond to different contraction or expansion states of the upper lip muscle, and changing the value opens and closes the mouth vertically. For the muscle at the left corner of the mouth, the control coefficient also ranges from 0 to 1, different values in this range correspond to different contraction or expansion states of that muscle, and changing the value changes the mouth horizontally.
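As a rough illustration of how such coefficients might be held in code, the sketch below stores one coefficient per muscle and clamps each value to the range 0 to 1 given in the text. The class name, method name, and muscle identifiers (`upper_lip`, `mouth_corner_left`) are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FacialMuscleModel:
    """Holds one control coefficient in [0, 1] per facial muscle (region)."""
    coefficients: Dict[str, float] = field(default_factory=dict)

    def set_coefficient(self, muscle: str, value: float) -> float:
        # Clamp to the 0..1 range described for the control coefficients.
        clamped = min(1.0, max(0.0, value))
        self.coefficients[muscle] = clamped
        return clamped
```

A renderer would then map each stored coefficient to a contraction or expansion state of the corresponding muscle, for example the vertical opening of the mouth for the upper lip.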

While voice is output based on the phoneme sequence, the interactive object is driven to make facial expressions based on the facial muscle control coefficient values corresponding to the phoneme sequence. When the display device outputs the voice, the interactive object simultaneously makes the expression of uttering that voice, which gives the target object the impression that the interactive object is speaking and improves the interactive experience of the target object.

In some embodiments, the facial movements of the interactive object may be associated with body poses. That is, the facial pose parameter values corresponding to a facial movement may be associated with a body pose, and the body pose may include limb movements, gestures, walking poses, and the like.

In the process of driving the interactive object, the driving data of the body pose associated with the facial pose parameter values is obtained. While voice is output based on the phoneme sequence, the interactive object is driven to make limb movements based on that driving data. In other words, while the interactive object is driven to make facial movements based on its voice driving data, the driving data of the associated body pose is also obtained based on the facial pose parameter values corresponding to those facial movements. When the voice is output, the interactive object is thus driven to make the corresponding facial and limb movements in synchronization, which makes its speaking state more vivid and natural and improves the interactive experience of the target object.

In some embodiments, the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence may be obtained as follows.

First, the feature information of the phoneme sequence is sampled at a predetermined time interval to obtain the sampled feature information corresponding to each first sampling time. For example, if the predetermined time interval is 0.1 s, the first sampling times may be 0.1 s, 0.2 s, 0.3 s, and so on.

Referring to FIG. 3, the feature information 330 is time-based information. Sampling it at the predetermined time interval therefore yields the sampled feature information corresponding to each first sampling time.
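The sampling step can be sketched as follows, assuming the feature information is stored as one smoothed curve per phoneme at a fixed frame resolution. The storage layout, the `frame` resolution, and the nearest-frame lookup are assumptions made for the example.

```python
from typing import Dict, List


def sample_features(
    features: Dict[str, List[float]],  # one smoothed curve per phoneme symbol
    frame: float,     # assumed time resolution of the curves, in seconds
    interval: float,  # predetermined sampling interval, e.g. 0.1 s
    duration: float,  # total duration to sample, in seconds
) -> List[List[float]]:
    """Return one feature vector per first sampling time (interval, 2*interval, ...)."""
    symbols = sorted(features)  # fixed symbol order for the vector components
    n_frames = len(features[symbols[0]])
    n_samples = int(round(duration / interval))
    samples = []
    for k in range(1, n_samples + 1):
        t = k * interval
        # Nearest frame index, clamped to the last available frame.
        idx = min(int(round(t / frame)), n_frames - 1)
        samples.append([features[s][idx] for s in symbols])
    return samples
```

Each returned vector is what would be passed on as the sampled feature information for one first sampling time.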

Next, the sampled feature information corresponding to a first sampling time is input into a pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to that sampled feature information. In this way, the pose parameter values of the interactive object corresponding to each first sampling time are obtained from the sampled feature information corresponding to that first sampling time.

As described above, when phonemes are output by moving a time window along the phoneme sequence, the feature information at the predetermined position of the time window is obtained. That is, the feature information at the first sampling time corresponding to the predetermined position of the time window is obtained, and the pose parameter values corresponding to that feature information are used to control the pose of the interactive object. The interactive object thus makes a pose consistent with the voice being uttered, which makes the process of the interactive object uttering the voice more vivid and natural.

In some embodiments, the neural network includes a long short-term memory (LSTM) network and a fully connected network. The LSTM network is a recurrent neural network over time and can learn the history of the input sampled feature information. The LSTM network and the fully connected network are trained jointly.

When the neural network includes an LSTM network and a fully connected network, the sampled feature information corresponding to the first sampling time is first input into the LSTM network. The LSTM network outputs associated feature information based on the sampled feature information that precedes the first sampling time. That is, the output of the LSTM network reflects the influence of historical feature information on the current feature information. The associated feature information is then input into the fully connected network, and the pose parameter values corresponding to the associated feature information are determined from the classification result of the fully connected network. Each class corresponds to one set of pose parameter values, that is, to one distribution of facial muscle control coefficients.
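Only the final step, turning a classification result into a set of pose parameter values, is sketched below. The LSTM and the fully connected layers themselves are not shown. The table contents, the muscle names, and the use of an argmax over class scores are illustrative assumptions, because the text states only that each class corresponds to one set of pose parameter values.

```python
from typing import Dict, List, Sequence

# Hypothetical lookup table: class index -> one set of facial muscle control coefficients.
POSE_TABLE: List[Dict[str, float]] = [
    {"upper_lip": 0.0, "mouth_corner_left": 0.0},
    {"upper_lip": 0.8, "mouth_corner_left": 0.2},
    {"upper_lip": 0.3, "mouth_corner_left": 0.9},
]


def pose_from_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Pick the highest-scoring class and return its pose parameter set."""
    if len(scores) != len(POSE_TABLE):
        raise ValueError("one score per class is required")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return dict(POSE_TABLE[best])  # copy so callers cannot mutate the table
```

Here `scores` stands for the output of the fully connected network at one first sampling time.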

In embodiments of the present invention, the pose parameter values corresponding to the sampled feature information of the phoneme sequence are predicted by the LSTM network and the fully connected network. Related historical feature information is fused with the current feature information, so that historical pose parameter values influence the change in the current pose parameter values, and the pose parameter values of the interactive character change more smoothly and naturally.

In some embodiments, the neural network may be trained as follows.

First, a phoneme sequence sample is obtained. The phoneme sequence sample contains pose parameter values of the interactive object labeled at second sampling times spaced at a predetermined time interval. In the phoneme sequence sample shown in FIG. 4, the dotted lines indicate the second sampling times, and the pose parameter values of the interactive object are labeled at each second sampling time.

Next, feature encoding is performed on the phoneme sequence sample to obtain the feature information corresponding to each second sampling time, and the feature information is labeled with the corresponding pose parameter values to obtain a feature information sample. That is, the feature information sample contains the pose parameter values of the interactive object labeled at the second sampling times.

After the feature information sample is obtained, the neural network can be trained on it. Training is complete when the network loss is smaller than a predetermined loss value. The network loss includes the difference between the pose parameter values predicted by the neural network and the labeled pose parameter values.

In one example, the network loss function is given by equation (1).

[Equation (1) and its symbols are not reproduced in this text.]

In equation (1), the first symbol denotes the i-th pose parameter value predicted by the neural network, the second symbol denotes the labeled i-th pose parameter value, that is, the ground-truth value, and the third symbol denotes the L2 norm of a vector.
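One form of equation (1) that is consistent with the definitions above is sketched below. It is a reconstruction under assumptions, not the formula from the original publication: the symbols $p_i$ (predicted value), $p_i^{*}$ (labeled value), and $N$ (number of sampling times), as well as the averaging and the squaring of the norm, are all chosen here for illustration.

```latex
% Assumed reconstruction of equation (1); symbols and normalization are illustrative.
L = \frac{1}{N} \sum_{i=1}^{N} \left\lVert p_i - p_i^{*} \right\rVert_2^{2}
```

Whatever the exact normalization, the loss decreases as the predicted pose parameter values approach the labeled ones, which matches the description of the network loss as the difference between the two.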

The network parameter values of the neural network are adjusted to minimize the network loss function. When the change in the network loss satisfies a convergence condition, for example when the amount of change in the network loss is smaller than a predetermined threshold or the number of iterations reaches a predetermined number, training is complete and the trained neural network is obtained.
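The two stopping conditions can be expressed as a small loop. In this sketch, `step_fn` is a hypothetical callable that performs one parameter update and returns the current network loss. The function name and default thresholds are assumptions.

```python
from typing import Callable, Tuple


def train_until_converged(
    step_fn: Callable[[], float],  # hypothetical: runs one update, returns the loss
    max_iters: int = 1000,         # predetermined number of iterations
    tol: float = 1e-4,             # predetermined threshold on the change in loss
) -> Tuple[int, float]:
    """Run updates until the loss change is below tol or max_iters is reached."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    prev = None
    loss = 0.0
    for it in range(1, max_iters + 1):
        loss = step_fn()
        if prev is not None and abs(prev - loss) < tol:
            return it, loss  # change in loss fell below the threshold
        prev = loss
    return max_iters, loss   # iteration budget exhausted
```

The function returns the number of iterations performed together with the last loss value, so a caller can tell which of the two conditions ended the training.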

In another example, the network loss function is given by equation (2).

[Equation (2) and its symbols are not reproduced in this text.]

In equation (2), the labeled i-th pose parameter value is the ground-truth value, one norm symbol denotes the L2 norm of a vector, and the other denotes the L1 norm of a vector.

Adding the L1 norm of the predicted pose parameter values to the network loss function strengthens the sparsity constraint on the facial parameters.
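A form of equation (2) consistent with this description is sketched below. As before, it is an assumed reconstruction: $p_i$ denotes the predicted value, $p_i^{*}$ the labeled value, $N$ the number of sampling times, and the weight $\lambda$ on the L1 term is a hypothetical hyperparameter that the text does not mention.

```latex
% Assumed reconstruction of equation (2); \lambda is a hypothetical weighting factor.
L = \frac{1}{N} \sum_{i=1}^{N} \left( \left\lVert p_i - p_i^{*} \right\rVert_2^{2}
    + \lambda \left\lVert p_i \right\rVert_1 \right)
```

The L1 term penalizes the total magnitude of the predicted coefficients, which pushes many of them toward zero. In terms of the facial muscle model, this favors expressions in which only a few muscles are active at a time.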

In some embodiments, the phoneme sequence sample may be obtained as follows.

First, a video segment of a character uttering speech is obtained. For example, a video segment of a real person speaking may be obtained.

From the video segment, multiple first image frames containing the character and multiple audio frames corresponding to the first image frames are obtained. That is, the video segment is split into image frames and audio frames. Each image frame corresponds to an audio frame, so for any image frame it is possible to determine the audio frame containing the sound the character utters while making the expression shown in that image frame.

Next, the first image frame, that is, the image frame containing the character, is converted into a second image frame containing the interactive object, and the pose parameter values corresponding to the second image frame are acquired. Taking the case where the first image frame contains a real person as an example, the image frame of the real person can be converted into a second image frame containing the image represented by the interactive object. Because the pose parameter values of the real person correspond to those of the interactive object, the pose parameter values of the interactive object in each second image frame can be acquired.

Then, the audio frame corresponding to the first image frame is labeled with the pose parameter values corresponding to the second image frame, and the phoneme sequence samples are obtained from the audio frames labeled with pose parameter values.
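The labeling step above amounts to pairing each audio frame with the pose parameter values of its matching second image frame. A minimal sketch follows, assuming frames are already aligned one-to-one by index; `LabeledAudioFrame` and `label_audio_frames` are hypothetical names, and the frame conversion and pose extraction are assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LabeledAudioFrame:
    audio: Sequence[float]        # raw samples of one audio frame
    pose_params: Sequence[float]  # pose parameter values used as the label


def label_audio_frames(
    audio_frames: List[Sequence[float]],
    pose_params_per_image: List[Sequence[float]],
) -> List[LabeledAudioFrame]:
    """Pair the k-th audio frame with the pose parameters of the k-th second image frame."""
    if len(audio_frames) != len(pose_params_per_image):
        raise ValueError("each image frame must have exactly one audio frame")
    return [
        LabeledAudioFrame(audio=a, pose_params=p)
        for a, p in zip(audio_frames, pose_params_per_image)
    ]
```

The length check reflects the one-to-one correspondence between image frames and audio frames stated in the description.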

In the embodiments of the present invention, the video segment of the character is split into corresponding image frames and audio frames, and the first image frames containing a real person are converted into second image frames containing the interactive object in order to acquire the pose parameter values corresponding to the phoneme sequence. This improves the correspondence between phonemes and pose parameter values and yields higher-quality phoneme sequence samples.

FIG. 5 is a schematic diagram showing the structure of an apparatus for driving an interactive object according to at least one embodiment of the present invention, the interactive object being displayed on a display device. As shown in FIG. 5, the apparatus may include: a phoneme sequence acquisition unit 501 for acquiring a phoneme sequence corresponding to voice driving data of the interactive object; a parameter acquisition unit 502 for acquiring pose parameter values of the interactive object that match the phoneme sequence; and a drive unit 503 for controlling, based on the pose parameter values, the pose of the interactive object displayed on the display device.
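The three-unit structure can be pictured as a simple pipeline. The sketch below is illustrative only: `InteractiveObjectDriver` and its three callables are hypothetical stand-ins for units 501, 502, and 503, and nothing here reflects how the units are actually implemented.

```python
from typing import Callable, List, Sequence

PoseParams = List[float]


class InteractiveObjectDriver:
    """Chains three callables that stand in for units 501, 502, and 503."""

    def __init__(
        self,
        acquire_phonemes: Callable[[bytes], Sequence[str]],
        acquire_pose_params: Callable[[Sequence[str]], List[PoseParams]],
        apply_pose: Callable[[PoseParams], None],
    ) -> None:
        self.acquire_phonemes = acquire_phonemes        # role of unit 501
        self.acquire_pose_params = acquire_pose_params  # role of unit 502
        self.apply_pose = apply_pose                    # role of unit 503

    def drive(self, voice_driving_data: bytes) -> List[PoseParams]:
        """Run the pipeline once and return the pose parameter values applied."""
        phonemes = self.acquire_phonemes(voice_driving_data)
        poses = self.acquire_pose_params(phonemes)
        for pose in poses:
            self.apply_pose(pose)
        return poses
```

Passing the units in as callables keeps the sketch independent of any particular speech front end, model, or renderer.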

In some embodiments, the apparatus further includes an output unit for controlling, based on the phoneme sequence, the voice output and/or the text displayed by the display device.

In some embodiments, the parameter acquisition unit is specifically configured to perform feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence, and to acquire the pose parameter values of the interactive object corresponding to that feature information.

In some embodiments, when performing feature encoding on the phoneme sequence to obtain its feature information, the parameter acquisition unit is specifically configured to: generate, for each of the plurality of types of phonemes contained in the phoneme sequence, a code sequence corresponding to that phoneme; obtain feature information of the code sequence corresponding to each phoneme based on the code values of that code sequence and the duration of that phoneme in the phoneme sequence; and obtain the feature information of the phoneme sequence based on the feature information of the code sequences corresponding to the respective phonemes.

In some embodiments, when generating the code sequences corresponding to the plurality of phonemes contained in the phoneme sequence, the parameter acquisition unit is specifically configured to detect whether each time point corresponds to a first phoneme, set the code value at each time point that corresponds to the first phoneme to a first value, and set the code value at each time point that does not correspond to the first phoneme to a second value, thereby obtaining the code sequence corresponding to the first phoneme. Here, the first phoneme is any one of the plurality of phonemes.
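The per-phoneme code sequence described above can be sketched as follows. The input format (phoneme name with start and end times in integer milliseconds), the fixed time step, and the default first and second values of 1 and 0 are assumptions made for illustration; `build_code_sequences` is a hypothetical name.

```python
from typing import Dict, List, Tuple

# (phoneme, start_ms, end_ms); a hypothetical input format.
TimedPhoneme = Tuple[str, int, int]


def build_code_sequences(
    phonemes: List[TimedPhoneme],
    num_steps: int,
    step_ms: int,
    first_value: int = 1,
    second_value: int = 0,
) -> Dict[str, List[int]]:
    """Return one code sequence per distinct phoneme, sampled every step_ms."""
    sequences = {name: [second_value] * num_steps for name, _, _ in phonemes}
    for name, start_ms, end_ms in phonemes:
        for k in range(num_steps):
            t = k * step_ms
            # Time point t corresponds to this phoneme if it lies in [start, end).
            if start_ms <= t < end_ms:
                sequences[name][k] = first_value
    return sequences
```

For example, a sequence in which "j" is spoken over 0 to 200 ms and again over 500 to 600 ms, with "i1" in between, yields a code sequence for "j" that is set to the first value only at the time points inside those two intervals.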

In some embodiments, when obtaining the feature information of the code sequences corresponding to the plurality of types of phonemes based on the code values of those code sequences and the durations of the respective phonemes in the phoneme sequence, the parameter acquisition unit is specifically configured to perform, on the code sequence corresponding to a first phoneme, a Gaussian convolution over the temporally continuous values of the first phoneme using a Gaussian filter, thereby obtaining the feature information of the code sequence corresponding to the first phoneme. Here, the first phoneme is any one of the plurality of phonemes.
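A Gaussian convolution over a binary code sequence can be sketched in plain Python as follows. The kernel width (`sigma`), the kernel radius, and zero padding at the sequence edges are assumptions; the description does not specify them, and `smooth_code_sequence` is a hypothetical name.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Discrete Gaussian kernel of length 2*radius+1, normalized to sum to 1."""
    weights = [
        math.exp(-(k * k) / (2.0 * sigma * sigma))
        for k in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_code_sequence(
    codes: List[int], sigma: float = 1.0, radius: int = 2
) -> List[float]:
    """Convolve a 0/1 code sequence with a Gaussian kernel (zero padding at edges)."""
    kernel = gaussian_kernel(sigma, radius)
    smoothed = []
    for i in range(len(codes)):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = i + offset
            if 0 <= j < len(codes):
                acc += weight * codes[j]
        smoothed.append(acc)
    return smoothed
```

The convolution turns the hard 0/1 transitions at phoneme boundaries into gradual ramps, so a longer run of the first value produces a higher and wider response than a short one. This is how the duration of each phoneme enters the feature information.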

In some embodiments, the pose parameters include facial pose parameters, and the facial pose parameters include facial muscle control coefficients, each used to control the motion state of at least one facial muscle. The drive unit is specifically configured to drive the interactive object, based on the facial muscle control coefficients that match the phoneme sequence, to make facial movements that match each phoneme in the phoneme sequence.

In some embodiments, the apparatus further includes a motion drive unit for acquiring driving data of a body pose associated with the facial pose parameters and for driving the interactive object, based on the driving data of the body pose associated with the facial pose parameter values, to perform limb movements.

In some embodiments, when acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence, the parameter acquisition unit is specifically configured to sample the feature information of the phoneme sequence at a predetermined time interval to obtain sampled feature information corresponding to a first sampling time, and to input the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information.
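Sampling at a fixed interval can be sketched as below. The passage states only that sampling happens at a predetermined time interval; taking a window of several consecutive time steps per sample is an assumption made here, and `sample_features`, `interval`, and `window` are hypothetical names.

```python
from typing import List


def sample_features(
    features: List[List[float]],
    interval: int,
    window: int,
) -> List[List[List[float]]]:
    """Take a window of `window` time steps every `interval` steps.

    features[t] is the feature vector (one value per phoneme) at time step t.
    """
    if interval <= 0 or window <= 0:
        raise ValueError("interval and window must be positive")
    samples = []
    for start in range(0, len(features) - window + 1, interval):
        samples.append(features[start:start + window])
    return samples
```

Each returned window would then be the input to the neural network for one sampling time.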

In some embodiments, the neural network includes a long short-term memory (LSTM) network and a fully connected network. When inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the corresponding pose parameter values of the interactive object, the parameter acquisition unit is specifically configured to: input the sampled feature information corresponding to the first sampling time into the LSTM network, which outputs associated feature information based on the sampled feature information preceding the first sampling time; input the associated feature information into the fully connected network; and determine the pose parameter values corresponding to the associated feature information based on the classification result of the fully connected network, where each class in the classification result corresponds to one set of pose parameter values.
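The last step, mapping a classification result to a set of pose parameter values, can be sketched as a table lookup. The LSTM and fully connected layers are omitted: `class_scores` stands in for the output of the fully connected network, and `pose_table` is a hypothetical mapping from class index to one set of pose parameter values.

```python
from typing import Dict, List, Sequence


def pose_from_classification(
    class_scores: Sequence[float],
    pose_table: Dict[int, List[float]],
) -> List[float]:
    """Pick the highest-scoring class and return its associated pose parameter values."""
    best_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
    return pose_table[best_class]
```

Treating the output as a class index means the network selects among a fixed number of predefined pose parameter sets rather than regressing each value freely.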

In some embodiments, the neural network is obtained by training on phoneme sequence samples. The apparatus further includes a sample acquisition unit for: acquiring a video segment of a character speaking; acquiring, from the video segment, a plurality of first image frames containing the character and a plurality of audio frames corresponding to the first image frames; converting the first image frames into second image frames containing the interactive object and acquiring the pose parameter values corresponding to the second image frames; labeling the audio frames corresponding to the first image frames with the pose parameter values corresponding to the second image frames; and obtaining the phoneme sequence samples from the audio frames labeled with pose parameter values.

At least one embodiment of this specification further provides an electronic device. As shown in FIG. 6, the device includes a memory and a processor; the memory stores computer instructions executable on the processor, and the processor, when executing the computer instructions, implements the method for driving an interactive object according to any embodiment of the present invention.

At least one embodiment of this specification further provides a computer-readable recording medium storing a computer program that, when executed by a processor, implements the method for driving an interactive object according to any embodiment of the present invention.

Those skilled in the art should understand that one or more embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Furthermore, one or more embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable recording media (including but not limited to disk memory, CD-ROM, and optical memory) containing computer-usable program code.

The embodiments of the present invention are described in a progressive manner. Identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, because the data processing device embodiment is essentially similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the corresponding parts of the method embodiment.

Specific embodiments of the present invention have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Embodiments of the subject matter and functional operations described in the present invention can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in the present invention and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter of the present invention can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, for example a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. The computer recording medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The processes and logic flows described in the present invention can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.

Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer are a central processing unit for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, in order to receive data from them, transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, for example a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Although the present invention contains many specific implementation details, these should not be construed as limiting the scope of the invention or of what is claimed; they serve mainly to describe features of particular embodiments of the invention. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of the various system modules and components in the embodiments above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments fall within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

The above are merely some embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of one or more embodiments of the present invention.

Claims

A method for driving an interactive object displayed on a display device, the method comprising:
acquiring a phoneme sequence corresponding to voice driving data of the interactive object;
acquiring pose parameter values of the interactive object that match the phoneme sequence; and
controlling, based on the pose parameter values, the pose of the interactive object displayed on the display device.

The method for driving an interactive object according to claim 1, further comprising controlling, based on the phoneme sequence, the voice output and/or the text displayed by the display device.

The method for driving an interactive object according to claim 1 or 2, wherein acquiring the pose parameter values of the interactive object that match the phoneme sequence comprises:
performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and
acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence.

The method for driving an interactive object according to claim 3, wherein performing feature encoding on the phoneme sequence to obtain the feature information of the phoneme sequence comprises:
generating, for each phoneme among a plurality of types of phonemes contained in the phoneme sequence, a code sequence corresponding to the phoneme;
obtaining feature information of the code sequence corresponding to the phoneme based on the code values of the code sequence corresponding to the phoneme and the duration corresponding to the phoneme; and
obtaining the feature information of the phoneme sequence based on the feature information of the code sequences corresponding to the respective types of phonemes.

The method for driving an interactive object according to claim 4, wherein generating, for each phoneme among the plurality of types of phonemes contained in the phoneme sequence, a code sequence corresponding to the phoneme comprises:
detecting whether each time point corresponds to the phoneme; and
obtaining the code sequence corresponding to the phoneme by setting the code value at each time point that corresponds to the phoneme to a first value and setting the code value at each time point that does not correspond to the phoneme to a second value.

The method for driving an interactive object according to claim 4 or 5, wherein obtaining the feature information of the code sequences corresponding to the plurality of types of phonemes, based on the code values of those code sequences and the durations corresponding to the respective phonemes, comprises:
for each phoneme among the plurality of types of phonemes, performing, on the code sequence corresponding to the phoneme, a Gaussian convolution over the temporally continuous values of the phoneme using a Gaussian filter to obtain the feature information of the code sequence corresponding to the phoneme.

The method for driving an interactive object according to any one of claims 1 to 6, wherein the pose parameters include facial pose parameters, the facial pose parameters include facial muscle control coefficients, and each facial muscle control coefficient is used to control the motion state of at least one facial muscle; and
controlling, based on the pose parameter values, the pose of the interactive object displayed on the display device comprises:
driving the interactive object, based on the facial muscle control coefficient values that match the phoneme sequence, to make facial movements that match each phoneme in the phoneme sequence.

The method for driving an interactive object according to claim 7, further comprising:
acquiring driving data of a body pose associated with the facial pose parameter values; and
driving the interactive object, based on the driving data of the body pose associated with the facial pose parameter values, to perform limb movements.

The method for driving an interactive object according to claim 3, wherein acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence comprises:
sampling the feature information of the phoneme sequence at a predetermined time interval to obtain sampled feature information corresponding to a first sampling time; and
inputting the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information.

The method for driving an interactive object according to claim 9, wherein the pre-trained neural network includes a long short-term memory (LSTM) network and a fully connected network, and
inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information comprises:
inputting the sampled feature information corresponding to the first sampling time into the LSTM network, which outputs associated feature information based on the sampled feature information preceding the first sampling time; and
inputting the associated feature information into the fully connected network and determining the pose parameter values corresponding to the associated feature information based on the classification result of the fully connected network,
wherein each class in the classification result corresponds to one set of pose parameter values.

The method for driving an interactive object according to claim 9 or 10, wherein the neural network is obtained by training on phoneme sequence samples, and
the method further comprises:
acquiring a video segment of a character speaking;
acquiring, from the video segment, a plurality of first image frames containing the character and a plurality of audio frames corresponding to the plurality of first image frames;
converting the first image frames into second image frames containing the interactive object, and acquiring the pose parameter values corresponding to the second image frames;
labeling the audio frames corresponding to the first image frames with the pose parameter values corresponding to the second image frames; and
obtaining the phoneme sequence samples from the audio frames labeled with the pose parameter values.

The method for driving an interactive object according to claim 11, further comprising:
performing feature encoding on the phoneme sequence sample to obtain feature information corresponding to a second sampling time, and labeling the feature information with the corresponding pose parameter values to obtain a feature information sample; and
training an initial neural network based on the feature information sample, the neural network being obtained by training once the change in network loss satisfies a convergence condition,
wherein the network loss includes the difference between the pose parameter values predicted by the initial neural network and the labeled pose parameter values.

The method for driving an interactive object according to claim 12, wherein the network loss includes the L2 norm of the difference between the pose parameter values predicted by the initial neural network and the labeled pose parameter values, and
the network loss further includes the L1 norm of the pose parameter values predicted by the initial neural network.

An apparatus for driving an interactive object displayed on a display device, the apparatus comprising:
a phoneme sequence acquisition unit for acquiring a phoneme sequence corresponding to voice driving data of the interactive object;
a parameter acquisition unit for acquiring pose parameter values of the interactive object that match the phoneme sequence; and
a drive unit for controlling, based on the pose parameter values, the pose of the interactive object displayed on the display device.

The parameter acquisition unit:
generates, for each of the plurality of phonemes included in the phoneme sequence, a coding sequence corresponding to the phoneme;
obtains feature information of the coding sequence corresponding to the phoneme based on the encoding values of that coding sequence and the duration of the phoneme; and
obtains feature information of the phoneme sequence based on the feature information of the coding sequences corresponding to the respective phonemes.
Here, generating the coding sequence corresponding to each of the plurality of phonemes included in the phoneme sequence includes:
detecting whether the phoneme is present at each time point; and
obtaining the coding sequence corresponding to the phoneme by setting the encoding value at each time point where the phoneme is present to a first numerical value and the encoding value at each time point where the phoneme is absent to a second numerical value. The driving device for an interactive object according to claim 14, characterized by the above.
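
The coding-sequence step of this claim can be illustrated with a short sketch. The function name, the `(phoneme, start_frame, end_frame)` input layout, and the choice of 1 and 0 as the first and second numerical values are assumptions made for illustration; the later step that derives feature information from the encoding values and phoneme durations is not shown.

```python
def phoneme_coding_sequences(timed_phonemes, num_frames, present=1, absent=0):
    """Build one coding sequence per distinct phoneme.

    timed_phonemes: list of (phoneme, start_frame, end_frame) tuples, where
    end_frame is exclusive. This input layout is a hypothetical choice.
    Returns a dict mapping each phoneme to a list of length `num_frames`
    holding `present` where the phoneme occurs and `absent` elsewhere.
    """
    sequences = {}
    for phoneme, start, end in timed_phonemes:
        # Start from an all-absent sequence the first time a phoneme is seen.
        seq = sequences.setdefault(phoneme, [absent] * num_frames)
        for frame in range(max(start, 0), min(end, num_frames)):
            seq[frame] = present
    return sequences


if __name__ == "__main__":
    timed = [("j", 0, 2), ("i", 2, 5), ("j", 5, 6)]
    for phoneme, seq in phoneme_coding_sequences(timed, num_frames=6).items():
        print(phoneme, seq)
```

A phoneme that occurs more than once in the utterance, such as "j" here, still gets a single coding sequence with several runs of the first numerical value.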

The pose parameter includes a facial pose parameter, the facial pose parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control the motion state of at least one facial muscle.
The driving unit
drives the interactive object, based on the facial muscle control coefficient value that matches the phoneme sequence, so that the interactive object performs facial motions that match the respective phonemes in the phoneme sequence.
The driving device for an interactive object according to claim 14 or 15, further comprising a limb driving unit that acquires driving data of a body pose associated with the facial pose parameter value and drives the interactive object to perform limb motions based on that driving data.

When acquiring the pose parameter value of the interactive object corresponding to the feature information of the phoneme sequence,
the parameter acquisition unit:
samples the feature information of the phoneme sequence at predetermined time intervals to obtain sampled feature information corresponding to a first sampling time; and
inputs the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter value of the interactive object corresponding to the sampled feature information.
Here, the neural network includes a long short-term memory (LSTM) network and a fully connected network.
When inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter value of the interactive object corresponding to the sampled feature information,
the parameter acquisition unit:
inputs the sampled feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information based on the sampled feature information preceding the first sampling time; and
inputs the associated feature information into the fully connected network and determines the pose parameter value corresponding to the associated feature information based on the classification result of the fully connected network.
The driving device for an interactive object according to claim 15, wherein each class in the classification result corresponds to one set of the pose parameter values.
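
Two of the simpler steps in this claim, sampling at a fixed interval and mapping a classification result to a set of pose parameter values, can be sketched as below. The LSTM and fully connected layers themselves are not implemented here; `class_scores` stands in for the fully connected network's output, and all function names and the `pose_table` lookup are hypothetical.

```python
def sample_features(feature_frames, interval):
    """Keep every `interval`-th feature vector, i.e. sample the phoneme
    sequence's feature information at a fixed time interval."""
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return feature_frames[::interval]


def pose_from_classification(class_scores, pose_table):
    """Return the set of pose parameter values for the highest-scoring class.

    class_scores: one score per class (assumed output of the fully
    connected network). pose_table: one list of pose parameter values per
    class, reflecting that each class corresponds to one set of values.
    """
    if len(class_scores) != len(pose_table):
        raise ValueError("need exactly one pose parameter set per class")
    best_class = max(range(len(class_scores)), key=class_scores.__getitem__)
    return pose_table[best_class]


if __name__ == "__main__":
    frames = [[0.0], [0.1], [0.2], [0.3], [0.4]]
    print(sample_features(frames, 2))
    table = [[0.0, 0.0], [0.5, 0.2], [1.0, 0.9]]
    print(pose_from_classification([0.1, 0.7, 0.2], table))
```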

The neural network is obtained by training on phoneme sequence samples.
The driving device for an interactive object further comprises a sample acquisition unit.
The sample acquisition unit:
acquires a video segment in which a character utters speech, and acquires, based on the video segment, a plurality of first image frames containing the character and a plurality of audio frames corresponding to the first image frames;
converts each first image frame into a second image frame containing the interactive object, and acquires the pose parameter value corresponding to the second image frame;
labels the audio frame corresponding to the first image frame with the pose parameter value corresponding to the second image frame; and
obtains the phoneme sequence samples based on the audio frames labeled with the pose parameter values.
The driving device for an interactive object further comprises a training unit.
The training unit:
performs feature encoding on the phoneme sequence sample to obtain feature information corresponding to a second sampling time, and labels the feature information with the corresponding pose parameter value to obtain a feature information sample; and
trains an initial neural network based on the feature information sample, the neural network being obtained once the change in network loss satisfies a convergence condition.
Here, the network loss includes a difference between the pose parameter value predicted by the initial neural network and the labeled pose parameter value.
The network loss includes the L2 norm of the difference between the pose parameter value predicted by the initial neural network and the labeled pose parameter value.
The driving device for an interactive object according to claim 17, wherein the network loss further includes the L1 norm of the pose parameter value predicted by the initial neural network.
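
The stopping rule in this claim, training until the change in network loss satisfies a convergence condition, can be illustrated with a toy sketch. A single-weight linear model with a mean squared error loss is swapped in for the neural network and for the claimed norm-based loss purely to keep the example small; the threshold `tol`, the learning rate, and the function name are assumptions.

```python
def train_until_converged(samples, lr=0.1, tol=1e-9, max_epochs=10_000):
    """Fit y = w * x by gradient descent and stop once the loss changes by
    less than `tol` between consecutive epochs.

    samples: list of (feature, labeled_value) pairs. The linear model and
    the mean squared error are stand-ins, not the patent's network or loss.
    Returns the learned weight and the last computed loss.
    """
    w = 0.0
    prev_loss = None
    loss = float("inf")
    for _ in range(max_epochs):
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        # Convergence condition: the change in loss is below the threshold.
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
        prev_loss = loss
    return w, loss


if __name__ == "__main__":
    weight, final_loss = train_until_converged([(1.0, 2.0), (2.0, 4.0)])
    print(weight, final_loss)
```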

An electronic device comprising a memory and a processor, wherein
the memory stores computer instructions executable on the processor, and
the processor implements the method according to any one of claims 1 to 13 when executing the computer instructions.

A computer-readable recording medium storing a computer program,
wherein the method according to any one of claims 1 to 13 is implemented when the computer program is executed by a processor.