TWI766499B - Method and apparatus for driving interactive object, device and storage medium - Google Patents

Method and apparatus for driving interactive object, device and storage medium

Info

Publication number
TWI766499B
TWI766499B
Authority
TW
Taiwan
Prior art keywords
phoneme
interactive object
feature information
parameter value
phoneme sequence
Prior art date
Application number
TW109145886A
Other languages
Chinese (zh)
Other versions
TW202138993A (en)
Inventor
吳文岩
吳潛溢
錢晨
宋林森
Original Assignee
大陸商北京市商湯科技開發有限公司
Priority date
Filing date
Publication date
Application filed by 大陸商北京市商湯科技開發有限公司
Publication of TW202138993A
Application granted
Publication of TWI766499B

Classifications

    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F 3/04847: Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G06V 40/174: Facial expression recognition
    • G06V 40/176: Dynamic expression
    • G06V 40/23: Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to a method and an apparatus for driving an interactive object, a device, and a storage medium. The interactive object is displayed on a display device, and the method comprises: obtaining a phoneme sequence corresponding to sound driving data of the interactive object; obtaining a pose parameter value of the interactive object that matches the phoneme sequence; and controlling the pose of the interactive object displayed by the display device according to the pose parameter value.

Description

Method, apparatus, device and storage medium for driving interactive objects

The present disclosure relates to the field of computer technology, and in particular to a method, an apparatus, a device, and a storage medium for driving an interactive object.

Most human-computer interaction is based on key, touch, and voice input, with responses presented as images, text, or virtual characters on a display screen. At present, virtual characters are mostly improvements built on top of voice assistants.

Embodiments of the present disclosure provide a driving solution for an interactive object.

According to an aspect of the present disclosure, a method for driving an interactive object is provided, where the interactive object is displayed on a display device. The method comprises: obtaining a phoneme sequence corresponding to sound driving data of the interactive object; obtaining a pose parameter value of the interactive object that matches the phoneme sequence; and controlling the pose of the interactive object displayed by the display device according to the pose parameter value. In combination with any embodiment provided in the present disclosure, the method further comprises: controlling the display device to output speech and/or text according to the phoneme sequence.

In combination with any embodiment provided in the present disclosure, obtaining the pose parameter value of the interactive object that matches the phoneme sequence comprises: performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and obtaining the pose parameter value of the interactive object corresponding to the feature information of the phoneme sequence.

In combination with any embodiment provided in the present disclosure, performing feature encoding on the phoneme sequence to obtain the feature information of the phoneme sequence comprises: for each of the plural kinds of phonemes contained in the phoneme sequence, generating a coding sequence of that phoneme; obtaining the feature information of the coding sequence of each phoneme according to the coding values of the coding sequence corresponding to that phoneme and the durations of the respective phonemes in the phoneme sequence; and obtaining the feature information of the phoneme sequence according to the feature information of the coding sequences corresponding to the plural kinds of phonemes.

In combination with any embodiment provided in the present disclosure, for each of the plural kinds of phonemes contained in the phoneme sequence, generating the coding sequence of that phoneme comprises: detecting whether the phoneme is present at each time point; and setting the coding value at a time point where the phoneme is present to a first value and the coding value at a time point where the phoneme is absent to a second value, so as to obtain the coding sequence corresponding to the phoneme.

In combination with any embodiment provided in the present disclosure, obtaining the feature information of the coding sequences corresponding to the plural kinds of phonemes according to their coding values and the durations of the respective phonemes comprises: for each of the phonemes, performing a Gaussian convolution operation on the temporally continuous values of the phoneme's coding sequence with a Gaussian filter, to obtain the feature information of the coding sequence corresponding to that phoneme.

In combination with any embodiment provided in the present disclosure, the pose parameters include facial pose parameters, the facial pose parameters include facial muscle control coefficients, and a facial muscle control coefficient is used to control the motion state of at least one facial muscle; controlling the pose of the interactive object displayed by the display device according to the pose parameter value comprises: driving, according to the facial muscle control coefficient values matching the phoneme sequence, the interactive object to make facial motions matching the respective phonemes in the phoneme sequence.

In combination with any embodiment provided in the present disclosure, the method further comprises: obtaining driving data of a body pose associated with the facial pose parameter values; and controlling the pose of the interactive object displayed by the display device according to the pose parameter value comprises: driving the interactive object to make body movements according to the driving data of the body pose associated with the facial pose parameter values.

In combination with any embodiment provided in the present disclosure, obtaining the pose parameter value of the interactive object corresponding to the feature information of the phoneme sequence comprises: sampling the feature information of the phoneme sequence at a set time interval to obtain sampled feature information corresponding to a first sampling time; and inputting the sampled feature information corresponding to the first sampling time into a pre-trained neural network, to obtain the pose parameter value of the interactive object corresponding to the sampled feature information.

In combination with any embodiment provided in the present disclosure, the neural network includes a long short-term memory network and a fully connected network, and inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter value of the interactive object corresponding to the sampled feature information comprises: inputting the sampled feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information based on the sampled feature information before the first sampling time; and inputting the associated feature information into the fully connected network and determining, according to the classification result of the fully connected network, the pose parameter value corresponding to the associated feature information, where each class in the classification result corresponds to one set of pose parameter values.

In combination with any embodiment provided in the present disclosure, the neural network is trained with phoneme sequence samples, and the method further comprises: obtaining a video segment of a character uttering speech; obtaining, from the video segment, a plurality of first image frames containing the character and a plurality of audio frames respectively corresponding to the first image frames; converting the first image frames into second image frames containing the interactive object and obtaining the pose parameter values corresponding to the second image frames; annotating the audio frames corresponding to the first image frames according to the pose parameter values corresponding to the second image frames; and obtaining the phoneme sequence samples according to the audio frames annotated with the pose parameter values.

In combination with any embodiment provided in the present disclosure, the method further comprises: performing feature encoding on the phoneme sequence samples, obtaining the feature information corresponding to a second sampling time, and annotating the feature information with the corresponding pose parameter values to obtain feature information samples; and training an initial neural network with the feature information samples, the neural network being obtained once the change in the network loss satisfies a convergence condition, where the network loss includes the difference between the pose parameter values predicted by the initial neural network and the annotated pose parameter values.

In combination with any embodiment provided in the present disclosure, the network loss includes the two-norm of the difference between the pose parameter values predicted by the initial neural network and the annotated pose parameter values; the network loss further includes the one-norm of the pose parameter values predicted by the initial neural network.

According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, where the interactive object is displayed on a display device. The apparatus includes: a phoneme sequence obtaining unit configured to obtain a phoneme sequence corresponding to sound driving data of the interactive object; a parameter obtaining unit configured to obtain a pose parameter value of the interactive object that matches the phoneme sequence; and a driving unit configured to control the pose of the interactive object displayed by the display device according to the pose parameter value.

According to an aspect of the present disclosure, an electronic device is provided. The device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object according to any embodiment provided in the present disclosure.

According to an aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the method for driving an interactive object according to any embodiment provided in the present disclosure.

With the method, apparatus, device, and computer-readable storage medium for driving an interactive object according to one or more embodiments of the present disclosure, the phoneme sequence corresponding to the sound driving data of the interactive object displayed by a display device is obtained, the pose parameter value of the interactive object matching the phoneme sequence is obtained, and the pose of the interactive object displayed by the display device is controlled according to that pose parameter value, so that the interactive object makes a pose that matches its communication with, or its response to, the target object. The target object thus gets the feeling of communicating with the interactive object, which improves the interactive experience between the target object and the interactive object.

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may denote three cases: A alone, both A and B, and B alone. In addition, the term "at least one of" herein denotes any one of a plurality of items, or any combination of at least two of them; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.

At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, a tablet computer, a game console, a desktop computer, an advertising machine, an all-in-one machine, or a vehicle-mounted terminal; the server includes a local server, a cloud server, and the like. The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.

In the embodiments of the present disclosure, the interactive object may be any virtual image capable of interacting with a target object. In an embodiment, the interactive object may be a virtual character, or may be a virtual animal, a virtual item, a cartoon figure, or any other virtual image capable of realizing interactive functions. The interactive object may be presented in 2D or 3D form, which is not limited by the present disclosure. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object may express a demand by making gestures or body movements, actively triggering the interactive object to interact with it. In another example, the interactive object may actively greet the target object or prompt it to make an action, so that the target object interacts with the interactive object in a passive manner.

The interactive object may be displayed through a terminal device, which may be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like; the present disclosure does not limit the specific form of the terminal device.

FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure. As shown in FIG. 1, the display device has a transparent display screen on which a stereoscopic picture can be displayed, presenting a virtual scene with a stereoscopic effect as well as the interactive object. For example, the interactive object displayed on the transparent display screen in FIG. 1 includes a virtual cartoon character. In some embodiments, the terminal device described in the present disclosure may also be the above display device with a transparent display screen, configured with a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object provided by the present disclosure, so as to drive the interactive object displayed on the transparent display screen to communicate with or respond to the target object.

In some embodiments, in response to sound driving data for driving the interactive object to output speech, the interactive object may utter specified speech to the target object. The terminal device may generate sound driving data according to the actions, expressions, identity, preferences, and so on of the target objects around the terminal device, so as to drive the interactive object to communicate or respond by uttering the specified speech, thereby providing an anthropomorphic service for the target object. It should be noted that the sound driving data may also be generated in other ways, for example, generated by a server and sent to the terminal device.

During the interaction between the interactive object and the target object, driving the interactive object to utter the specified speech according to the sound driving data may fail to drive the interactive object to make facial motions synchronized with that speech, making the interactive object appear stiff and unnatural while speaking and harming the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure provides a method for driving an interactive object, so as to improve the target object's experience of interacting with the interactive object.

FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.

Step 201: obtain a phoneme sequence corresponding to the sound driving data of the interactive object.

The sound driving data may include audio data (voice data), text, and so on. If the sound driving data is audio data, it can be used directly to drive the interactive object to output speech, that is, the terminal device outputs speech directly from the audio data; if the sound driving data is text, the corresponding phonemes need to be generated according to the morphemes contained in the text, and the generated phonemes are used to drive the interactive object to output speech. The sound driving data may also be other forms of driving data, which the present disclosure does not limit.

In the embodiments of the present disclosure, the sound driving data may be driving data generated by a server or a terminal device according to the actions, expressions, identity, preferences, and so on of the target object interacting with the interactive object, or it may be sound driving data invoked by the terminal device from its internal memory. The present disclosure does not limit the way the sound driving data is obtained.

If the sound driving data is audio data, phonemes can be formed by splitting the audio data into a plurality of audio frames and combining the audio frames according to their states; the phonemes formed from the audio data then constitute the phoneme sequence. A phoneme is the smallest speech unit divided according to the natural attributes of speech, and one articulatory action of a real person can form one phoneme.

If the sound driving data is text, the phonemes corresponding to the morphemes contained in the text can be obtained according to those morphemes, thereby obtaining the corresponding phoneme sequence.

Those skilled in the art should understand that the phoneme sequence corresponding to the sound driving data may also be obtained in other ways, which the present disclosure does not limit.

Step 202: obtain the pose parameter value of the interactive object that matches the phoneme sequence.

In the embodiments of the present disclosure, the pose parameter value of the interactive object matching the phoneme sequence may be obtained according to the acoustic features of the phoneme sequence; alternatively, the phoneme sequence may be feature-encoded and the pose parameter value corresponding to the feature code determined, thereby determining the pose parameter value corresponding to the phoneme sequence.

The pose parameters are used to control the pose of the interactive object, and different pose parameter values can drive the interactive object to make corresponding poses. The pose parameters include facial pose parameters and, in some embodiments, may also include limb pose parameters. The facial pose parameters are used to control the facial pose of the interactive object, including expressions, mouth shapes, movements of the facial features, head pose, and so on; the limb pose parameters are used to control the limb pose of the interactive object, that is, to drive the interactive object to make body movements. In the embodiments of the present disclosure, a correspondence between some feature of the phoneme sequence and the pose parameter values of the interactive object can be established in advance, so that the corresponding pose parameter values can be obtained from the phoneme sequence. The specific method of obtaining the pose parameter value of the interactive object that matches the phoneme sequence is detailed later. The specific form of the pose parameters can be determined according to the type of the interactive object model.

Step 203: control the pose of the interactive object displayed by the display device according to the pose parameter value.

The pose parameter value matches the phoneme sequence corresponding to the sound driving data of the interactive object, so controlling the pose of the interactive object according to the pose parameter value makes the pose of the interactive object match its communication with, or its response to, the target object. For example, while the interactive object is communicating or responding with speech, the poses it makes are synchronized with the output speech, giving the target object the feeling that the interactive object is speaking.

In the embodiments of the present disclosure, the phoneme sequence corresponding to the sound driving data of the interactive object displayed by the display device is obtained, the pose parameter value of the interactive object matching the phoneme sequence is obtained, and the pose of the interactive object displayed by the display device is controlled according to that value, so that the interactive object makes a pose that matches its communication with, or its response to, the target object, giving the target object the feeling of communicating with the interactive object and improving the target object's interactive experience.

In some embodiments, the method is applied to a server, including a local server or a cloud server. The server processes the sound driving data of the interactive object, generates the pose parameter values of the interactive object, and renders with a three-dimensional rendering engine according to the pose parameter values, obtaining an animation of the interactive object. The server may send the animation to a terminal for display to communicate with or respond to the target object, and may also send the animation to the cloud, so that the terminal can obtain the animation from the cloud to communicate with or respond to the target object. After generating the pose parameter values of the interactive object, the server may also send them to the terminal, so that the terminal completes the process of rendering, generating the animation, and displaying it.

In some embodiments, the method is applied to a terminal. The terminal processes the sound driving data of the interactive object, generates the pose parameter values of the interactive object, and renders with a three-dimensional rendering engine according to the pose parameter values, obtaining an animation of the interactive object, which the terminal can display to communicate with or respond to the target object.

In some embodiments, the display device may be controlled to output speech and/or display text according to the phoneme sequence, and the pose of the interactive object displayed by the display device may be controlled according to the pose parameter values at the same time as the speech is output and/or the text is displayed.

In the embodiments of the present disclosure, since the pose parameter values match the phoneme sequence, when outputting speech and/or displaying text according to the phoneme sequence is performed synchronously with controlling the pose of the interactive object according to the pose parameter values, the poses made by the interactive object are synchronized with the output speech and/or the displayed text, giving the target object the feeling that the interactive object is speaking.

Since the output of sound must remain continuous, in one embodiment a time window is moved along the phoneme sequence and the phonemes inside the window are output at each move, with a set duration as the step of each move. For example, the window length may be set to 1 second and the set duration to 0.1 second. While the phonemes in the time window are output, the pose parameter values corresponding to the phoneme, or to the feature information of the phoneme, at a set position of the time window are obtained, and the pose of the interactive object is controlled with those pose parameter values. The set position is located a set duration from the start of the time window; for example, with a window length of 1 s, the set position may be 0.5 s from the start of the window. With each move of the time window, while the phonemes in the window are output, the pose of the interactive object is controlled with the pose parameter values corresponding to the set position of the window, so that the pose of the interactive object is synchronized with the output speech, giving the target object the feeling that the interactive object is speaking.
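A minimal Python sketch of this sliding-window scheme, using the example values from the text (1 s window, 0.1 s step, query position 0.5 s into the window); `play_phonemes`, `features_at`, `pose_params_for`, and `apply_pose` are hypothetical helpers standing in for the output, feature-lookup, parameter-matching, and rendering steps:

```python
WINDOW = 1.0  # time window length in seconds
STEP = 0.1    # set duration: how far the window moves each step
QUERY = 0.5   # set position: offset inside the window that drives the pose

def drive_with_window(phoneme_seq, total_duration):
    t = 0.0
    while t < total_duration:
        # output the phonemes falling inside the current window
        play_phonemes(phoneme_seq, start=t, end=t + WINDOW)
        # the pose comes from the feature information at the set position
        feat = features_at(phoneme_seq, t + QUERY)
        apply_pose(pose_params_for(feat))
        t += STEP  # move the window by the set duration
```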

By changing the set duration, the time interval (frequency) at which pose parameter values are obtained can be changed, thereby changing how often the interactive object makes poses. The set duration can be chosen according to the actual interaction scene, so that the pose changes of the interactive object are more natural.

In some embodiments, feature encoding can be performed on the phoneme sequence to obtain feature information of the phoneme sequence, and the pose parameter values of the interactive object are determined according to the feature information.

In the embodiments of the present disclosure, the phoneme sequence corresponding to the sound driving data of the interactive object is feature-encoded, and the corresponding pose parameter values are obtained from the resulting feature information, so that while sound is output according to the phoneme sequence, the pose of the interactive object is controlled according to the pose parameter values corresponding to the feature information. In particular, the interactive object is driven to make facial motions according to the facial pose parameter values corresponding to the feature information, so that the expression of the interactive object is synchronized with the emitted sound, giving the target object the feeling that the interactive object is speaking and improving the target object's interactive experience.

In some embodiments, feature encoding may be performed on the phoneme sequence in the following way to obtain its feature information.

First, for the plural kinds of phonemes contained in the phoneme sequence, coding sequences respectively corresponding to the phonemes are generated.

In one example, it is detected whether a first phoneme, which is any one of the plural phonemes, is present at each time point; the coding value at a time point where the first phoneme is present is set to a first value, and the coding value at a time point where it is absent is set to a second value; after the coding values at the respective time points have been assigned, the coding sequence corresponding to the first phoneme is obtained. For example, the coding value at a time point where the first phoneme is present may be set to 1, and the coding value at a time point where it is absent to 0. That is, for each phoneme contained in the phoneme sequence, it is detected whether that phoneme is present at each time point; the coding value at a time point where the phoneme is present is set to the first value and at a time point where it is absent to the second value, and after the coding values have been assigned, the coding sequence corresponding to that phoneme is obtained. Those skilled in the art should understand that the above settings of the coding values are only examples, and the coding values may also be set to other values, which the present disclosure does not limit.
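As a sketch, the per-phoneme coding sequences described above can be built as 0/1 arrays over discrete time points; the discretized timeline representation below is an assumption, not part of the patent:

```python
import numpy as np

def phoneme_coding_sequences(timeline, first_value=1.0, second_value=0.0):
    """timeline: list of phoneme labels, one per time point
    (e.g. ['j', 'j', 'i1', 'i1', ...]).
    Returns one coding sequence per distinct phoneme."""
    return {
        p: np.array([first_value if label == p else second_value
                     for label in timeline])
        for p in set(timeline)
    }
```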

Then, the feature information of the coding sequence corresponding to each phoneme is obtained according to the coding values of that coding sequence and the duration of the phoneme in the phoneme sequence.

In one example, for the coding sequence corresponding to the first phoneme, a Gaussian filter is used to perform a Gaussian convolution operation on the temporally continuous values of the first phoneme, to obtain the feature information of the coding sequence corresponding to the first phoneme, where the first phoneme is any one of the plural phonemes.
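A possible realization of this step with SciPy's 1-D Gaussian filter; the filter width `sigma` is an assumption, since the patent does not specify one:

```python
from scipy.ndimage import gaussian_filter1d

def smooth_coding_sequences(coding_seqs, sigma=2.0):
    # Gaussian convolution over each phoneme's 0/1 coding sequence,
    # smoothing the transitions between the first and second values
    return {p: gaussian_filter1d(seq, sigma=sigma)
            for p, seq in coding_seqs.items()}
```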

Finally, the feature information of the phoneme sequence is obtained from the set of feature information of the respective coding sequences.

FIG. 3 is a schematic diagram of the process of feature-encoding a phoneme sequence. As shown in FIG. 3, the phoneme sequence 310 contains the phonemes j, i1, j, ie4 (for brevity, only some of the phonemes are shown), and for each kind of phoneme j, i1, ie4, a coding sequence 321, 322, 323 corresponding to that phoneme is obtained. In each coding sequence, the coding value at a time point where the phoneme is present is set to the first value (for example 1), and the coding value at a time point where the phoneme is absent is set to the second value (for example 0). Taking the coding sequence 321 as an example, at time points where phoneme j is present in the phoneme sequence 310, the value of coding sequence 321 is the first value, and at time points where phoneme j is absent, its value is the second value. All the coding sequences 321, 322, 323 constitute the overall coding sequence 320.

The feature information of the coding sequences 321, 322, 323 can be obtained according to the coding values of the coding sequences 321, 322, 323 corresponding to phonemes j, i1, ie4, and the durations of the corresponding phonemes in the three coding sequences, that is, the duration of j in coding sequence 321, the duration of i1 in coding sequence 322, and the duration of ie4 in coding sequence 323.

For example, a Gaussian filter may be used to perform a Gaussian convolution operation on the temporally continuous values of the phonemes j, i1, ie4 in the coding sequences 321, 322, 323 respectively, to obtain the feature information of the coding sequences. That is, the Gaussian convolution performed on the temporally continuous values of a phoneme smooths the transitions of the coding values in each coding sequence from the second value to the first value and from the first value to the second value. A Gaussian convolution operation is performed on each of the coding sequences 321, 322, 323 to obtain the feature values of each coding sequence, where the feature values are the parameters constituting the feature information; the feature information 330 corresponding to the phoneme sequence 310 is then obtained from the set of feature information of the respective coding sequences. Those skilled in the art should understand that other operations may also be performed on the coding sequences to obtain their feature information, which the present disclosure does not limit.

In the embodiments of the present disclosure, obtaining the feature information of the coding sequences according to the duration of each phoneme in the phoneme sequence smooths the transition phases of the coding sequences; for example, besides 0 and 1, the values of the coding sequences also take intermediate values such as 0.2 and 0.3. The pose parameter values obtained from these intermediate values make the pose transitions of the interactive character gentler and more natural, in particular making the expression changes of the interactive character gentler and more natural, which improves the target object's interactive experience.

In some embodiments, the facial pose parameters may include facial muscle control coefficients.

From an anatomical point of view, the movement of a human face is the result of the coordinated deformation of the muscles of its various parts. Therefore, a facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each of the resulting muscles (regions) is controlled through a corresponding facial muscle control coefficient, that is, its contraction/expansion is controlled, so that the face of the interactive character can make various expressions. For each muscle of the facial muscle model, the motion states corresponding to different muscle control coefficients can be set according to the facial position of the muscle and the motion characteristics of the muscle itself. For example, the control coefficient of the upper lip muscle has a value range of 0 to 1; different values within this range correspond to different contraction/expansion states of the upper lip muscle, and changing this value achieves vertical opening and closing of the mouth. The control coefficient of the left mouth-corner muscle likewise has a value range of 0 to 1; different values within this range correspond to the contraction/expansion states of the left mouth-corner muscle, and changing this value achieves lateral change of the mouth.
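Illustratively, such a facial muscle model can be represented as a mapping from muscle names to control coefficients clamped to their ranges; the two muscle names and the 0-to-1 range below follow the examples in the text, and everything else is an assumption:

```python
# coefficient -> motion state, per the examples above:
# upper_lip controls vertical opening/closing of the mouth,
# left_mouth_corner controls the lateral change of the mouth
face_coeffs = {"upper_lip": 0.0, "left_mouth_corner": 0.0}

def set_muscle(coeffs, name, value, lo=0.0, hi=1.0):
    coeffs[name] = min(max(value, lo), hi)  # keep within the allowed range
```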

While sound is output according to the phoneme sequence, the interactive object is driven to make facial expressions according to the facial muscle control coefficient values corresponding to the phoneme sequence, so that while the display device outputs the sound, the interactive object synchronously makes the expressions of uttering that sound, giving the target object the feeling that the interactive object is speaking and improving the target object's interactive experience.

In some embodiments, the facial motions of the interactive object may be associated with body poses, that is, the facial pose parameter values corresponding to the facial motions may be associated with the body poses, where the body poses may include body movements, gestures, walking postures, and so on.

During the driving of the interactive object, the driving data of the body pose associated with the facial pose parameter values is obtained; while sound is output according to the phoneme sequence, the interactive object is driven to make body movements according to the driving data of the body pose associated with the facial pose parameter values. That is, while the interactive object is driven to make facial motions according to its sound driving data, the driving data of the associated body pose is also obtained from the facial pose parameter values corresponding to those facial motions, so that while the sound is output, the interactive object can be driven to synchronously make the corresponding facial and body movements, making the speaking state of the interactive object more vivid and natural and improving the target object's interactive experience. A sketch of one possible association mechanism follows.
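The patent does not name a mechanism for this association; as an assumption, it could be a nearest-neighbour lookup from facial pose parameter values to body-pose driving data, where the table and distance metric below are both hypothetical:

```python
import numpy as np

def body_drive_for(face_params, assoc_table):
    # assoc_table: list of (facial_pose_values, body_pose_driving_data)
    # face_params: np.ndarray of the current facial pose parameter values
    keys = np.array([k for k, _ in assoc_table])
    idx = np.argmin(np.linalg.norm(keys - face_params, axis=1))
    return assoc_table[idx][1]  # driving data of the associated body pose
```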

In some embodiments, the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence can be obtained by the following method.

First, the feature information of the phoneme sequence is sampled at a set time interval to obtain the sampled feature information corresponding to each first sampling time. For example, with a set time interval of 0.1 s, the first sampling times may be 0.1 s, 0.2 s, 0.3 s, and so on.
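A small sketch of sampling the feature information at the set interval; the frame rate of the feature sequence is an assumed parameter:

```python
def sample_feature_info(feature_info, interval=0.1, frames_per_second=100):
    # feature_info: array of shape [time_steps, num_phonemes]
    step = int(interval * frames_per_second)
    # rows at 0.1 s, 0.2 s, 0.3 s, ... when interval = 0.1 s
    return feature_info[step::step]
```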

Referring to FIG. 3, the feature information 330 is time-based information; therefore, when it is sampled at the set time interval, the sampled feature information corresponding to each first sampling time can be obtained.

Next, the sampled feature information corresponding to a first sampling time is input into the pre-trained neural network, and the pose parameter value of the interactive object corresponding to that sampled feature information is obtained. Based on the sampled feature information corresponding to each first sampling time, the pose parameter value of the interactive object corresponding to each first sampling time can be obtained.

As described above, when phonemes are output by moving a time window along the phoneme sequence, the feature information at the set position of the time window, that is, the feature information at the first sampling time corresponding to that set position, is obtained, and the pose of the interactive object is controlled with the pose parameter values corresponding to that feature information. The interactive object thus makes poses adapted to the speech it utters, making the process of the interactive object uttering speech more vivid and natural.

In some embodiments, the neural network includes a long short-term memory (LSTM) network and a fully connected network. The LSTM network is a temporally recurrent neural network that can learn the historical information of the input sampled feature information; moreover, the LSTM network and the fully connected network are trained jointly.

When the neural network includes an LSTM network and a fully connected network, the sampled feature information corresponding to the first sampling time is first input into the LSTM network, which outputs associated feature information based on the sampled feature information before the first sampling time. That is, the information output by the LSTM network incorporates the influence of the historical feature information on the current feature information. Next, the associated feature information is input into the fully connected network, and the pose parameter values corresponding to the associated feature information are determined according to the classification result of the fully connected network, where each class corresponds to one set of pose parameter values, that is, to one distribution of facial muscle control coefficients.
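A minimal PyTorch sketch of this LSTM-plus-fully-connected structure; the feature dimension, hidden size, and number of classes are assumptions, and `pose_table`, a lookup from each class to one set of pose parameter values, is hypothetical:

```python
import torch
import torch.nn as nn

class PosePredictor(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, num_classes=200):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, sampled_feats):      # [batch, time, feat_dim]
        # the LSTM output at each step fuses the sampled feature
        # information before that step with the current one
        assoc, _ = self.lstm(sampled_feats)
        return self.fc(assoc)              # logits over pose-value classes

# usage: each predicted class indexes one set of pose parameter values
# logits = model(feats)
# pose_values = pose_table[logits.argmax(dim=-1)]  # pose_table: hypothetical
```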

In the embodiments of the present disclosure, predicting the pose parameter values corresponding to the sampled feature information of the phoneme sequence with an LSTM network and a fully connected network fuses the correlated historical feature information with the current feature information, so that the historical pose parameter values influence the changes of the current pose parameter values, making the changes of the interactive character's pose parameter values gentler and more natural.

In some embodiments, the neural network can be trained in the following way.

First, phoneme sequence samples are obtained, the samples containing the pose parameter values of the interactive object annotated at second sampling times at a set time interval. In the phoneme sequence sample shown in FIG. 4, the dotted lines indicate the second sampling times, and the pose parameter values of the interactive object are annotated at each second sampling time.

Next, feature encoding is performed on the phoneme sequence samples to obtain the feature information corresponding to each second sampling time, and the feature information is annotated with the corresponding pose parameter values to obtain feature information samples. That is, a feature information sample contains the pose parameter values of the interactive object annotated at a second sampling time.

After the feature information samples are obtained, the neural network can be trained with them, the training being completed when the network loss is below a set loss value, where the network loss includes the difference between the pose parameter values predicted by the neural network and the annotated pose parameter values.

In one example, the network loss function is expressed as formula (1):

$$L = \left\| \hat{p} - p \right\|_2 \tag{1}$$

where $\hat{p}_i$ is the $i$-th pose parameter value predicted by the neural network, $p_i$ is the $i$-th annotated pose parameter value, that is, the ground-truth value, and $\|\cdot\|_2$ denotes the two-norm of a vector.

The network parameter values of the neural network are adjusted to minimize the network loss function; training is completed when the change in the network loss satisfies a convergence condition, for example when the change in network loss is smaller than a set threshold, or when the number of iterations reaches a set count, at which point the trained neural network is obtained.
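By way of illustration only, a training loop matching this description might be sketched as follows, assuming a network that directly regresses pose parameter values; the optimizer, learning rate, convergence threshold, and iteration cap are assumptions for the sketch, not details from this disclosure.

```python
import torch

def train(net, loader, max_iters=10000, loss_delta=1e-5, lr=1e-3):
    # Minimize the network loss of Eq. (1): the summed two-norm of the
    # difference between predicted and annotated pose parameter values.
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    prev = float("inf")
    for step, (features, target_poses) in enumerate(loader):
        pred = net(features)
        loss = (pred - target_poses).norm(p=2, dim=-1).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Stop when the change in loss satisfies the convergence condition
        # or the number of iterations reaches the set count.
        if abs(prev - loss.item()) < loss_delta or step >= max_iters:
            break
        prev = loss.item()
    return net
```

A single pass over the loader is shown for brevity; in practice the loop would repeat over multiple epochs.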

In another example, the network loss function is expressed as Equation (2):

$$L = \sum_{i}\left\|\hat{p}_{i}-p_{i}\right\|_{2} + \sum_{i}\left\|\hat{p}_{i}\right\|_{1} \tag{2}$$

where $\hat{p}_{i}$ is the $i$-th pose parameter value predicted by the neural network, $p_{i}$ is the $i$-th annotated pose parameter value, i.e., the ground-truth value, $\left\|\cdot\right\|_{2}$ denotes the two-norm of a vector, and $\left\|\cdot\right\|_{1}$ denotes the one-norm of a vector.

Adding the one-norm of the predicted pose parameter values to the network loss function imposes a sparsity constraint on the facial parameters.
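The two loss variants can be written compactly as below, assuming predictions and annotations are batched as tensors whose last dimension holds one pose parameter vector:

```python
import torch

def loss_eq1(pred, target):
    # Eq. (1): two-norm of the prediction error per pose vector, summed.
    return (pred - target).norm(p=2, dim=-1).sum()

def loss_eq2(pred, target):
    # Eq. (2): Eq. (1) plus the one-norm of the predictions, which
    # imposes the sparsity constraint on the facial parameters.
    return loss_eq1(pred, target) + pred.norm(p=1, dim=-1).sum()
```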

In some embodiments, phoneme sequence samples may be obtained as follows.

First, a video segment of a character uttering speech is acquired. For example, a video segment of a real person speaking may be acquired.

From the video segment, a plurality of first image frames containing the character and a plurality of audio frames corresponding to the first image frames are acquired. That is, the video segment is split into image frames and audio frames with a one-to-one correspondence between them, so that for any image frame, the audio frame containing the sound the character produced while making the expression in that image frame can be determined.

Next, the first image frames, i.e., the image frames containing the character, are converted into second image frames containing the interactive object, and the pose parameter values corresponding to the second image frames are acquired. Taking first image frames that contain a real person as an example, each such frame can be converted into a second image frame containing the figure represented by the interactive object; since the pose parameter values of the real person correspond to those of the interactive object, the pose parameter values of the interactive object in each second image frame can be obtained.

Afterwards, the audio frames corresponding to the first image frames are annotated with the pose parameter values corresponding to the second image frames, and phoneme sequence samples are obtained from the annotated audio frames.

In the embodiments of the present disclosure, splitting a character's video segment into corresponding image frames and audio frames, and converting the first image frames containing a real person into second image frames containing the interactive object in order to obtain the pose parameter values corresponding to the phoneme sequence, yields a good correspondence between phonemes and pose parameter values, and thus higher-quality phoneme sequence samples. A sketch of this pipeline follows.
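In the sketch, `extract_frames`, `extract_audio_frames`, `retarget_to_interactive_object`, `pose_params_of`, and `phonemize` are hypothetical stand-ins for the frame-extraction, retargeting, annotation, and phonemization tools, which the disclosure does not name.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedAudioFrame:
    audio: bytes       # audio for one frame interval
    pose_params: list  # pose parameter values of the interactive object

def build_phoneme_samples(video_path):
    image_frames = extract_frames(video_path)        # first image frames
    audio_frames = extract_audio_frames(video_path)  # aligned one-to-one
    samples = []
    for img, aud in zip(image_frames, audio_frames):
        second = retarget_to_interactive_object(img)  # second image frame
        samples.append(AnnotatedAudioFrame(aud, pose_params_of(second)))
    # Convert the annotated audio frames into a phoneme sequence sample.
    return phonemize(samples)
```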

FIG. 5 shows a schematic structural diagram of an apparatus for driving an interactive object according to at least one embodiment of the present disclosure, where the interactive object is displayed on a display device. As shown in FIG. 5, the apparatus may include: a phoneme sequence acquisition unit 501, configured to acquire the phoneme sequence corresponding to the sound driving data of the interactive object; a parameter acquisition unit 502, configured to acquire the pose parameter values of the interactive object matching the phoneme sequence; and a driving unit 503, configured to control the pose of the interactive object displayed by the display device according to the pose parameter values.

In some embodiments, the apparatus further includes an output unit configured to control the display device to output speech and/or display text according to the phoneme sequence.

In some embodiments, the parameter acquisition unit is specifically configured to: perform feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and acquire the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence.

In some embodiments, when performing feature encoding on the phoneme sequence to obtain its feature information, the parameter acquisition unit is specifically configured to: for each of the multiple phonemes contained in the phoneme sequence, generate the coding sequence corresponding to that phoneme; obtain the feature information of the coding sequences corresponding to the multiple phonemes according to the coding values of those coding sequences and the durations of the multiple phonemes in the phoneme sequence; and obtain the feature information of the phoneme sequence from the feature information of the coding sequences respectively corresponding to the multiple phonemes.

In some embodiments, when generating the coding sequences respectively corresponding to the multiple phonemes contained in the phoneme sequence, the parameter acquisition unit is specifically configured to: detect whether each time point corresponds to a first phoneme, the first phoneme being any one of the multiple phonemes; and obtain the coding sequence corresponding to the first phoneme by setting the coding value at time points where the first phoneme is present to a first value, and the coding value at time points where the first phoneme is absent to a second value.

In some embodiments, when obtaining the feature information of the coding sequences corresponding to the multiple phonemes according to the coding values of those coding sequences and the durations of the multiple phonemes in the phoneme sequence, the parameter acquisition unit is specifically configured to: for the coding sequence corresponding to a first phoneme, perform a Gaussian convolution operation on the temporally continuous values of the first phoneme with a Gaussian filter, obtaining the feature information of the coding sequence corresponding to the first phoneme, where the first phoneme is any one of the multiple phonemes.
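The binary coding and the Gaussian convolution can be sketched together as follows; the phoneme timeline representation, the 0/1 coding values, and the filter width sigma are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def encode_phoneme(timeline, phoneme):
    # Coding sequence for one phoneme: a first value (1) at time points
    # where the phoneme is present, a second value (0) where it is not.
    return np.array([1.0 if p == phoneme else 0.0 for p in timeline])

def phoneme_features(timeline, sigma=2.0):
    # Gaussian convolution over the temporally continuous coding values
    # yields the feature information of each phoneme's coding sequence.
    return {ph: gaussian_filter1d(encode_phoneme(timeline, ph), sigma)
            for ph in set(timeline)}

# Hypothetical timeline sampled at uniform time points.
features = phoneme_features(["sil", "n", "i", "i", "h", "ao", "ao", "sil"])
```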

In some embodiments, the pose parameters include facial pose parameters, and the facial pose parameters include facial muscle control coefficients used to control the motion state of at least one facial muscle; the driving unit is specifically configured to drive the interactive object to make facial actions matching each phoneme in the phoneme sequence according to the facial muscle control coefficients matching the phoneme sequence.
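As a hedged illustration of this driving step, applying the coefficients could look like the following, where `rig.set_muscle` is a hypothetical renderer-side call, not an API named in this disclosure.

```python
def drive_face(rig, muscle_coefficients):
    # Each coefficient controls the motion state of one facial muscle.
    for muscle_id, value in muscle_coefficients.items():
        rig.set_muscle(muscle_id, value)  # hypothetical renderer call
```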

In some embodiments, the apparatus further includes an action driving unit configured to acquire driving data of a body pose associated with the facial pose parameter values, and to drive the interactive object to make body movements according to that driving data.

In some embodiments, when acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence, the parameter acquisition unit is specifically configured to: sample the feature information of the phoneme sequence at a set time interval to obtain the sampled feature information corresponding to a first sampling time; and input the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information.
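A minimal sketch of this sampling step, assuming the feature information is a (time, feature_dim) tensor and `net` is a pre-trained network such as the one sketched earlier:

```python
import torch

def pose_values_for(feature_seq, net, interval=4):
    # Take the feature information at every `interval`-th time point
    # (the set time interval) and predict pose parameter values.
    sampled = feature_seq[::interval]
    with torch.no_grad():
        return net(sampled.unsqueeze(0))  # add a batch dimension
```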

In some embodiments, the neural network includes an LSTM network and a fully connected network; when inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information, the parameter acquisition unit is specifically configured to: input the sampled feature information corresponding to the first sampling time into the LSTM network, which outputs associated feature information according to the sampled feature information preceding the first sampling time; and input the associated feature information into the fully connected network, determining the pose parameter values corresponding to the associated feature information according to the classification result of the fully connected network, where each class in the classification result corresponds to one set of pose parameter values.

In some embodiments, the neural network is obtained by training on phoneme sequence samples. The apparatus further includes a sample acquisition unit configured to: acquire a video segment of a character uttering speech; acquire from the video segment a plurality of first image frames containing the character and a plurality of corresponding audio frames; convert the first image frames into second image frames containing the interactive object and acquire the pose parameter values corresponding to the second image frames; annotate the audio frames corresponding to the first image frames according to the pose parameter values corresponding to the second image frames; and obtain phoneme sequence samples according to the audio frames annotated with pose parameter values.

At least one embodiment of this specification further provides an electronic device. As shown in FIG. 6, the device includes a memory and a processor, where the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method for driving an interactive object described in any embodiment of the present disclosure when executing the computer instructions.

At least one embodiment of this specification further provides a computer-readable storage medium on which a computer program is stored, and the program, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.

As will be appreciated by those skilled in the art, one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, and optical memory) containing computer-usable program code.

The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the data processing device embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the relevant parts of the description of the method embodiment.

The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.

Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and an apparatus can also be implemented as such circuitry.

Computers suitable for executing a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as describing features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

The above descriptions are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit them; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall fall within their scope of protection.

201: step of acquiring the phoneme sequence corresponding to the sound driving data of the interactive object
202: step of acquiring the pose parameter values of the interactive object matching the phoneme sequence
203: step of controlling the pose of the interactive object displayed by the display device according to the pose parameter values
501: phoneme sequence acquisition unit
502: parameter acquisition unit
503: driving unit

FIG. 1 is a schematic diagram of a display device in a method for driving an interactive object according to at least one embodiment of the present disclosure.
FIG. 2 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a process of feature encoding a phoneme sequence according to at least one embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a phoneme sequence sample according to at least one embodiment of the present disclosure.
FIG. 5 is a schematic structural diagram of an apparatus for driving an interactive object according to at least one embodiment of the present disclosure.
FIG. 6 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.


Claims (14)

1. A method for driving an interactive object, the interactive object being displayed on a display device, the method comprising: acquiring a phoneme sequence corresponding to sound driving data of the interactive object; acquiring pose parameter values of the interactive object matching the phoneme sequence, the pose parameter values including facial pose parameters and body pose parameters, the facial pose parameters including facial muscle control coefficients used to control a motion state of at least one facial muscle; and controlling, according to the pose parameter values, the pose of the interactive object displayed by the display device, at least including driving the interactive object to make facial actions matching each phoneme in the phoneme sequence according to the facial muscle control coefficient values matching the phoneme sequence.

2. The driving method according to claim 1, further comprising: controlling the display device to output speech and/or display text according to the phoneme sequence.

3. The driving method according to claim 1 or 2, wherein acquiring the pose parameter values of the interactive object matching the phoneme sequence comprises: performing feature encoding on the phoneme sequence to obtain feature information of the phoneme sequence; and acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence.

4. The driving method according to claim 3, wherein performing feature encoding on the phoneme sequence to obtain the feature information of the phoneme sequence comprises: for each phoneme of the multiple phonemes contained in the phoneme sequence, generating a coding sequence corresponding to the phoneme; obtaining feature information of the coding sequence corresponding to the phoneme according to coding values of that coding sequence and the duration corresponding to the phoneme; and obtaining the feature information of the phoneme sequence according to the feature information of the coding sequences respectively corresponding to the multiple phonemes.

5. The driving method according to claim 4, wherein, for each phoneme of the multiple phonemes contained in the phoneme sequence, generating the coding sequence corresponding to the phoneme comprises: detecting whether each time point corresponds to the phoneme; and obtaining the coding sequence corresponding to the phoneme by setting the coding value at time points where the phoneme is present to a first value and the coding value at time points where the phoneme is absent to a second value.
6. The driving method according to claim 4, wherein obtaining the feature information of the coding sequences respectively corresponding to the multiple phonemes according to the coding values of those coding sequences and the durations respectively corresponding to the multiple phonemes comprises: for each phoneme of the multiple phonemes, performing, with a Gaussian filter, a Gaussian convolution operation on the temporally continuous values of the phoneme in the coding sequence corresponding to the phoneme, to obtain the feature information of the coding sequence corresponding to the phoneme.

7. The driving method according to claim 1, further comprising: acquiring driving data of a body pose associated with the facial pose parameter values; and driving the interactive object to make body movements according to the driving data of the body pose associated with the facial pose parameter values.

8. The driving method according to claim 3, wherein acquiring the pose parameter values of the interactive object corresponding to the feature information of the phoneme sequence comprises: sampling the feature information of the phoneme sequence at a set time interval to obtain sampled feature information corresponding to a first sampling time; and inputting the sampled feature information corresponding to the first sampling time into a pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information.

9. The driving method according to claim 8, wherein the pre-trained neural network includes a long short-term memory network and a fully connected network, and inputting the sampled feature information corresponding to the first sampling time into the pre-trained neural network to obtain the pose parameter values of the interactive object corresponding to the sampled feature information comprises: inputting the sampled feature information corresponding to the first sampling time into the long short-term memory network, which outputs associated feature information according to sampled feature information preceding the first sampling time; and inputting the associated feature information into the fully connected network and determining, according to a classification result of the fully connected network, the pose parameter values corresponding to the associated feature information, wherein each class in the classification result corresponds to one set of pose parameter values.
10. The driving method according to claim 8, wherein the neural network is obtained by training on phoneme sequence samples, and the method further comprises: acquiring a video segment of a character uttering speech; acquiring, from the video segment, a plurality of first image frames containing the character and a plurality of audio frames respectively corresponding to the plurality of first image frames; converting the first image frames into second image frames containing the interactive object and acquiring pose parameter values corresponding to the second image frames; annotating the audio frames corresponding to the first image frames according to the pose parameter values corresponding to the second image frames; and obtaining the phoneme sequence samples according to the audio frames annotated with the pose parameter values.

11. The driving method according to claim 10, further comprising: performing feature encoding on the phoneme sequence samples to obtain feature information corresponding to second sampling times, and annotating the feature information with corresponding pose parameter values to obtain feature information samples; and training an initial neural network according to the feature information samples, the trained neural network being obtained after the change in network loss satisfies a convergence condition, wherein the network loss includes the difference between the pose parameter values predicted by the initial neural network and the annotated pose parameter values; wherein the network loss includes the two-norm of the difference between the pose parameter values predicted by the initial neural network and the annotated pose parameter values, and the network loss further includes the one-norm of the pose parameter values predicted by the initial neural network.
12. An apparatus for driving an interactive object, the interactive object being displayed on a display device, the apparatus comprising: a phoneme sequence acquisition unit configured to acquire a phoneme sequence corresponding to sound driving data of the interactive object; a parameter acquisition unit configured to acquire pose parameter values of the interactive object matching the phoneme sequence, the pose parameter values including facial pose parameters and body pose parameters, the facial pose parameters including facial muscle control coefficients used to control a motion state of at least one facial muscle; and a driving unit configured to control the pose of the interactive object displayed by the display device according to the pose parameter values, wherein the driving unit is specifically configured to drive the interactive object to make facial actions matching each phoneme in the phoneme sequence according to the facial muscle control coefficient values matching the phoneme sequence.

13. An electronic device, comprising a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the driving method according to any one of claims 1 to 11 when executing the computer instructions.

14. A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the driving method according to any one of claims 1 to 11.
TW109145886A 2020-03-31 2020-12-24 Method and apparatus for driving interactive object, device and storage medium TWI766499B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010245761.9 2020-03-31
CN202010245761.9A CN111459450A (en) 2020-03-31 2020-03-31 Interactive object driving method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
TW202138993A TW202138993A (en) 2021-10-16
TWI766499B true TWI766499B (en) 2022-06-01

Family

ID=71682375

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109145886A TWI766499B (en) 2020-03-31 2020-12-24 Method and apparatus for driving interactive object, device and storage medium

Country Status (6)

Country Link
JP (1) JP2022531057A (en)
KR (1) KR20210124312A (en)
CN (1) CN111459450A (en)
SG (1) SG11202109464YA (en)
TW (1) TWI766499B (en)
WO (1) WO2021196643A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN113314104B (en) * 2021-05-31 2023-06-20 北京市商汤科技开发有限公司 Interactive object driving and phoneme processing method, device, equipment and storage medium
CN114283227B (en) * 2021-11-26 2023-04-07 北京百度网讯科技有限公司 Virtual character driving method and device, electronic equipment and readable storage medium
CN114330631A (en) * 2021-12-24 2022-04-12 上海商汤智能科技有限公司 Digital human generation method, device, equipment and storage medium
CN114741561A (en) * 2022-02-28 2022-07-12 商汤国际私人有限公司 Action generating method, device, electronic equipment and storage medium
TWI799223B (en) * 2022-04-01 2023-04-11 國立臺中科技大學 Virtual reality system for muscle strength scale teaching
CN114972589A (en) * 2022-05-31 2022-08-30 北京百度网讯科技有限公司 Driving method and device for virtual digital image
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium
CN116524896A (en) * 2023-04-24 2023-08-01 北京邮电大学 Pronunciation inversion method and system based on pronunciation physiological modeling
CN116665695B (en) * 2023-07-28 2023-10-20 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647636A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110866609A (en) * 2019-11-08 2020-03-06 腾讯科技(深圳)有限公司 Interpretation information acquisition method, device, server and storage medium

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002077755A (en) * 2000-08-29 2002-03-15 Sharp Corp Agent interface apparatus
JP2003058908A (en) * 2001-08-10 2003-02-28 Minolta Co Ltd Method and device for controlling face image, computer program and recording medium
JP2015038725A (en) * 2013-07-18 2015-02-26 国立大学法人北陸先端科学技術大学院大学 Utterance animation generation device, method, and program
JP5913394B2 (en) * 2014-02-06 2016-04-27 Psソリューションズ株式会社 Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system
JP2015166890A (en) * 2014-03-03 2015-09-24 ソニー株式会社 Information processing apparatus, information processing system, information processing method, and program
US10366689B2 (en) * 2014-10-29 2019-07-30 Kyocera Corporation Communication robot
CN106056989B (en) * 2016-06-23 2018-10-16 广东小天才科技有限公司 A kind of interactive learning methods and device, terminal device
CN107704169B (en) * 2017-09-26 2020-11-17 北京光年无限科技有限公司 Virtual human state management method and system
CN107861626A (en) * 2017-12-06 2018-03-30 北京光年无限科技有限公司 The method and system that a kind of virtual image is waken up
CN108942919B (en) * 2018-05-28 2021-03-30 北京光年无限科技有限公司 Interaction method and system based on virtual human
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110009716B (en) * 2019-03-28 2023-09-26 网易(杭州)网络有限公司 Facial expression generating method and device, electronic equipment and storage medium
CN110176284A (en) * 2019-05-21 2019-08-27 杭州师范大学 A kind of speech apraxia recovery training method based on virtual reality
CN110413841A (en) * 2019-06-13 2019-11-05 深圳追一科技有限公司 Polymorphic exchange method, device, system, electronic equipment and storage medium
CN110531860B (en) * 2019-09-02 2020-07-24 腾讯科技(深圳)有限公司 Animation image driving method and device based on artificial intelligence
CN110609620B (en) * 2019-09-05 2020-11-17 深圳追一科技有限公司 Human-computer interaction method and device based on virtual image and electronic equipment
CN111145777A (en) * 2019-12-31 2020-05-12 苏州思必驰信息科技有限公司 Virtual image display method and device, electronic equipment and storage medium
CN111541908A (en) * 2020-02-27 2020-08-14 北京市商汤科技开发有限公司 Interaction method, device, equipment and storage medium
CN111459452B (en) * 2020-03-31 2023-07-18 北京市商汤科技开发有限公司 Driving method, device and equipment of interaction object and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647636A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
CN110866609A (en) * 2019-11-08 2020-03-06 腾讯科技(深圳)有限公司 Interpretation information acquisition method, device, server and storage medium

Also Published As

Publication number Publication date
TW202138993A (en) 2021-10-16
JP2022531057A (en) 2022-07-06
WO2021196643A1 (en) 2021-10-07
KR20210124312A (en) 2021-10-14
SG11202109464YA (en) 2021-11-29
CN111459450A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
TWI760015B (en) Method and apparatus for driving interactive object, device and storage medium
WO2021196646A1 (en) Interactive object driving method and apparatus, device, and storage medium
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
WO2023284435A1 (en) Method and apparatus for generating animation
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN113689879A (en) Method, device, electronic equipment and medium for driving virtual human in real time
WO2022252890A1 (en) Interaction object driving and phoneme processing methods and apparatus, device and storage medium
RU2721180C1 (en) Method for generating an animation model of a head based on a speech signal and an electronic computing device which implements it
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
TWI759039B (en) Methdos and apparatuses for driving interaction object, devices and storage media
CN110166844B (en) Data processing method and device for data processing
KR102514580B1 (en) Video transition method, apparatus and computer program
CN114144790A (en) Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures