TWI747258B - Method for generating action according to audio signal and electronic device

Info

Publication number: TWI747258B
Application number: TW109114298A
Authority: TW
Inventors: 楊東庭; 王鈞立; 郭曜禎; 楊宏毅
Original assignee: 宏達國際電子股份有限公司
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2021-11-21
Also published as: TW202141233A

Abstract

The disclosure provides a method for generating an action according to an audio signal and an electronic device. The method includes: receiving an audio signal and extracting a high-level audio feature therefrom; extracting a latent audio feature from the high-level audio feature; in response to determining that the audio signal corresponds to a beat, obtaining a joint angle distribution matrix based on the latent audio feature; in response to determining that the audio signal corresponds to music, obtaining a plurality of designated joint angles corresponding to a plurality of joint points on an avatar based on the joint angle distribution matrix; and adjusting the joint angle of each joint point on the avatar according to the designated joint angles.

Description

Method and electronic device for generating action based on audio signal

The present invention relates to a technique for controlling an avatar, and in particular to a method and an electronic device for generating an action according to an audio signal.

In virtual reality (VR) and augmented reality (AR) experiences, the avatar is a key part of the application. If an avatar has the same perception and senses as the user and can react to the environment accordingly, the user's sense of immersion improves greatly.

In the prior art, there is a technique that allows an avatar to dance according to music. To achieve this, however, the technique must maintain a database storing a large number of preset dance steps from which dance steps are generated. It therefore consumes more memory and is not easy to implement in applications on edge devices (for example, embedded systems or mobile devices).

Furthermore, when music appears in the VR/AR environment, the above technique selects one or more dance steps from the database based on certain predetermined hand-crafted features and recombines them into a sequence of dance steps corresponding to the current music. The above technique therefore cannot make the avatar dance creatively.

In view of this, the present invention provides a method and an electronic device for generating an action according to an audio signal, which can be used to solve the above technical problems.

The present invention provides a method for generating an action according to an audio signal, including: receiving a first audio signal and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting the joint angle of each joint point on the avatar according to the designated joint angles.

The present invention provides an electronic device including a storage circuit and a processor. The storage circuit stores a plurality of modules. The processor is coupled to the storage circuit and accesses the modules to perform the following steps: receiving a first audio signal and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting the joint angle of each joint point on the avatar according to the designated joint angles.

Based on the above, the method of the present invention allows the avatar to improvise actions (such as dance steps) that match the current music without maintaining a dance step database, and is therefore suitable for electronic devices implemented as edge devices.

Please refer to FIG. 1, which is a schematic diagram of an electronic device according to an embodiment of the present invention. In different embodiments, the electronic device 100 is, for example, a computer device, an embedded system, a mobile device, or another device that can provide AR/VR or similar services, but is not limited thereto. As shown in FIG. 1, the electronic device 100 includes a storage circuit 102 and a processor 104.

The storage circuit 102 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, other similar device, or a combination of these devices, and can be used to record a plurality of program codes or modules.

The processor 104 is coupled to the storage circuit 102 and may be a general-purpose processor, a special-purpose processor, a conventional processor, a digital signal processor, a plurality of microprocessors, one or more microprocessors combined with a digital signal processor core, a controller, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), any other type of integrated circuit, a state machine, a processor based on an Advanced RISC Machine (ARM), or the like.

In the embodiments of the present invention, the processor 104 can access the modules and program codes recorded in the storage circuit 102 to implement the proposed method for generating an action according to an audio signal, the details of which are described below.

Please refer to FIG. 2, which is a flowchart of a method for generating an action according to an audio signal according to an embodiment of the present invention. The method of this embodiment can be executed by the electronic device 100 of FIG. 1, and the details of each step in FIG. 2 are described below with reference to the components shown in FIG. 1. In addition, to make the disclosure easier to understand, the description is supplemented by the system architecture diagram shown in FIG. 3, which serves only as an example and is not intended to limit the possible implementations of the present invention.

In brief, upon receiving a segment of an audio signal (for example, one audio frame), the method of the present invention determines the joint angle of each joint of the avatar in each dimension, so that the avatar as a whole presents an appropriate action. In different embodiments, the audio signal may come from any kind of sound, such as music, ambient sound, or speech, but is not limited thereto.

In FIG. 3, the audio signals F1 to FN are, for example, a plurality of consecutive audio frames. The processor 104 can process each of the audio signals F1 to FN in a similar way to generate the avatar action corresponding to the audio signal under consideration. For ease of description, the audio signal F1 is used as an example below, but this is not intended to limit the possible implementations of the present invention.

First, in step S210, the processor 104 may receive the audio signal F1 and extract a high-level audio feature H1 from it. In one embodiment, the audio signal F1 may include an audio frame, which can be represented as a vector (or array) of a specific dimension (for example, 2048x1), but is not limited thereto. In one embodiment, the processor 104 may input the audio frame into a convolutional neural network (CNN) N1, so that the CNN N1 extracts the high-level audio feature H1 from the audio frame. In the embodiments of the present invention, the CNN N1 may include one or more convolutional layers for extracting the corresponding high-level audio feature from the received audio frame, but is not limited thereto. For the technical details of extracting the high-level audio feature H1 with the CNN N1, refer to the related prior art documents; they are not repeated here.
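
As a rough illustration of step S210, the following minimal sketch applies a single 1-D convolution with a ReLU to one 2048-sample frame. The kernel values, the stride, and the function name `conv1d` are assumptions made for illustration only; the description does not disclose the layers or weights of the CNN N1.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation followed by a ReLU (toy stand-in for one CNN layer)."""
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        acc = sum(signal[start + i] * kernel[i] for i in range(k))
        out.append(max(acc, 0.0))
    return out

# One audio frame represented as a 2048-element vector, with a single impulse.
frame = [0.0] * 2048
frame[100] = 1.0

# Hypothetical smoothing kernel and stride; a real network would learn its kernels.
features = conv1d(frame, kernel=[0.25, 0.5, 0.25], stride=4)
```

A real feature extractor would stack several such layers with many learned kernels; the sketch only shows how a frame-sized vector is reduced to a shorter feature vector.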

Next, in step S220, the processor 104 may extract a latent audio feature L1 from the high-level audio feature H1. In one embodiment, the processor 104 may input the high-level audio feature H1 into a first recurrent neural network (RNN) N2, so that the first RNN N2 extracts the latent audio feature L1 from the high-level audio feature H1. For the technical details of extracting the latent audio feature L1 with the first RNN N2, refer to the related prior art documents; they are not repeated here.

In addition, in this embodiment, besides outputting the latent audio feature L1 based on the high-level audio feature H1, the first RNN N2 may also output a first internal state IS11. For details, refer to the technical literature on RNNs; they are not repeated here. In the embodiments of the present invention, the first RNN N2 may include a multi-stack structure for extracting the corresponding latent audio feature from the received high-level audio feature, but is not limited thereto.

In addition, in one embodiment, the first internal state IS11 allows the first RNN N2, when processing the high-level audio feature H2 corresponding to the next audio signal F2, to also take the preceding high-level audio feature H1 into account when generating the corresponding latent audio feature L2. The related details are explained later.
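
The role of the internal state can be illustrated with a minimal sketch, assuming a single-unit Elman-style recurrent cell with hypothetical weights `w_in` and `w_state`. This is not the first RNN N2 itself, whose structure is described only as possibly multi-stack; it merely shows how a state carried from one frame to the next makes the output for a frame depend on earlier frames.

```python
import math

def rnn_step(feature, state, w_in=0.5, w_state=0.8):
    """One step of a toy single-unit recurrent cell (weights are hypothetical).

    Returns (latent, new_state); in this toy cell the two values coincide.
    """
    new_state = math.tanh(w_in * feature + w_state * state)
    return new_state, new_state

def run_frames(features, initial_state=0.0):
    """Process per-frame features in order, carrying the internal state forward."""
    state = initial_state
    latents = []
    for h in features:
        latent, state = rnn_step(h, state)
        latents.append(latent)
    return latents

# Two identical inputs give different latents because the second step
# also sees the state left behind by the first.
latents = run_frames([1.0, 1.0])
```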

In one embodiment, the processor 104 may determine, based on the latent audio feature L1, whether the audio signal F1 corresponds to a beat (that is, whether it is on beat) and whether the audio signal F1 corresponds to music. In the embodiments of the present invention, the processor 104 may input the latent audio feature L1 into a specific neural network N3 (composed, for example, of a plurality of fully-connected layers), so that the specific neural network N3 determines, based on the latent audio feature L1, whether the audio signal F1 corresponds to a beat and whether it corresponds to music, but is not limited thereto.
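
The two decisions made from the latent audio feature can be sketched as a small classification head. The sketch below assumes a single fully-connected layer with two sigmoid outputs and a 0.5 threshold; the layer count, weights, latent dimension, and threshold are all hypothetical, since the description only states that N3 may consist of several fully-connected layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, bias):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def classify(latent, weights, bias, threshold=0.5):
    """Return (on_beat, is_music) for a latent feature vector."""
    beat_logit, music_logit = dense(latent, weights, bias)
    return sigmoid(beat_logit) > threshold, sigmoid(music_logit) > threshold

# Hypothetical 3-dimensional latent feature and hand-picked weights.
latent = [0.2, -1.0, 0.7]
weights = [[1.0, 0.0, 2.0],   # row producing the beat output
           [0.0, 3.0, 0.0]]   # row producing the music output
bias = [0.0, 0.0]
on_beat, is_music = classify(latent, weights, bias)
```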

For ease of description, it is assumed below that the audio signal F1 corresponds to a beat and corresponds to music (that is, it is not noise, human voice, or another non-musical sound). Accordingly, in step S230, in response to determining that the latent audio feature L1 indicates that the audio signal F1 corresponds to a beat, the processor 104 may obtain a joint angle distribution matrix M1 according to the latent audio feature L1, where the joint angle distribution matrix M1 may include a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters may correspond to a plurality of joint points on an avatar. In one embodiment, the processor 104 may input the latent audio feature L1 into a second RNN N4, so that the second RNN N4 generates the joint angle distribution matrix M1 based on the latent audio feature L1. In addition, the second RNN N4 may also generate a second internal state IS12 based on the latent audio feature L1.

In one embodiment, the avatar is, for example, a character in an AR/VR environment configured to dance to music. According to the relevant specifications of the Biovision Hierarchy (BVH), an avatar may be defined with an absolute position of a hip joint point (represented by x, y, z) and 52 other joint points, each of which can be represented by a set of joint rotation angles in three-dimensional space, for example (Rx, Ry, Rz). For example, for a first joint point on the avatar, the corresponding Rx, Ry, and Rz are the joint angles in a first dimension (for example, the X axis), a second dimension (for example, the Y axis), and a third dimension (for example, the Z axis), respectively, but are not limited thereto.

To facilitate the description of the concept of the present invention, it is assumed below that the joint points on the avatar under consideration include the hip joint point and the 52 other joint points, but the present invention is not limited thereto. It is also assumed that the actions of the avatar are defined based on the relevant BVH specifications, but the present invention is not limited thereto. In this case, the action of the avatar can be determined from a BVH motion capture data file. In one embodiment, a BVH motion capture data file may include 159 values (3 + 52 x 3), which respectively correspond to the absolute position of the hip joint point (that is, x, y, z) and the (Rx, Ry, Rz) of each of the 52 other joint points. Once the BVH motion capture data file is obtained, the action of the avatar can be determined accordingly. The present invention can determine the 159 values in the BVH motion capture data file based on the generated joint angle distribution matrix M1, and thereby determine the action of the avatar.
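
The count of 159 values follows from 3 hip position values plus 3 rotation values for each of the 52 other joint points. The sketch below works this out and maps a (joint, axis) pair to a position in a flat list of values. The ordering (hip position first, then joints in index order, each as Rx, Ry, Rz) and the helper name `value_index` are assumptions for illustration; an actual BVH file fixes the channel order in its hierarchy section.

```python
NUM_OTHER_JOINTS = 52
HIP_POSITION_VALUES = 3   # x, y, z
VALUES_PER_JOINT = 3      # Rx, Ry, Rz
FRAME_SIZE = HIP_POSITION_VALUES + NUM_OTHER_JOINTS * VALUES_PER_JOINT

def value_index(joint, axis):
    """Index of the rotation of `joint` (0..51) about `axis` (0=Rx, 1=Ry, 2=Rz)."""
    if not 0 <= joint < NUM_OTHER_JOINTS or not 0 <= axis < VALUES_PER_JOINT:
        raise IndexError("joint or axis out of range")
    return HIP_POSITION_VALUES + joint * VALUES_PER_JOINT + axis
```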

Specifically, in the first embodiment, the joint angle distribution matrix M1 can be implemented as a matrix of dimension 159x2, whose 159 rows respectively correspond to x, y, z and the (Rx, Ry, Rz) of each of the 52 other joint points. For example, assume that a certain joint point on the avatar (hereinafter, the first joint point) has a first movable angle range in the first dimension (that is, the movable angle range of Rx of the first joint point), and that this first movable angle range is modeled in the present invention as a first Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Rx of the first joint point may include 2 elements, which are respectively the expected value (denoted μ) and the standard deviation (denoted σ) of the first Gaussian distribution model. As another example, assume that the first joint point also has another movable angle range in the second dimension (that is, the movable angle range of Ry of the first joint point), and that this other movable angle range is modeled as another Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Ry of the first joint point may include 2 elements, which are respectively the expected value and the standard deviation of the other Gaussian distribution model.

Based on the above teachings, a person of ordinary skill in the art should be able to understand the meaning and content of the remaining rows of the joint angle distribution matrix M1, which are not repeated here. In addition, in the first embodiment, the first column of the joint angle distribution matrix M1 may consist, for example, of the expected values of the rows, and the second column may consist of the standard deviations of the rows, but is not limited thereto.
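
The 159x2 layout of the first embodiment can be sketched as a list with one [expected value, standard deviation] pair per motion value. The parameter values below are placeholders rather than network output, and the helper names `make_distribution_matrix` and `column` are hypothetical.

```python
def make_distribution_matrix(means, stds):
    """Build the matrix as a list of [mean, std] rows, one row per motion value."""
    if len(means) != len(stds):
        raise ValueError("means and stds must have the same length")
    return [[m, s] for m, s in zip(means, stds)]

def column(matrix, index):
    """Extract one column, e.g. all expected values (0) or all standard deviations (1)."""
    return [row[index] for row in matrix]

# Placeholder parameters for the 159 values of one frame.
m1 = make_distribution_matrix([float(i) for i in range(159)], [1.0] * 159)
expected_values = column(m1, 0)
standard_deviations = column(m1, 1)
```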

After the joint angle distribution matrix M1 is obtained, in step S240, in response to determining that the latent audio feature L1 indicates that the audio signal F1 corresponds to music, the processor 104 may obtain a plurality of designated joint angles corresponding to the joint points based on the joint angle distribution matrix M1.

Taking the first joint point as an example again, if the processor 104 is to obtain a first designated joint angle of the first joint point in the first dimension, the processor 104 may sample a first angle within the first movable angle range based on the first Gaussian distribution model and use it as the first designated joint angle of the first joint point in the first dimension. For ease of understanding, this is further described with reference to FIG. 4.

Please refer to FIG. 4, which shows the first Gaussian distribution model used to model the first movable angle range according to the first embodiment of the present invention. In FIG. 4, it is assumed that the first joint point has a first movable angle range R1 in the first dimension, and that a first Gaussian distribution model G1 is used to model the first movable angle range R1. In this case, the processor 104 may sample a first angle within the first movable angle range R1 based on the first Gaussian distribution model G1 and use it as the first designated joint angle of the first joint point in the first dimension. In one embodiment, the processor 104 may, for example, randomly sample the first angle within the first movable angle range R1 based on the first Gaussian distribution model G1. In another embodiment, the processor 104 may instead directly take, within the first movable angle range R1, the first angle corresponding to the expected value (that is, μ) as the first designated joint angle, but is not limited thereto.
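
The two ways of picking an angle described above (random sampling, or taking the expected value) can be sketched as follows. The sketch assumes the movable angle range is given as explicit bounds `low` and `high` and uses rejection sampling to stay inside it; how the range is derived from the Gaussian model is not specified in the description, so the bounds, the retry limit, and the function names are hypothetical.

```python
import random

def expected_angle(mean, low, high):
    """Deterministic choice: the expected value, clamped into the movable range."""
    return min(max(mean, low), high)

def sample_angle(mean, std, low, high, rng=random, max_tries=1000):
    """Draw an angle from N(mean, std), rejecting draws outside [low, high]."""
    for _ in range(max_tries):
        angle = rng.gauss(mean, std)
        if low <= angle <= high:
            return angle
    # Fall back to the clamped expected value if no draw landed in range.
    return expected_angle(mean, low, high)

rng = random.Random(0)  # seeded only to make the example repeatable
angle = sample_angle(mean=30.0, std=10.0, low=0.0, high=60.0, rng=rng)
```

Random sampling gives a different pose on each call for the same parameters, whereas `expected_angle` always returns the same pose for the same parameters.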

Similarly, if the processor 104 is to obtain the designated joint angle of the first joint point in the second dimension, the processor 104 may (randomly) sample an angle within the other movable angle range based on the other Gaussian distribution model and use it as another designated joint angle of the first joint point in the second dimension. Based on the above teachings, a person of ordinary skill in the art should be able to understand how the processor 104 obtains the designated joint angle of each joint point in each dimension, which is not repeated here.

After the designated joint angles corresponding to the joint points are obtained, in step S250, the processor 104 may adjust the joint angle of each joint point on the avatar according to the designated joint angles. In the first embodiment, the processor 104 may output the designated joint angles corresponding to the joint points in the form of a designated joint angle vector S1 (whose dimension is, for example, 159x1). For example, if the processor 104 takes the angle corresponding to the expected value as the designated joint angle for every joint point, the processor 104 can directly use the first column of the joint angle distribution matrix M1 as the designated joint angle vector S1, but the present invention is not limited thereto.
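
A minimal sketch of forming the designated joint angle vector from a matrix of [expected value, standard deviation] rows is given below. The function name `designated_vector` and the `strategy` parameter are hypothetical; the "mean" strategy corresponds to reusing the expected values directly, and the "sample" strategy draws each value from its Gaussian without range clamping.

```python
import random

def designated_vector(matrix, strategy="mean", rng=random):
    """Turn rows of [mean, std] into one designated value per row."""
    if strategy == "mean":
        return [mean for mean, _std in matrix]
    if strategy == "sample":
        return [rng.gauss(mean, std) for mean, std in matrix]
    raise ValueError("strategy must be 'mean' or 'sample'")

# Tiny two-row example instead of the full 159 rows.
matrix = [[1.0, 0.5], [2.0, 0.1]]
s_mean = designated_vector(matrix)
s_sampled = designated_vector(matrix, "sample", random.Random(1))
```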

In this case, the processor 104 may, for example, generate a corresponding BVH motion capture data file based on the designated joint angles in the designated joint angle vector S1, and adjust the joint angle of each joint point on the avatar based on this BVH motion capture data file. For example, the processor 104 may adjust the joint angle of the first joint point in the first dimension to correspond to the first designated joint angle (for example, the expected value of the first Gaussian distribution model G1). The processor 104 may also adjust the joint angle of the first joint point in the second dimension to correspond to the other designated joint angle (for example, the expected value of the other Gaussian distribution model). In this way, the processor 104 can adjust the joint angles of the joint points on the avatar in the different dimensions according to the content of the BVH motion capture data file, so that the avatar presents a specific action (such as a dance step).
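
As an illustration of turning a designated joint angle vector into BVH motion data, the sketch below renders only the MOTION part of a BVH file, in which each frame is one line of space-separated channel values. The hierarchy section, the frame time of 1/30 s, the number formatting, and the function name `motion_section` are assumptions; the description does not state how the file is produced.

```python
def motion_section(frames, frame_time=1.0 / 30.0):
    """Render the MOTION part of a BVH file from per-frame lists of values."""
    lines = ["MOTION",
             f"Frames: {len(frames)}",
             f"Frame Time: {frame_time:.6f}"]
    for values in frames:
        lines.append(" ".join(f"{v:.4f}" for v in values))
    return "\n".join(lines)

# Placeholder vector of 159 values standing in for one designated joint angle vector.
s1 = [0.0] * 159
text = motion_section([s1])
```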

As can be seen from the above, unlike the conventional approach of selecting existing dance steps from a database and recombining them, the method of the present invention determines the joint angle of each joint point on the avatar in each dimension according to the current audio signal, so that the avatar can improvise and dance on the beat based on the current music.

In other embodiments, a single joint point may have two or more movable angle ranges in a single dimension, and these movable angle ranges can be modeled as a multivariate mixture Gaussian model. This is further described below as a second embodiment.

In the second embodiment, it is assumed that a single joint point has two movable angle ranges in a single dimension, but the present invention is not limited thereto. In this case, the joint angle distribution matrix M1 can be implemented as a matrix of dimension 159x4, whose 159 rows respectively correspond to x, y, z and the (Rx, Ry, Rz) of each of the 52 other joint points. Taking the first joint point as an example again, assume that the first joint point has a first and a second movable angle range in the first dimension (that is, movable angle ranges of Rx of the first joint point), and that these two movable angle ranges are modeled in the present invention as a first multivariate mixture Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Rx of the first joint point may include 4 elements, which are respectively the first expected value (denoted μ1), the first standard deviation (denoted σ1), the second expected value (denoted μ2), and the second standard deviation (denoted σ2) of the first multivariate mixture Gaussian distribution model.

Based on the above teachings, a person of ordinary skill in the art should be able to understand the meaning and content of the remaining rows of the joint angle distribution matrix M1 in the second embodiment, which are not repeated here. In addition, in the second embodiment, the first column of the joint angle distribution matrix M1 may consist, for example, of the first expected values of the rows, the second column of the first standard deviations, the third column of the second expected values, and the fourth column of the second standard deviations, but is not limited thereto.

Taking the first joint point as an example again, if the processor 104 is to obtain the first designated joint angle of the first joint point in the first dimension, the processor 104 may sample a first angle within the first movable angle range or the second movable angle range based on the first multivariate mixture Gaussian distribution model and use it as the first designated joint angle of the first joint point in the first dimension. For ease of understanding, this is further described with reference to FIG. 5.

Please refer to FIG. 5, which shows the first multivariate mixture Gaussian distribution model used to model the first and second movable angle ranges according to the second embodiment of the present invention. In FIG. 5, it is assumed that the first joint point has a first movable angle range R11 and a second movable angle range R12 in the first dimension, and that a first multivariate mixture Gaussian distribution model G1' is used to model the first movable angle range R11 (which corresponds to μ1 and σ1) and the second movable angle range R12 (which corresponds to μ2 and σ2). In this case, the processor 104 may sample a first angle within the first movable angle range R11 or the second movable angle range R12 based on the first multivariate mixture Gaussian distribution model G1' and use it as the first designated joint angle of the first joint point in the first dimension. In one embodiment, the processor 104 may, for example, randomly sample the first angle within the first movable angle range R11 or the second movable angle range R12 based on the first multivariate mixture Gaussian distribution model G1'. In another embodiment, the processor 104 may instead directly take, within the first movable angle range R11 or the second movable angle range R12, the first angle corresponding to an expected value (that is, μ1 or μ2) as the first designated joint angle, but is not limited thereto.

In other embodiments, assuming that two controllable avatars A and B exist in the AR/VR environment and that both have the first joint point, the processor 104 may sample an angle within the first movable angle range R11 based on the first multivariate mixture Gaussian distribution model G1’ as the first designated joint angle of the first joint point of avatar A in the first dimension. In addition, the processor 104 may sample an angle within the second movable angle range R12 based on the same model G1’ as the first designated joint angle of the first joint point of avatar B in the first dimension, so that different avatars perform different dance steps in response to the current music, but the present invention is not limited thereto. Based on the above teachings, a person having ordinary skill in the art should be able to understand how the processor 104 obtains the designated joint angle of each joint point in each dimension in the second embodiment, so the details are not repeated here.
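
As an illustration only, the two sampling options described above (random sampling within a movable angle range, or taking the expected value directly) might be sketched as follows. The parameter values, the function name `sample_joint_angle`, and the use of rejection sampling to stay inside the range are assumptions made for this sketch; the specification does not state how the sampling is restricted to the range.

```python
import random


def sample_joint_angle(mu, sigma, lo, hi, rng, use_mean=False, max_tries=1000):
    """Pick one joint angle (degrees) inside the movable range [lo, hi]."""
    if use_mean:
        # Deterministic option: take the expected value, clamped to the range.
        return min(max(mu, lo), hi)
    for _ in range(max_tries):
        angle = rng.gauss(mu, sigma)
        if lo <= angle <= hi:
            return angle
    # If rejection sampling keeps missing, fall back to the clamped mean.
    return min(max(mu, lo), hi)


# Hypothetical parameters for one joint point in one dimension.
R11 = {"mu": -40.0, "sigma": 8.0, "lo": -70.0, "hi": -10.0}
R12 = {"mu": 35.0, "sigma": 6.0, "lo": 10.0, "hi": 60.0}

rng = random.Random(0)
# Avatar A samples from the first range, avatar B from the second.
angle_a = sample_joint_angle(R11["mu"], R11["sigma"], R11["lo"], R11["hi"], rng)
angle_b = sample_joint_angle(R12["mu"], R12["sigma"], R12["lo"], R12["hi"], rng)
# Expected-value option for the first range.
mean_a = sample_joint_angle(R11["mu"], R11["sigma"], R11["lo"], R11["hi"], rng, use_mean=True)
```

Sampling avatar A from one range and avatar B from the other is what lets two avatars move differently to the same audio frame while sharing one model.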

In addition, the first joint point may also have two movable angle ranges in the second dimension, and these two movable angle ranges may likewise be modeled as another multivariate Gaussian distribution model. In this case, the way the processor 104 determines the designated joint angle of the first joint point in the second dimension follows the above teachings and is not repeated here. Moreover, the movable angle ranges of the other joint points in each dimension may also be modeled as corresponding multivariate Gaussian models based on the above teachings, and the details are likewise not repeated here.

After obtaining the designated joint angles corresponding to the joint points, in step S250 of the second embodiment, the processor 104 may adjust the joint angle of each joint point on the avatar according to the designated joint angles. In the second embodiment, the processor 104 may output the designated joint angles corresponding to the joint points in the form of a designated joint angle vector S1 (whose dimension is, for example, 159x1). For example, if the processor 104 takes, for every joint point, the angle corresponding to the first expected value as the designated joint angle, the processor 104 may directly use the first column of the joint angle distribution matrix M1 as the designated joint angle vector S1. As another example, if the processor 104 takes, for every joint point, the angle corresponding to the second expected value as the designated joint angle, the processor 104 may directly use the third column of the joint angle distribution matrix M1 as the designated joint angle vector S1, but the present invention is not limited thereto.
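
The shortcut of reading the designated joint angle vector straight out of the distribution matrix can be sketched as below. The layout is an assumption for illustration: each entry of the vector has its own line of parameters in the matrix, ordered as first expected value, first standard deviation, second expected value, second standard deviation. Three entries stand in for the 159 mentioned in the text, and the helper name `take_parameter` is hypothetical.

```python
def take_parameter(matrix, index):
    """Collect the parameter at position `index` for every joint angle entry."""
    return [entry[index] for entry in matrix]


# Assumed layout per entry: [mu1, sigma1, mu2, sigma2], angles in degrees.
M1 = [
    [-40.0, 8.0, 35.0, 6.0],
    [12.0, 3.0, 50.0, 4.0],
    [0.0, 5.0, 90.0, 10.0],
]

S1_first_expected = take_parameter(M1, 0)   # all first expected values
S1_second_expected = take_parameter(M1, 2)  # all second expected values
```

No sampling is needed in this case, since the expected values are already stored in the matrix.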

In this case, the processor 104 may, for example, generate a corresponding BVH motion capture data file based on the designated joint angles in the designated joint angle vector S1, and adjust the joint angle of each joint point on the avatar based on this BVH motion capture data file. For example, the processor 104 may adjust the joint angle of the first joint point in the first dimension to correspond to the first designated joint angle (for example, the first expected value or the second expected value of the first multivariate mixture Gaussian distribution model G1’). Accordingly, the processor 104 may adjust the joint angles of the joint points on the avatar in the different dimensions according to the content of the BVH motion capture data file, so that the avatar performs a specific action (for example, a dance step).
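
For readers unfamiliar with the BVH format, the following sketch writes designated joint angles into a minimal BVH text. The two-joint skeleton (`Hips`, `Spine`), its offsets, the channel order, and the function name `to_bvh` are all hypothetical; an actual avatar skeleton would have many more joints, and the specification does not describe its skeleton definition.

```python
# Hypothetical two-joint skeleton; a real avatar would define many more joints.
HIERARCHY = [
    "HIERARCHY",
    "ROOT Hips",
    "{",
    "  OFFSET 0.0 0.0 0.0",
    "  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation",
    "  JOINT Spine",
    "  {",
    "    OFFSET 0.0 10.0 0.0",
    "    CHANNELS 3 Zrotation Xrotation Yrotation",
    "    End Site",
    "    {",
    "      OFFSET 0.0 10.0 0.0",
    "    }",
    "  }",
    "}",
]
CHANNELS_PER_FRAME = 9  # 6 root channels + 3 rotation channels for Spine


def to_bvh(frames, frame_time=1.0 / 30.0):
    """Return BVH text with one motion line per frame of channel values."""
    for frame in frames:
        if len(frame) != CHANNELS_PER_FRAME:
            raise ValueError("each frame needs %d values" % CHANNELS_PER_FRAME)
    lines = list(HIERARCHY)
    lines.append("MOTION")
    lines.append("Frames: %d" % len(frames))
    lines.append("Frame Time: %.6f" % frame_time)
    for frame in frames:
        lines.append(" ".join("%.4f" % value for value in frame))
    return "\n".join(lines) + "\n"


# One frame: root position, root rotation, then three designated Spine angles.
bvh_text = to_bvh([[0.0, 90.0, 0.0, 0.0, 0.0, 0.0, 10.0, -40.0, 5.0]])
```

Each new designated joint angle vector would add one line to the MOTION section, with the values in the same order as the CHANNELS declarations.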

Please refer to FIG. 6, which is a schematic diagram of a BVH motion capture data file and the corresponding avatar according to an embodiment of the present invention. In this embodiment, after the processor 104 generates the BVH motion capture data file 610 according to the previous teachings, the processor 104 may adjust the joint angles of the joint points on the avatar 620 in each dimension according to its content, so that the avatar 620 performs a specific action, dance step, posture, and so on, but the present invention is not limited thereto.

It should be understood that the above embodiments assume that the audio signal F1 corresponds to both a beat and music. For other audio signals that do not correspond to a beat or to music, the method of the present invention may be executed based on different mechanisms, as further described below in the third embodiment.

For example, in the third embodiment, it is assumed that the audio signal F2 following the audio signal F1 corresponds to music but does not correspond to a beat (that is, it is not on the beat). In this case, the processor 104 may still perform step S210 to receive the audio signal F2 and extract the high-level audio feature H2 from the audio signal F2. In one embodiment, the processor 104 may input the audio signal F2 (for example, an audio frame) into the CNN N1, so that the CNN N1 extracts the high-level audio feature H2 from the audio signal F2.

Then, in step S220, the processor 104 may extract the latent audio feature L2 from the high-level audio feature H2. In one embodiment, the processor 104 may input the high-level audio feature H2 into the first RNN N2, so that the first RNN N2 extracts the latent audio feature L2 from the high-level audio feature H2 based on the first internal state IS11. In this embodiment, since the first internal state IS11 comes from the operation of the previous stage, it can be regarded as the historical internal state in the third embodiment. Moreover, since the first internal state IS11 carries information related to the high-level audio feature H1 of the previous stage, the latent audio feature L2 extracted by the first RNN N2 also takes into account the information of the previous stage (or stages), but the present invention is not limited thereto.

In addition, in this embodiment, besides outputting the latent audio feature L2 based on the high-level audio feature H2, the first RNN N2 may also output the first internal state IS21 for use by the next stage, but the present invention is not limited thereto.
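
How an internal state carries history from one audio frame to the next can be shown with a toy recurrent step. This is a plain Elman-style cell with made-up weights, not the first RNN N2 itself; in this toy cell the latent feature and the internal state are the same vector, which need not hold for the network used in the embodiments. The names `rnn_step`, `W_IN`, and `W_REC` are hypothetical.

```python
import math

# Made-up weights for a toy cell with 2 inputs and 2 hidden units.
W_IN = [[0.5, -0.2], [0.1, 0.3]]
W_REC = [[0.4, 0.0], [0.0, 0.4]]


def rnn_step(high_level_feature, prev_state):
    """Return (latent_feature, new_state) for one frame."""
    new_state = []
    for i in range(len(prev_state)):
        total = sum(W_IN[i][j] * x for j, x in enumerate(high_level_feature))
        total += sum(W_REC[i][j] * h for j, h in enumerate(prev_state))
        new_state.append(math.tanh(total))
    # In this toy cell the latent feature is simply a copy of the new state.
    return list(new_state), new_state


H1, H2 = [1.0, 0.0], [0.0, 1.0]
L1, IS11 = rnn_step(H1, [0.0, 0.0])   # first frame, no history yet
L2, IS21 = rnn_step(H2, IS11)         # second frame uses IS11 as history
L2_without_history, _ = rnn_step(H2, [0.0, 0.0])
```

Comparing `L2` with `L2_without_history` shows that the same high-level feature yields a different latent feature once the state from the previous frame is supplied.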

In the third embodiment, the processor 104 may likewise input the latent audio feature L2 into the specific neural network N3, so that the specific neural network N3 determines, based on the latent audio feature L2, whether the audio signal F2 corresponds to a beat and whether it corresponds to music, but the present invention is not limited thereto.

Since the audio signal F2 in the third embodiment is assumed to correspond to music but not to be on the beat, the processor 104 may perform step S230 in a manner different from the first and second embodiments to generate the corresponding joint angle distribution matrix M2. Specifically, in the third embodiment, the processor 104 may obtain a historical joint angle distribution matrix, which may include a plurality of historical Gaussian distribution parameters corresponding to the joint points on the avatar. In the third embodiment, the historical joint angle distribution matrix is, for example, the joint angle distribution matrix M1 generated in the previous stage, and the historical Gaussian distribution parameters are the content of the joint angle distribution matrix M1, but the present invention is not limited thereto.

Then, the processor 104 may convert this historical joint angle distribution matrix (that is, the joint angle distribution matrix M1) into a reference audio feature L2’ and define this reference audio feature L2’ as the (new) latent audio feature L2. The processor 104 may then, for example, input the reference audio feature L2’ (that is, the new latent audio feature L2) into the second RNN N4, so that the second RNN N4 produces the joint angle distribution matrix M2.

In short, since the audio signal F2 is not on the beat, the processor 104 may ignore the original latent audio feature L2 and instead input the reference audio feature L2’, converted from the joint angle distribution matrix M1, into the second RNN N4 as the (new) latent audio feature L2, so that the second RNN N4 obtains the joint angle distribution matrix M2 accordingly.

In one embodiment, to convert the joint angle distribution matrix M1 into a reference audio feature L2’ whose dimension is suitable for input to the second RNN N4, the processor 104 may simply use a fully connected neural network layer for the conversion. Alternatively, the processor 104 may perform the conversion based on convolutional layers and pooling layers, but the present invention is not limited thereto. For the principle of feeding the (converted) joint angle distribution matrix M1 into the second RNN N4 to obtain the joint angle distribution matrix M2, reference may be made to "Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis, cs.LG, 2017", and the details are not repeated here.
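
The fully connected conversion can be pictured as flattening the historical matrix and applying one affine layer whose output length matches what the second RNN expects. The sketch below uses tiny, hand-picked weights and sizes (a 2x2 matrix mapped to a 3-element feature) purely for illustration; in practice the weights would be learned and the sizes far larger. The function names are hypothetical.

```python
def dense(inputs, weights, bias):
    """Affine layer: one output per weight row, without an activation."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def matrix_to_reference_feature(matrix, weights, bias):
    """Flatten the joint angle distribution matrix and project it."""
    flat = [value for row in matrix for value in row]
    if any(len(row) != len(flat) for row in weights):
        raise ValueError("each weight row must match the flattened input size")
    return dense(flat, weights, bias)


# Toy historical matrix (2 entries x 2 parameters) and hand-picked weights.
M1 = [[10.0, 2.0], [20.0, 4.0]]
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
BIAS = [0.0, 0.0, 1.0]

L2_ref = matrix_to_reference_feature(M1, WEIGHTS, BIAS)
```

The only requirement on this layer is dimensional: its output must have the same length as the latent audio feature that the second RNN normally receives.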

In addition, in the third embodiment, the second RNN N4 may further generate the joint angle distribution matrix M2 based on both the reference audio feature L2’ and the second internal state IS12, so that a better joint angle distribution matrix M2 is produced by taking the information of one or more previous stages into account, but the present invention is not limited thereto.

After generating the joint angle distribution matrix M2, the processor 104 may, for example, generate the corresponding designated joint angle vector S1 based on the mechanisms taught in the first and second embodiments, and accordingly adjust the action, dance step, or posture of the avatar to match the audio signal F2.

In the fourth embodiment, it is assumed that the audio signal F3 corresponds to both a beat and music, so the processor 104 may adjust the action, dance step, or posture of the avatar to match the audio signal F3 based on the mechanisms taught in the first and second embodiments. The details are not repeated here.

In addition, in the fifth embodiment, if the specific neural network N3 determines that the latent audio feature (not separately labeled) of the audio signal FN indicates that the audio signal FN corresponds to neither a beat nor music, the processor 104 may leave the joint angles of the joint points on the avatar unadjusted, or adjust the avatar to present an idle posture. This prevents the avatar from dancing on its own when there is no music, but the present invention is not limited thereto.
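
The cases covered by the embodiments above can be summarized as a small decision function. This is a reading of the text rather than code from the specification: the function name `plan_step`, the returned labels, and the stand-in conversion `fake_convert` are hypothetical, and treating every non-music frame as idle follows the claim on audio that does not correspond to any music.

```python
def plan_step(on_beat, is_music, latent, history_matrix, to_reference_feature):
    """Return (action, second_rnn_input) for one audio frame."""
    if not is_music:
        # No music: keep the current pose or switch to an idle posture.
        return ("idle", None)
    if on_beat:
        # On the beat: feed the latent audio feature to the second RNN.
        return ("dance", latent)
    # Music but off the beat: reuse the previous distribution matrix instead.
    return ("dance", to_reference_feature(history_matrix))


def fake_convert(matrix):
    # Hypothetical stand-in for the learned conversion layer.
    return [value for row in matrix for value in row]


M1 = [[-40.0, 8.0], [12.0, 3.0]]
case_f1 = plan_step(True, True, [0.1, 0.2], None, fake_convert)   # beat and music
case_f2 = plan_step(False, True, [0.3, 0.4], M1, fake_convert)    # music, off beat
case_fn = plan_step(False, False, [0.5, 0.6], M1, fake_convert)   # neither
```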

Please refer to FIG. 7, which is a schematic diagram of the training phase according to an embodiment of the present invention. The training mechanism shown in FIG. 7 may be used to produce the CNN N1, the first RNN N2, the specific neural network N3, and the second RNN N4 mentioned in the previous embodiments. Specifically, in this embodiment, the processor 104 may first input music training data into the neural networks to be trained (that is, the CNN N1, the first RNN N2, the specific neural network N3, and the second RNN N4). In one embodiment, the model parameters of each neural network may be initialized to random values, but the present invention is not limited thereto.

Then, the processor 104 may, based on dance step training data, model the movable angle range of each joint point on the avatar in each dimension as a corresponding (univariate or multivariate) Gaussian model and generate a predicted dance step accordingly. The processor 104 may then compute a loss function based on the predicted dance step and the corresponding dance step training data, and adjust the model parameters of the neural networks (for example, the weights of the neurons) according to the result of the loss function. This process may be repeated until the predicted dance step is sufficiently close to the corresponding dance step training data. For the technical details of the training phase, reference may be made to the relevant prior art literature, and they are not repeated here.
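
The specification does not name the loss function. One common choice when a network predicts Gaussian parameters is the negative log-likelihood of the training angle under the predicted distribution; the sketch below shows that choice only as an assumption, with hypothetical function names and example values.

```python
import math


def gaussian_nll(target, mu, sigma):
    """Negative log-likelihood of `target` under a Gaussian N(mu, sigma^2)."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (target - mu) ** 2 / (2.0 * sigma ** 2)


def mean_nll(targets, mus, sigmas):
    """Average the loss over all joint angle entries of one frame."""
    losses = [gaussian_nll(t, m, s) for t, m, s in zip(targets, mus, sigmas)]
    return sum(losses) / len(losses)


# Hypothetical training angles and two candidate predictions.
targets = [30.0, -15.0]
good = mean_nll(targets, [28.0, -14.0], [5.0, 5.0])
bad = mean_nll(targets, [10.0, 20.0], [5.0, 5.0])
```

A prediction whose expected values lie closer to the training angles yields a smaller loss, which is the signal a training loop would use to adjust the network weights.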

In summary, the method and electronic device proposed by the present invention allow an avatar in an AR/VR environment to improvise dance moves on the beat according to the current music, without maintaining a dance step database. In addition, the method of the present invention allows the electronic device to consume less memory and to perform the related operations in real time. Therefore, even if the electronic device is an edge device with relatively limited resources, the method of the present invention still allows the electronic device to smoothly control the avatar to dance to the music.

Although the present invention has been disclosed in the above embodiments, they are not intended to limit the present invention. Anyone having ordinary skill in the art may make some changes and modifications without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall therefore be defined by the appended claims.

100: electronic device
102: storage circuit
104: processor
610: BVH motion capture data file
620: avatar
G1: first Gaussian distribution model
G1’: first multivariate Gaussian distribution model
H1, H2: high-level audio features
IS11, IS21: first internal states
IS12: second internal state
L1, L2: latent audio features
L2’: reference audio feature
M1, M2: joint angle distribution matrices
N1: CNN
N2: first RNN
N3: specific neural network
N4: second RNN
R1, R11: first movable angle range
R12: second movable angle range
S1, S2: designated joint angle vectors
S210~S250: steps

FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for generating an action based on an audio signal according to an embodiment of the present invention.
FIG. 3 is a system architecture diagram according to an embodiment of the present invention.
FIG. 4 shows a first Gaussian distribution model for modeling the first movable angle range according to the first embodiment of the present invention.
FIG. 5 shows a first multivariate mixture Gaussian distribution model for modeling the first and second movable angle ranges according to the second embodiment of the present invention.
FIG. 6 is a schematic diagram of a BVH motion capture data file and the corresponding avatar according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of the training phase according to an embodiment of the present invention.

S210~S250: steps

Claims

A method for generating an action based on an audio signal, comprising: receiving a first audio signal, and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting a joint angle of each joint point on the avatar according to the designated joint angles.

The method according to claim 1, wherein the first audio signal includes a first audio frame, and the step of extracting the first high-level audio feature from the first audio signal includes: inputting the first audio frame into a convolutional neural network, so that the convolutional neural network extracts the first high-level audio feature from the first audio frame.

The method according to claim 1, wherein the step of extracting the first latent audio feature from the first high-level audio feature includes: inputting the first high-level audio feature into a first recurrent neural network, so that the first recurrent neural network extracts the first latent audio feature from the first high-level audio feature.

The method according to claim 1, further comprising: inputting the first latent audio feature into a specific neural network, so that the specific neural network determines, based on the first latent audio feature, whether the first audio signal corresponds to the first beat, and determines, based on the first latent audio feature, whether the first audio signal corresponds to the first music.

The method according to claim 1, wherein in response to determining that the first latent audio feature indicates that the first audio signal does not correspond to any beat, the method further includes: obtaining a historical joint angle distribution matrix, converting the historical joint angle distribution matrix into a reference audio feature, and defining the reference audio feature as the first latent audio feature, wherein the historical joint angle distribution matrix includes a plurality of historical Gaussian distribution parameters, and the historical Gaussian distribution parameters correspond to the joint points on the avatar; and obtaining the first joint angle distribution matrix according to the first latent audio feature.

The method according to claim 1, wherein the step of obtaining the first joint angle distribution matrix according to the first latent audio feature includes: inputting the first latent audio feature into a second recurrent neural network, so that the second recurrent neural network generates the first joint angle distribution matrix based on the first latent audio feature.

The method according to claim 1, wherein the step of extracting the first latent audio feature from the first high-level audio feature includes: obtaining a first historical internal state; and inputting the first high-level audio feature into a first recurrent neural network, so that the first recurrent neural network extracts the first latent audio feature from the first high-level audio feature based on the first historical internal state.

The method according to claim 7, wherein the step of obtaining the first historical internal state comprises: receiving a first historical audio signal prior to the first audio signal, and extracting a first historical high-level audio feature from the first historical audio signal; and inputting the first historical high-level audio feature into the first recurrent neural network, so that the first recurrent neural network generates, based on the first historical high-level audio feature, the first historical internal state and a first historical latent audio feature corresponding to the first historical high-level audio feature.

The method according to claim 8, wherein the step of obtaining the first joint angle distribution matrix according to the first latent audio feature includes: obtaining a second historical internal state; and inputting the first latent audio feature into a second recurrent neural network, so that the second recurrent neural network generates the first joint angle distribution matrix based on the second historical internal state and the first latent audio feature.

The method according to claim 9, wherein the step of obtaining the second historical internal state includes: inputting the first historical latent audio feature into the second recurrent neural network, so that the second recurrent neural network generates the second historical internal state based on the first historical latent audio feature.

The method according to claim 1, wherein in response to determining that the first latent audio feature indicates that the first audio signal does not correspond to any music, the method further includes: not adjusting the joint angles of the joint points on the avatar, or adjusting the avatar to present an idle posture.

The method according to claim 1, wherein the joint points include a first joint point, the first joint point has a first movable angle range in a first dimension, and the Gaussian distribution parameters include a first expected value and a first standard deviation, wherein the first expected value and the first standard deviation correspond to a first Gaussian distribution model for modeling the first movable angle range.

The method according to claim 12, wherein the designated joint angles include a first designated joint angle corresponding to the first joint point in the first dimension, and the step of obtaining the designated joint angles corresponding to the joint points based on the first joint angle distribution matrix includes: sampling a first angle within the first movable angle range based on the first Gaussian distribution model as the first designated joint angle of the first joint point in the first dimension.

The method according to claim 13, wherein the first angle is an angle corresponding to the first expected value in the first movable angle range.

The method according to claim 13, wherein the step of adjusting the joint angle of each joint point on the avatar according to the designated joint angles includes: adjusting the joint angle of the first joint point in the first dimension to correspond to the first designated joint angle.

The method according to claim 1, wherein the joint points include a first joint point, the first joint point has a first movable angle range and a second movable angle range in a first dimension, and the Gaussian distribution parameters include a first expected value, a first standard deviation, a second expected value, and a second standard deviation, wherein the first expected value, the first standard deviation, the second expected value, and the second standard deviation correspond to a first multivariate mixture Gaussian distribution model for modeling the first movable angle range and the second movable angle range, the first expected value and the first standard deviation correspond to the first movable angle range, and the second expected value and the second standard deviation correspond to the second movable angle range.

The method according to claim 16, wherein the designated joint angles include a first designated joint angle corresponding to the first joint point in the first dimension, and the step of obtaining the designated joint angles corresponding to the joint points based on the first joint angle distribution matrix includes: sampling a first angle within the first movable angle range or the second movable angle range based on the first multivariate mixture Gaussian distribution model as the first designated joint angle of the first joint point in the first dimension.

The method according to claim 17, wherein the first angle is an angle corresponding to the first expected value in the first movable angle range, or an angle corresponding to the second expected value in the second movable angle range.

The method according to claim 12, wherein the first joint point has another movable angle range in a second dimension, and the Gaussian distribution parameters include another expected value and another standard deviation, wherein the other expected value and the other standard deviation correspond to another Gaussian distribution model for modeling the other movable angle range, the designated joint angles further include a second designated joint angle corresponding to the first joint point in the second dimension, and the method further includes: sampling a second angle within the other movable angle range based on the other Gaussian distribution model as the second designated joint angle of the first joint point in the second dimension.

The method according to claim 19, wherein the second angle corresponds to the other expected value.

The method according to claim 19, further comprising: adjusting the joint angle of the first joint point in the second dimension to correspond to the second designated joint angle.

The method according to claim 1, wherein the step of adjusting the joint angle of each joint point on the avatar according to the designated joint angles includes: generating a Biovision Hierarchy (BVH) motion capture data file based on the designated joint angles, and adjusting the joint angle of each joint point on the avatar based on the BVH motion capture data file.

An electronic device, comprising: a storage circuit storing a plurality of modules; and a processor coupled to the storage circuit and accessing the modules to perform the following steps: receiving a first audio signal, and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting a joint angle of each joint point on the avatar according to the designated joint angles.