TWI747258B - Method for generating action according to audio signal and electronic device - Google Patents
- Publication number
- TWI747258B
- Authority
- TW
- Taiwan
- Prior art keywords
- joint
- angle
- audio feature
- potential
- audio signal
Description
The present invention relates to a technique for controlling an avatar, and in particular to a method and an electronic device for generating actions according to an audio signal.
In virtual reality (VR) and augmented reality (AR) experiences, avatars are a key part of the application. If an avatar has the same perception and senses as the user and can react to the environment accordingly, the user's sense of immersion is greatly improved.
The prior art includes a technique that lets an avatar dance to music. To do so, however, that technique must maintain a database storing a large number of preset dance steps from which dances are generated. This consumes considerable memory, so the technique is difficult to implement in applications on edge devices (for example, embedded systems or mobile devices).
Furthermore, when music appears in the VR/AR environment, that technique selects one or more dance steps from the database based on certain predetermined hand-crafted features and recombines them into a sequence of dance steps corresponding to the current music. Consequently, it cannot make the avatar dance creatively.
In view of this, the present invention provides a method and an electronic device for generating actions according to an audio signal, which can be used to solve the above technical problems.
The present invention provides a method for generating actions according to an audio signal, including: receiving a first audio signal and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting the joint angle of each joint point on the avatar according to the designated joint angles.
The present invention provides an electronic device including a storage circuit and a processor. The storage circuit stores a plurality of modules. The processor is coupled to the storage circuit and accesses the modules to perform the following steps: receiving a first audio signal and extracting a first high-level audio feature from the first audio signal; extracting a first latent audio feature from the first high-level audio feature; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first beat, obtaining a first joint angle distribution matrix according to the first latent audio feature, wherein the first joint angle distribution matrix includes a plurality of Gaussian distribution parameters, and the Gaussian distribution parameters correspond to a plurality of joint points on an avatar; in response to determining that the first latent audio feature indicates that the first audio signal corresponds to a first music, obtaining a plurality of designated joint angles corresponding to the joint points based on the first joint angle distribution matrix; and adjusting the joint angle of each joint point on the avatar according to the designated joint angles.
Based on the above, the method of the present invention lets an avatar improvise corresponding actions (for example, dance steps) to the current music without maintaining a dance step database, and is therefore suitable for electronic devices implemented as edge devices.
Please refer to FIG. 1, which is a schematic diagram of an electronic device according to an embodiment of the present invention. In different embodiments, the electronic device 100 is, for example, a computer device, an embedded system, a mobile device, or another device that can provide AR/VR or similar services, but is not limited thereto. As shown in FIG. 1, the electronic device 100 includes a storage circuit 102 and a processor 104.
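The storage circuit 102 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, other similar device, or a combination of these devices, and can be used to record a plurality of program codes or modules.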
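The processor 104 is coupled to the storage circuit 102 and may be a general-purpose processor, a special-purpose processor, a conventional processor, a digital signal processor, a plurality of microprocessors, one or more microprocessors combined with a digital signal processor core, a controller, a microcontroller, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), any other kind of integrated circuit, a state machine, an Advanced RISC Machine (ARM) based processor, or the like.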
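In the embodiments of the present invention, the processor 104 can access the modules and program codes recorded in the storage circuit 102 to implement the method for generating actions according to an audio signal proposed by the present invention, as detailed below.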
Please refer to FIG. 2, which is a flowchart of the method for generating actions according to an audio signal according to an embodiment of the present invention. The method of this embodiment can be executed by the electronic device 100 of FIG. 1, and the details of each step of FIG. 2 are described below with reference to the components shown in FIG. 1. In addition, to make the content easier to understand, the system architecture diagram shown in FIG. 3 is also used in the description below; it serves only as an example and is not intended to limit the possible implementations of the present invention.
In brief, upon receiving a segment of audio signal (for example, one audio frame), the method of the present invention determines the joint angle of each joint of the avatar in each dimension, so that the avatar as a whole presents an appropriate action. In different embodiments, the audio signal may come from any kind of sound, such as music, ambient sound, or speech, but is not limited thereto.
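In FIG. 3, the audio signals F1~FN are, for example, a plurality of consecutive audio frames. The processor 104 can process each of the audio signals F1~FN in a similar way to generate an avatar action corresponding to the audio signal under consideration. For ease of description, the audio signal F1 is used as an example below, but this is not intended to limit the possible implementations of the present invention.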
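First, in step S210, the processor 104 can receive the audio signal F1 and extract a high-level audio feature H1 from the audio signal F1. In one embodiment, the audio signal F1 may include an audio frame, which can be represented as a vector (or array) with a specific dimension (for example, 2048x1), but is not limited thereto. In one embodiment, the processor 104 can input the audio frame into a convolutional neural network (CNN) N1, so that the CNN N1 extracts the high-level audio feature H1 from the audio frame. In the embodiments of the present invention, the CNN N1 may include one or more convolutional layers for extracting the corresponding high-level audio feature from the received audio frame, but is not limited thereto. For the technical details of extracting the high-level audio feature H1 with the CNN N1, reference may be made to the relevant prior art literature; they are not repeated here.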
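As a rough illustration of step S210 only, the following sketch applies a single 1-D convolutional layer with ReLU to a 2048x1 audio frame. The kernel values, stride, and function name are assumptions made for this example; the source does not disclose the actual layers or weights of the CNN N1.

```python
import random

def conv1d_relu(frame, kernel, stride):
    """One hypothetical CNN layer: valid 1-D cross-correlation followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(0, len(frame) - k + 1, stride):
        acc = sum(frame[start + i] * kernel[i] for i in range(k))
        out.append(max(0.0, acc))  # ReLU keeps only non-negative responses
    return out

rng = random.Random(0)
# Stand-in for audio frame F1 (2048x1); real input would be audio samples.
frame_f1 = [rng.uniform(-1.0, 1.0) for _ in range(2048)]
kernel = [0.25, 0.5, 0.25]  # assumed kernel, not taken from the source
h1 = conv1d_relu(frame_f1, kernel, stride=2)
print(len(h1))
```

A trained network would stack several such layers with learned kernels; this sketch only shows how a frame shrinks into a shorter feature vector.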
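Next, in step S220, the processor 104 can extract a latent audio feature L1 from the high-level audio feature H1. In one embodiment, the processor 104 can input the high-level audio feature H1 into a first recurrent neural network (RNN) N2, so that the first RNN N2 extracts the latent audio feature L1 from the high-level audio feature H1. For the technical details of extracting the latent audio feature L1 with the first RNN N2, reference may be made to the relevant prior art literature; they are not repeated here.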
In addition, in this embodiment, besides outputting the latent audio feature L1 based on the high-level audio feature H1, the first RNN N2 can also output a first internal state IS11; for details, refer to the technical literature on RNNs. In the embodiments of the present invention, the first RNN N2 may include a multi-stack structure for extracting the corresponding latent audio feature from the received high-level audio feature, but is not limited thereto.
In addition, in one embodiment, the first internal state IS11 allows the first RNN N2, when processing the high-level audio feature H2 corresponding to the next audio signal F2, to also take the high-level audio feature H1 of the previous stage into account when generating the corresponding latent audio feature L2. The details are described later.
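The role of the internal state can be sketched with a minimal Elman-style recurrent step. This is a simplification under stated assumptions: the weights are made up, the feature size is 2, and the latent feature is taken to be equal to the new internal state, whereas the first RNN N2 in the source is a multi-stack structure whose exact form is not disclosed.

```python
import math

def rnn_step(x, state, w_in, w_rec):
    """One recurrent step: new_state = tanh(W_in * x + W_rec * state)."""
    new_state = []
    for j in range(len(state)):
        total = sum(w_in[j][i] * x[i] for i in range(len(x)))
        total += sum(w_rec[j][k] * state[k] for k in range(len(state)))
        new_state.append(math.tanh(total))
    # Return (latent feature, internal state); identical in this toy cell.
    return new_state, list(new_state)

w_in = [[0.5, -0.2], [0.1, 0.3]]   # assumed input weights
w_rec = [[0.4, 0.0], [0.0, 0.4]]   # assumed recurrent weights
h1, h2 = [1.0, 0.0], [0.0, 1.0]    # stand-ins for high-level features H1, H2

l1, is11 = rnn_step(h1, [0.0, 0.0], w_in, w_rec)   # frame F1, empty history
l2, is21 = rnn_step(h2, is11, w_in, w_rec)         # frame F2 sees IS11
l2_no_history = rnn_step(h2, [0.0, 0.0], w_in, w_rec)[0]
print(l2, l2_no_history)
```

Because IS11 is fed back in, L2 differs from what the same cell would produce for H2 alone, which is the sense in which L2 "refers to" the previous stage.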
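In one embodiment, the processor 104 can determine, based on the latent audio feature L1, whether the audio signal F1 corresponds to a beat (that is, whether it is on beat) and whether it corresponds to music. In the embodiments of the present invention, the processor 104 can input the latent audio feature L1 into a specific neural network N3 (composed, for example, of a plurality of fully connected layers), so that the specific neural network N3 determines, based on the latent audio feature L1, whether the audio signal F1 corresponds to a beat and whether it corresponds to music, but is not limited thereto.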
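One way to picture the specific neural network N3 is a fully connected layer with two sigmoid outputs, one for "on beat" and one for "is music". The sketch below assumes a single layer, a 0.5 decision threshold, and hand-picked weights; none of these values come from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(latent, w_beat, b_beat, w_music, b_music, threshold=0.5):
    """Return (on_beat, is_music) flags from a latent audio feature."""
    beat_p = sigmoid(sum(w * x for w, x in zip(w_beat, latent)) + b_beat)
    music_p = sigmoid(sum(w * x for w, x in zip(w_music, latent)) + b_music)
    return beat_p >= threshold, music_p >= threshold

# Hypothetical 2-dimensional latent feature and weights.
on_beat, is_music = classify([1.0, -1.0], [2.0, 0.0], 0.0, [0.0, 2.0], 0.0)
print(on_beat, is_music)
```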
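For ease of description, it is assumed below that the audio signal F1 corresponds to a beat and corresponds to music (that is, it is not noise, human voice, or another non-musical sound). Therefore, in step S230, in response to determining that the latent audio feature L1 indicates that the audio signal F1 corresponds to a beat, the processor 104 can obtain a joint angle distribution matrix M1 according to the latent audio feature L1, where the joint angle distribution matrix M1 may include a plurality of Gaussian distribution parameters that correspond to a plurality of joint points on an avatar. In one embodiment, the processor 104 can input the latent audio feature L1 into a second RNN N4, so that the second RNN N4 generates the joint angle distribution matrix M1 based on the latent audio feature L1. In addition, the second RNN N4 can also generate a second internal state IS12 based on the latent audio feature L1.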
In one embodiment, the avatar is, for example, a character in an AR/VR environment configured to dance to music. In addition, according to the relevant specification of the Biovision Hierarchy (BVH), an avatar can be defined with an absolute position of a hip joint point (represented by x, y, z) and 52 other joint points, each of which can be represented by a set of joint rotation angles in three-dimensional space, for example (Rx, Ry, Rz). For example, for a first joint point on the avatar, the corresponding Rx, Ry, and Rz are the joint angles in a first dimension (for example, the X axis), a second dimension (for example, the Y axis), and a third dimension (for example, the Z axis), respectively, but are not limited thereto.
To facilitate the description of the concept of the present invention, it is assumed below that the joint points of the avatar under consideration include the above hip joint point and the 52 other joint points, but the present invention is not limited thereto. It is also assumed below that the action of the avatar can be defined based on the BVH specification, but the present invention is not limited thereto. In this case, the action of the avatar can be determined according to a BVH motion capture data file. In one embodiment, a BVH motion capture data file may include 159 values (3 + 52 x 3), which correspond respectively to the absolute position of the hip joint point (that is, x, y, z) and the (Rx, Ry, Rz) of each of the 52 other joint points. Therefore, once the BVH motion capture data file is obtained, the action of the avatar can be determined accordingly, and the present invention can determine the 159 values in the BVH motion capture data file based on the generated joint angle distribution matrix M1, thereby determining the action of the avatar.
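The count of 159 values can be checked with a small indexing helper. The ordering used here (hip x, y, z first, then Rx, Ry, Rz per joint) is an assumption for illustration; actual BVH files declare their own channel order per joint, and the source does not fix one.

```python
NUM_OTHER_JOINTS = 52
AXES = ("Rx", "Ry", "Rz")

def channel_index(joint, axis):
    """Index of a rotation channel in the assumed layout.

    Indices 0..2 hold the hip position x, y, z; joint runs from 0 to 51.
    """
    return 3 + 3 * joint + AXES.index(axis)

TOTAL_VALUES = 3 + NUM_OTHER_JOINTS * len(AXES)
print(TOTAL_VALUES, channel_index(0, "Rx"), channel_index(51, "Rz"))
```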
Specifically, in the first embodiment, the joint angle distribution matrix M1 can be implemented as a matrix of dimension 159x2, whose 159 rows correspond respectively to the above x, y, z and the (Rx, Ry, Rz) of each of the 52 other joint points. For example, suppose a certain joint point on the avatar (hereinafter, the first joint point) has a first movable angle range in the first dimension (which can be understood as the movable angle range of Rx of the first joint point), and this first movable angle range is modeled in the present invention as a first Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Rx of the first joint point may include two elements, which are respectively the expected value (denoted $\mu$) and the standard deviation (denoted $\sigma$) of the first Gaussian distribution model. As another example, suppose the first joint point also has another movable angle range in the second dimension (which can be understood as the movable angle range of Ry of the first joint point), and this other movable angle range is modeled in the present invention as another Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Ry of the first joint point may include two elements, which are respectively the expected value and the standard deviation of that other Gaussian distribution model.
Based on the above teachings, those with ordinary skill in the art should be able to understand the meaning and content of the remaining rows of the joint angle distribution matrix M1, which are not repeated here. In addition, in the first embodiment, the first column of the joint angle distribution matrix M1 may consist of the expected values of the rows, and the second column may consist of the standard deviations of the rows, but is not limited thereto.
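The 159x2 layout of the first embodiment can be pictured as a list of (expected value, standard deviation) pairs. The numbers below are placeholders rather than model output, and the position of the first joint point's Rx entry (index 3) assumes that the hip position occupies the first three entries.

```python
NUM_ENTRIES = 3 + 52 * 3  # 159 entries: hip x, y, z plus 52 joints x 3 angles

# M1 as 159 pairs of [expected value, standard deviation]; placeholder values.
m1 = [[0.0, 1.0] for _ in range(NUM_ENTRIES)]

FIRST_JOINT_RX = 3                  # assumed position of the first joint's Rx
m1[FIRST_JOINT_RX] = [30.0, 5.0]    # e.g. mean 30 degrees, std 5 degrees

expected_values = [pair[0] for pair in m1]      # all 159 expected values
standard_deviations = [pair[1] for pair in m1]  # all 159 standard deviations
print(len(m1), len(m1[0]), expected_values[FIRST_JOINT_RX])
```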
After obtaining the joint angle distribution matrix M1, in step S240, in response to determining that the latent audio feature L1 indicates that the audio signal F1 corresponds to music, the processor 104 can obtain a plurality of designated joint angles corresponding to the joint points based on the joint angle distribution matrix M1.
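Taking the first joint point as an example again, if the processor 104 wants to obtain a first designated joint angle of the first joint point in the first dimension, the processor 104 can sample a first angle within the first movable angle range based on the first Gaussian distribution model and use it as the first designated joint angle of the first joint point in the first dimension. For ease of understanding, FIG. 4 is used in the description below.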
Please refer to FIG. 4, which shows the first Gaussian distribution model used to model the first movable angle range according to the first embodiment of the present invention. In FIG. 4, it is assumed that the first joint point has a first movable angle range R1 in the first dimension, and the first Gaussian distribution model G1 can be used, for example, to model the first movable angle range R1. In this case, the processor 104 can sample a first angle within the first movable angle range R1 based on the first Gaussian distribution model G1 and use it as the first designated joint angle of the first joint point in the first dimension. In one embodiment, the processor 104 can, for example, randomly sample the first angle within the first movable angle range R1 based on the first Gaussian distribution model G1. In another embodiment, the processor 104 can also directly take the angle within the first movable angle range R1 that corresponds to the expected value (that is, $\mu$) as the first designated joint angle, but is not limited thereto.
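The two sampling options described for FIG. 4 can be sketched as follows. Rejection sampling with a clamped fallback is an assumption of this example; the source only states that the angle is sampled within the movable angle range based on the Gaussian model. The range limits and parameters are hypothetical.

```python
import random

def sample_angle(mean, std, lo, hi, rng, use_mean=False, max_tries=100):
    """Pick a designated joint angle inside the movable range [lo, hi]."""
    if use_mean:
        # Deterministic option: take the expected value (clamped to the range).
        return min(max(mean, lo), hi)
    for _ in range(max_tries):
        angle = rng.gauss(mean, std)   # random option: draw from the Gaussian
        if lo <= angle <= hi:
            return angle
    return min(max(mean, lo), hi)      # fallback if every draw fell outside

rng = random.Random(42)
random_angle = sample_angle(30.0, 5.0, 0.0, 60.0, rng)
mean_angle = sample_angle(30.0, 5.0, 0.0, 60.0, rng, use_mean=True)
print(random_angle, mean_angle)
```

Random sampling gives the avatar varied motion for the same audio, while taking the expected value gives a repeatable pose.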
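Similarly, if the processor 104 wants to obtain the designated joint angle of the first joint point in the second dimension, the processor 104 can (randomly) sample an angle within the other movable angle range based on the other Gaussian distribution model and use it as another designated joint angle of the first joint point in the second dimension. Based on the above teachings, those with ordinary skill in the art should be able to understand how the processor 104 obtains the designated joint angle of each joint point in each dimension; the details are not repeated here.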
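After obtaining the designated joint angles corresponding to the joint points, in step S250 the processor 104 can adjust the joint angle of each joint point on the avatar according to the designated joint angles. In the first embodiment, the processor 104 can output the designated joint angles corresponding to the joint points in the form of a designated joint angle vector S1 (whose dimension is, for example, 159x1). For example, if the processor 104 takes the angle corresponding to the expected value as the designated joint angle for every joint point, the processor 104 can directly use the first column of the joint angle distribution matrix M1 as the designated joint angle vector S1, but the present invention is not limited thereto.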
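In this case, the processor 104 can, for example, generate a corresponding BVH motion capture data file based on the designated joint angles in the designated joint angle vector S1, and adjust the joint angle of each joint point on the avatar based on this BVH motion capture data file. For example, the processor 104 can adjust the joint angle of the first joint point in the first dimension to the first designated joint angle (for example, the expected value of the first Gaussian distribution model G1). The processor 104 can also adjust the joint angle of the first joint point in the second dimension to the other designated joint angle (for example, the expected value of the other Gaussian distribution model). In this way, the processor 104 can adjust the joint angles of the joint points on the avatar in the different dimensions according to the content of the BVH motion capture data file, so that the avatar presents a specific action (for example, a dance step).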
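A minimal sketch of turning the distribution matrix into one frame of motion data: the expected values are taken as the designated joint angle vector S1 and written as one space-separated line, which is how a frame is stored in the MOTION section of a BVH file. The helper name, number formatting, and placeholder matrix are assumptions; the hierarchy header of a real BVH file is omitted.

```python
def to_bvh_frame_line(matrix, which=0):
    """Take one parameter per entry as S1 and format it as a BVH frame line."""
    s1 = [pair[which] for pair in matrix]           # which=0: expected values
    return s1, " ".join(f"{value:.4f}" for value in s1)

# Placeholder M1: 159 pairs of [expected value, standard deviation].
m1 = [[float(i), 1.0] for i in range(159)]
s1, frame_line = to_bvh_frame_line(m1)
print(frame_line[:40])
```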
As can be seen from the above, unlike the conventional practice of selecting existing dance steps from a database and recombining them, the method of the present invention determines the joint angle of each joint point of the avatar in each dimension according to the current audio signal, so that the avatar can improvise and dance on beat to the current music.
In other embodiments, a single joint point may have two or more movable angle ranges in a single dimension, and these movable angle ranges may be modeled as a multivariate mixture Gaussian model. This is further described below as the second embodiment.
In the second embodiment, it is assumed that a single joint point has two movable angle ranges in a single dimension, but this is not limiting. In this case, the joint angle distribution matrix M1 can be implemented as a matrix of dimension 159x4, whose 159 rows correspond respectively to the above x, y, z and the (Rx, Ry, Rz) of each of the 52 other joint points. Taking the first joint point as an example again, suppose the first joint point has a first and a second movable angle range in the first dimension (which can be understood as the movable angle ranges of Rx of the first joint point), and these two ranges are modeled in the present invention as a first multivariate mixture Gaussian distribution model. In this case, the row of the joint angle distribution matrix M1 corresponding to Rx of the first joint point may include four elements, which are respectively the first expected value (denoted $\mu_1$), the first standard deviation (denoted $\sigma_1$), the second expected value (denoted $\mu_2$), and the second standard deviation (denoted $\sigma_2$) of the first multivariate mixture Gaussian distribution model.
Based on the above teachings, those with ordinary skill in the art should be able to understand the meaning and content of the remaining rows of the joint angle distribution matrix M1 in the second embodiment, which are not repeated here. In addition, in the second embodiment, the first column of the joint angle distribution matrix M1 may consist of the first expected values of the rows, the second column may consist of the first standard deviations, the third column may consist of the second expected values, and the fourth column may consist of the second standard deviations, but is not limited thereto.
After obtaining the joint angle distribution matrix M1, in step S240, in response to determining that the latent audio feature L1 indicates that the audio signal F1 corresponds to music, the processor 104 can obtain a plurality of designated joint angles corresponding to the joint points based on the joint angle distribution matrix M1.
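Taking the first joint point as an example again, if the processor 104 wants to obtain the first designated joint angle of the first joint point in the first dimension, the processor 104 can sample a first angle within the first movable angle range or the second movable angle range based on the first multivariate mixture Gaussian distribution model and use it as the first designated joint angle of the first joint point in the first dimension. For ease of understanding, FIG. 5 is used in the description below.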
Please refer to FIG. 5, which shows the first multivariate mixture Gaussian distribution model used to model the first and second movable angle ranges according to the second embodiment of the present invention. In FIG. 5, it is assumed that the first joint point has a first movable angle range R11 and a second movable angle range R12 in the first dimension, and the first multivariate mixture Gaussian distribution model G1' can be used, for example, to model the first movable angle range R11 (which corresponds to $\mu_1$ and $\sigma_1$) and the second movable angle range R12 (which corresponds to $\mu_2$ and $\sigma_2$). In this case, the processor 104 can sample a first angle within the first movable angle range R11 or the second movable angle range R12 based on the first multivariate mixture Gaussian distribution model G1' and use it as the first designated joint angle of the first joint point in the first dimension. In one embodiment, the processor 104 can, for example, randomly sample the first angle within R11 or R12 based on the first multivariate mixture Gaussian distribution model G1'. In another embodiment, the processor 104 can also directly take the angle within R11 or R12 that corresponds to an expected value (that is, $\mu_1$ or $\mu_2$) as the first designated joint angle, but is not limited thereto.
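In other embodiments, suppose two controllable avatars A and B exist in the AR/VR environment and both have the first joint point. The processor 104 can then sample an angle within the first movable angle range R11 based on the first multivariate mixture Gaussian distribution model G1' as the first designated joint angle of the first joint point of avatar A in the first dimension. The processor 104 can also sample an angle within the second movable angle range R12 based on the first multivariate mixture Gaussian distribution model G1' as the first designated joint angle of the first joint point of avatar B in the first dimension, so that different avatars present different dance steps in response to the current music, but this is not limiting. Based on the above teachings, those with ordinary skill in the art should be able to understand how the processor 104 obtains the designated joint angle of each joint point in each dimension in the second embodiment; the details are not repeated here.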
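The second embodiment's sampling, including the two-avatar case, can be sketched as below. Each entry holds four parameters (first expected value, first standard deviation, second expected value, second standard deviation). The equal-probability choice between components when none is requested is an assumption, since the source gives no mixture weights, and the parameter values are hypothetical.

```python
import random

def sample_from_mixture(params, rng, component=None, use_mean=False):
    """params = [mu1, sigma1, mu2, sigma2]; component is 0, 1, or None."""
    if component is None:
        component = rng.randrange(2)   # assumed equal weights
    mu = params[2 * component]
    sigma = params[2 * component + 1]
    return mu if use_mean else rng.gauss(mu, sigma)

rng = random.Random(7)
rx_params = [-40.0, 3.0, 45.0, 4.0]   # hypothetical entry for the first joint's Rx

angle_a = sample_from_mixture(rx_params, rng, component=0)  # avatar A, range R11
angle_b = sample_from_mixture(rx_params, rng, component=1)  # avatar B, range R12
print(angle_a, angle_b)
```

Drawing from different components gives the two avatars visibly different poses for the same audio frame.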
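In addition, the first joint point may also have two movable angle ranges in the second dimension, and these two ranges may likewise be modeled as another multivariate mixture Gaussian distribution model. In this case, the way the processor 104 determines the designated joint angle of the first joint point in the second dimension follows the above teachings and is not repeated here. The movable angle ranges of the other joint points in each dimension can also be modeled as corresponding multivariate mixture Gaussian models based on the above teachings; the details likewise follow the above teachings and are not repeated here.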
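After obtaining the designated joint angles corresponding to the joint points, in step S250 of the second embodiment, the processor 104 can adjust the joint angle of each joint point on the avatar according to the designated joint angles. In the second embodiment, the processor 104 can output the designated joint angles corresponding to the joint points in the form of a designated joint angle vector S1 (whose dimension is, for example, 159x1). For example, if the processor 104 takes the angle corresponding to the first expected value as the designated joint angle for every joint point, the processor 104 can directly use the first column of the joint angle distribution matrix M1 as the designated joint angle vector S1. As another example, if the processor 104 takes the angle corresponding to the second expected value as the designated joint angle for every joint point, the processor 104 can directly use the third column of the joint angle distribution matrix M1 as the designated joint angle vector S1, but the present invention is not limited thereto.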
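In this case, the processor 104 can, for example, generate a corresponding BVH motion capture data file based on the designated joint angles in the designated joint angle vector S1, and adjust the joint angle of each joint point on the avatar based on this BVH motion capture data file. For example, the processor 104 can adjust the joint angle of the first joint point in the first dimension to the first designated joint angle (for example, the first or second expected value of the first multivariate mixture Gaussian distribution model G1'). In this way, the processor 104 can adjust the joint angles of the joint points on the avatar in the different dimensions according to the content of the BVH motion capture data file, so that the avatar presents a specific action (for example, a dance step).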
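Please refer to FIG. 6, which is a schematic diagram of a BVH motion capture data file and the corresponding avatar according to an embodiment of the present invention. In this embodiment, after the processor 104 generates the BVH motion capture data file 610 according to the previous teachings, the processor 104 can adjust the joint angle of each joint point on the avatar 620 in each dimension according to its content, so that the avatar 620 presents a specific action, dance step, posture, and so on, but is not limited thereto.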
It should be understood that the above embodiments assume that the audio signal F1 corresponds to both a beat and music. For other audio signals that do not correspond to a beat or to music, the method of the present invention can be carried out through different mechanisms, as further described below in the third embodiment.
For example, in the third embodiment, it is assumed that the audio signal F2 following the audio signal F1 corresponds to music but does not correspond to a beat (that is, it is not on beat). In this case, the processor 104 can still perform step S210 to receive the audio signal F2 and extract a high-level audio feature H2 from the audio signal F2. In one embodiment, the processor 104 can input the audio signal F2 (for example, an audio frame) into the CNN N1, so that the CNN N1 extracts the high-level audio feature H2 from the audio signal F2.
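Next, in step S220, the processor 104 can extract a latent audio feature L2 from the high-level audio feature H2. In one embodiment, the processor 104 can input the high-level audio feature H2 into the first RNN N2, so that the first RNN N2 extracts the latent audio feature L2 from the high-level audio feature H2 based on the first internal state IS11. In this embodiment, since the first internal state IS11 comes from the operation of the previous stage, it can be regarded as the historical internal state in the third embodiment. Moreover, since the first internal state IS11 carries information about the high-level audio feature H1 of the previous stage, the latent audio feature L2 extracted by the first RNN N2 also takes the information of the previous stage (or stages) into account, but is not limited thereto.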
In addition, in this embodiment, besides outputting the latent audio feature L2 based on the high-level audio feature H2, the first RNN N2 can also output a first internal state IS21 for use by the next stage, but is not limited thereto.
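In the third embodiment, the processor 104 can likewise input the latent audio feature L2 into the specific neural network N3, so that the specific neural network N3 determines, based on the latent audio feature L2, whether the audio signal F2 corresponds to a beat and whether it corresponds to music, but is not limited thereto.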
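Since the audio signal F2 in the third embodiment is assumed to correspond to music but not to be on beat, the processor 104 can perform step S230 in a way different from the first and second embodiments to generate the corresponding joint angle distribution matrix M2. Specifically, in the third embodiment, the processor 104 can obtain a historical joint angle distribution matrix, which may include a plurality of historical Gaussian distribution parameters that correspond to the joint points on the avatar. In the third embodiment, the historical joint angle distribution matrix is, for example, the joint angle distribution matrix M1 generated in the previous stage, and the historical Gaussian distribution parameters are the content of the joint angle distribution matrix M1, but this is not limiting.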
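The processor 104 can then convert this historical joint angle distribution matrix (that is, the joint angle distribution matrix M1) into a reference audio feature L2' and define this reference audio feature L2' as the (new) latent audio feature L2. After that, the processor 104 can, for example, input the reference audio feature L2' (that is, the new latent audio feature L2) into the second RNN N4, so that the second RNN N4 obtains the joint angle distribution matrix M2.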
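In short, since the audio signal F2 is not on beat, the processor 104 can ignore the original latent audio feature L2 and instead input the reference audio feature L2', converted from the joint angle distribution matrix M1, into the second RNN N4 as the (new) latent audio feature L2, so that the second RNN N4 obtains the joint angle distribution matrix M2 from it.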
In one embodiment, to convert the joint angle distribution matrix M1 into a reference audio feature L2' whose dimension is suitable for input to the second RNN N4, the processor 104 can simply use a fully connected neural network layer to perform the conversion. The processor 104 can also perform the conversion based on convolutional layers and pooling layers, but is not limited thereto. For the principle of feeding the (converted) joint angle distribution matrix M1 into the second RNN N4 to obtain the joint angle distribution matrix M2, refer to "Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis, cs.LG, 2017"; it is not repeated here.
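The fully connected conversion can be sketched as flattening the matrix and applying one linear layer. The toy sizes (a 3x2 matrix and a 4-dimensional feature) and the constant weights are assumptions to keep the example short; in the source the matrix would be 159x2 or 159x4 and the weights would be learned.

```python
def matrix_to_reference_feature(matrix, weights, bias):
    """Flatten the distribution matrix and apply one fully connected layer."""
    flat = [value for entry in matrix for value in entry]
    return [
        sum(w * x for w, x in zip(weight_vector, flat)) + b
        for weight_vector, b in zip(weights, bias)
    ]

# Toy stand-in for M1 and for the layer parameters.
m1 = [[1.0, 0.5], [2.0, 0.5], [3.0, 0.5]]
weights = [[0.1] * 6 for _ in range(4)]   # 4 outputs, 6 flattened inputs
bias = [0.0] * 4

l2_reference = matrix_to_reference_feature(m1, weights, bias)  # plays L2'
print(len(l2_reference))
```

The output length is chosen to match the input size expected by the second RNN, which is what makes the previous matrix usable in place of the latent audio feature.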
In addition, in the third embodiment, the second RNN N4 can further generate the joint angle distribution matrix M2 based on both the reference audio feature L2' and the second internal state IS12, so that a better joint angle distribution matrix M2 is produced by taking the information of one or more previous stages into account, but is not limited thereto.
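After generating the joint angle distribution matrix M2, the processor 104 can, for example, generate the corresponding designated joint angle vector S1 based on the mechanisms taught in the first and second embodiments, and accordingly adjust the action, dance step, or posture of the avatar to correspond to the audio signal F2.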
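In the fourth embodiment, it is assumed that the audio signal F3 corresponds to both a beat and music, so the processor 104 can adjust the action, dance step, or posture of the avatar to correspond to the audio signal F3 based on the mechanisms taught in the first and second embodiments; the details are not repeated here.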
In addition, in the fifth embodiment, if the specific neural network N3 determines that the latent audio feature (not separately labeled) of the audio signal FN indicates that the audio signal FN corresponds to neither a beat nor music, the processor 104 can leave the joint angles of the joint points on the avatar unadjusted, or adjust the avatar to present an idle posture. This prevents the avatar from dancing on its own when there is no music, but is not limited thereto.
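The per-frame branching across the embodiments can be summarized in a small dispatch function. The returned labels are made up for this sketch, and treating an on-beat but non-music frame as idle is an assumption, since the source does not describe that combination.

```python
def select_action(on_beat, is_music):
    """Choose how a frame is handled, given the two flags from network N3."""
    if not is_music:
        # Fifth embodiment (and, by assumption, any non-music frame): stay idle.
        return "idle"
    if on_beat:
        # First, second, and fourth embodiments: feed the latent audio feature
        # into the second RNN to obtain a new joint angle distribution matrix.
        return "use_latent_feature"
    # Third embodiment: off beat, so convert the previous matrix into a
    # reference feature and feed that into the second RNN instead.
    return "use_previous_matrix"

for flags in [(True, True), (False, True), (False, False)]:
    print(flags, select_action(*flags))
```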
Please refer to FIG. 7, which is a schematic diagram of the training phase according to an embodiment of the present invention. The training mechanism shown in FIG. 7 can be used to produce the CNN N1, the first RNN N2, the specific neural network N3, and the second RNN N4 mentioned in the previous embodiments. Specifically, in this embodiment, the processor 104 can first input music training data into the neural networks to be trained (that is, the CNN N1, the first RNN N2, the specific neural network N3, and the second RNN N4). In one embodiment, the model parameters of each neural network can be initialized to random values, but this is not limiting.
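Then, based on dance step training data, the processor 104 can model the movable angle range of each joint point on the avatar in each dimension as a corresponding (univariate/multivariate) Gaussian model and generate a predicted dance step accordingly. The processor 104 can then compute a loss function based on the predicted dance step and the corresponding dance step training data, and adjust the model parameters of the neural networks (for example, the neuron weights) according to the result of the loss function. This procedure can be repeated until the generated predicted dance step is sufficiently close to the corresponding dance step training data. For the technical details of the training phase, reference may be made to the relevant prior art literature; they are not repeated here.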
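The source does not name the loss function. One plausible choice when a network outputs an expected value and a standard deviation per joint angle is the Gaussian negative log-likelihood of the ground-truth angle, sketched below purely as an assumption with made-up numbers.

```python
import math

def gaussian_nll(target, mean, std):
    """Negative log-likelihood of target under a Gaussian with (mean, std)."""
    variance = std * std
    return 0.5 * math.log(2.0 * math.pi * variance) + (target - mean) ** 2 / (2.0 * variance)

def frame_loss(target_angles, matrix):
    """Average NLL over all entries of a predicted [mean, std] matrix."""
    total = sum(gaussian_nll(t, mu, sd) for t, (mu, sd) in zip(target_angles, matrix))
    return total / len(target_angles)

targets = [30.0, -10.0]                       # hypothetical ground-truth angles
good_prediction = [[29.0, 2.0], [-9.0, 2.0]]  # means close to the targets
poor_prediction = [[0.0, 2.0], [20.0, 2.0]]   # means far from the targets
print(frame_loss(targets, good_prediction), frame_loss(targets, poor_prediction))
```

A lower value means the predicted distributions place more probability on the recorded dance step, which is the direction in which the weights would be adjusted.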
In summary, the method and electronic device proposed by the present invention let an avatar in an AR/VR environment improvise and dance on beat to the current music without maintaining a dance step database. In addition, the method of the present invention lets the electronic device consume less memory and perform the related computation in real time. Therefore, even if the electronic device is an edge device with limited resources, the method of the present invention still lets the electronic device smoothly control the avatar to dance to the music.
Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention, so the scope of protection of the present invention shall be defined by the appended claims.
100: electronic device
102: storage circuit
104: processor
610: BVH motion capture data file
620: avatar
G1: first Gaussian distribution model
G1': first multivariate mixture Gaussian distribution model
H1, H2: high-level audio features
IS11, IS21: first internal states
IS12: second internal state
L1, L2: latent audio features
L2': reference audio feature
M1, M2: joint angle distribution matrices
N1: CNN
N2: first RNN
N3: specific neural network
N4: second RNN
R1, R11: first movable angle range
R12: second movable angle range
S1, S2: designated joint angle vectors
S210~S250: steps
FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for generating actions according to an audio signal according to an embodiment of the present invention.
FIG. 3 is a system architecture diagram according to an embodiment of the present invention.
FIG. 4 shows a first Gaussian distribution model used to model the first movable angle range according to the first embodiment of the present invention.
FIG. 5 shows a first multivariate Gaussian mixture distribution model used to model the first and second movable angle ranges according to the second embodiment of the present invention.
FIG. 6 is a schematic diagram of a BVH motion capture data file and the corresponding virtual avatar according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of the training phase according to an embodiment of the present invention.
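The reference numerals and FIG. 4 suggest that a joint angle distribution matrix (M1) describes each joint with a Gaussian model of its movable angle range, from which a designated joint angle vector (S1) is drawn. The following is a minimal sketch of that sampling step only, not the patented implementation. It assumes that each row of the matrix is a (mean, standard deviation) pair in degrees. The function name `sample_joint_angles`, the `max_sigma` clamp, and the example values are hypothetical and do not come from the patent text.

```python
import random
from typing import List, Sequence, Tuple

# Assumed layout (hypothetical): one (mean, std) pair in degrees per joint.
JointAngleDistribution = Sequence[Tuple[float, float]]


def sample_joint_angles(
    distribution: JointAngleDistribution,
    rng: random.Random,
    max_sigma: float = 3.0,
) -> List[float]:
    """Draw one angle per joint from its Gaussian model.

    Each sample is clamped to mean +/- max_sigma * std. The clamp is an
    illustrative safeguard against extreme poses, not a step from the source.
    """
    angles = []
    for mean, std in distribution:
        if std < 0:
            raise ValueError("standard deviation must be non-negative")
        angle = rng.gauss(mean, std)
        low = mean - max_sigma * std
        high = mean + max_sigma * std
        angles.append(min(max(angle, low), high))
    return angles


if __name__ == "__main__":
    # Three hypothetical joints; a real matrix would come from the network.
    m1 = [(0.0, 5.0), (45.0, 10.0), (-30.0, 2.5)]
    print(sample_joint_angles(m1, random.Random(0)))
```

Because the angles are sampled rather than looked up, repeated calls with the same matrix give different poses, which matches the improvised behaviour described in the summary. Passing a seeded `random.Random` makes a run reproducible for testing.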
Claims (23)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109114298A TWI747258B (en) | 2020-04-29 | 2020-04-29 | Method for generating action according to audio signal and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109114298A TWI747258B (en) | 2020-04-29 | 2020-04-29 | Method for generating action according to audio signal and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202141233A TW202141233A (en) | 2021-11-01 |
TWI747258B TWI747258B (en) | 2021-11-21 |
Family
ID=79907519
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109114298A TWI747258B (en) | 2020-04-29 | 2020-04-29 | Method for generating action according to audio signal and electronic device |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI747258B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160338644A1 (en) * | 2013-09-17 | 2016-11-24 | Medibotics Llc | Smart Clothing for Ambulatory Human Motion Capture |
US20180020951A1 (en) * | 2016-07-25 | 2018-01-25 | Patrick Kaifosh | Adaptive system for deriving control signals from measurements of neuromuscular activity |
TW201935408A (en) * | 2017-04-28 | 2019-09-01 | 美商英特爾股份有限公司 | Compute optimizations for low precision machine learning operations |
WO2019217419A2 (en) * | 2018-05-08 | 2019-11-14 | Ctrl-Labs Corporation | Systems and methods for improved speech recognition using neuromuscular information |
2020-04-29: TW application TW109114298A, patent TWI747258B (en), active
Also Published As
Publication number | Publication date |
---|---|
TW202141233A (en) | 2021-11-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
WO2018171717A1 (en) | Automated design method and system for neural network processor | |
WO2021254499A1 (en) | Editing model generation method and apparatus, face image editing method and apparatus, device, and medium | |
US11321891B2 (en) | Method for generating action according to audio signal and electronic device | |
CN110400575B (en) | Inter-channel feature extraction method, audio separation method and device and computing equipment | |
JP2023022090A (en) | Responsive video generation method and generation program | |
CN107832558B (en) | Intelligent generation method for creative scene of digital stage | |
TW202011280A (en) | Method of operating a searching framework system | |
JP6567610B2 (en) | Synchronizing voice and virtual motion, system and robot body | |
WO2021036665A1 (en) | Animated image driving method and apparatus based on artificial intelligence | |
JP2021128327A (en) | Mouth shape feature prediction method, device, and electronic apparatus | |
TWI747258B (en) | Method for generating action according to audio signal and electronic device | |
CN114401439B (en) | Dance video generation method, device and storage medium | |
WO2022253094A1 (en) | Image generation method and apparatus, and device and medium | |
CN113571087B (en) | Method for generating action according to audio signal and electronic device | |
WO2023284634A1 (en) | Data processing method and related device | |
CN116737895A (en) | Data processing method and related equipment | |
US20230025626A1 (en) | Method and apparatus for generating process simulation models | |
WO2020205013A1 (en) | Online video editor | |
US11928762B2 (en) | Asynchronous multi-user real-time streaming of web-based image edits using generative adversarial network(s) | |
JP2020027168A (en) | Learning device, learning method, voice synthesis device, voice synthesis method and program | |
CN113948060A (en) | Network training method, data processing method and related equipment | |
US7486295B2 (en) | Pickwalking methods and apparatus | |
US20240062497A1 (en) | Method and system for generating virtual content | |
KR102303626B1 (en) | Method and computing device for generating video data based on a single image | |
CN116152447B (en) | Face modeling method and device, electronic equipment and storage medium |