TWM652806U - Interactive virtual portrait system - Google Patents

Interactive virtual portrait system

Info

Publication number
TWM652806U
Authority
TW
Taiwan
Prior art keywords
module
facial
text
interactive
server
Application number
TW112213295U
Other languages
Chinese (zh)
Inventor
江秉承
陳錦瑜
張郁婷
Original Assignee
主導文創股份有限公司
Application filed by 主導文創股份有限公司 filed Critical 主導文創股份有限公司
Priority to TW112213295U priority Critical patent/TWM652806U/en
Publication of TWM652806U publication Critical patent/TWM652806U/en

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The interactive virtual portrait system of the present utility model includes a facial recognition module, an artificial intelligence (AI) facial modeling module, a deepfake module and an animation module. The facial recognition module is installed in a photographing device and acquires and processes one or more clear frontal face photos through that device. The AI facial modeling module is installed on at least one server and forms a trained facial model from the frontal face photos. The deepfake module, installed on the server and connected to the AI facial modeling module, synthesizes the facial features of a virtual character from the trained facial model. The animation module, installed on the server and connected to the deepfake module, generates specified actions and expressions for the virtual character. The interactive virtual portrait system produces a realistic and engaging virtual character that can be applied in a wide range of fields, from customer service to interactive education and entertainment.

Description

Interactive virtual portrait system

The present utility model relates to the field of interactive digital media, and in particular to systems for creating and interacting with virtual characters. More specifically, it relates to the use of artificial intelligence (AI) and deep-learning techniques to generate, animate and interact with realistic virtual portrait characters.

The development of virtual characters has long been an important focus of digital media. Traditional approaches rely primarily on manual animation and scripting, which limits how dynamic and interactive these characters can be. With the advent of AI and deep learning, significant progress has been made toward more realistic and responsive virtual characters, yet many challenges remain. Existing systems often struggle to create convincing facial expressions and lip synchronization, especially in real-time interactive settings. Reliance on pre-recorded animations and responses narrows the scope of interaction and makes virtual characters less adaptable to varied and improvised user input. Furthermore, integrating speech recognition and natural language processing to achieve seamless interaction between users and virtual characters has remained a complex task, often resulting in a poor user experience. Deepfake technology has been used in various fields to generate realistic images and videos, but its application in real-time interactive systems is still limited. There is therefore a need for an improved interactive virtual portrait system that addresses these challenges.

The interactive virtual portrait system of the present utility model combines advanced AI facial modeling, deepfake technology for real-time facial synthesis, sophisticated lip-synchronization algorithms, and an intuitive interface for user interaction. The result is a more realistic, more responsive and more engaging virtual character that can be applied in a wide range of fields, from customer service to interactive education and entertainment.

The interactive virtual portrait system includes a facial recognition module, an AI facial modeling module, a deepfake module and an animation module. The facial recognition module is installed in a photographing device and acquires and processes one or more clear frontal face photos through that device. The AI facial modeling module is installed on at least one server and forms a trained facial model from the frontal face photos. The deepfake module, installed on the server and connected to the AI facial modeling module, synthesizes the facial features of a virtual character from the trained facial model. The animation module, installed on the server and connected to the deepfake module, generates specified actions and expressions for the virtual character. The photographing device is communicatively connected to the server.

The present utility model thus provides an advanced interactive virtual portrait system that marks a significant advance in the art of creating, animating and interacting with virtual characters. The system draws on recent developments in artificial intelligence (AI) and deep learning to overcome the limitations of current virtual character technology and to offer new levels of realism, responsiveness and interactivity.

The system uses AI algorithms to process high-quality face photos and create accurate, detailed facial models. This technique handles a variety of facial expressions and features, providing the basis for realistic facial synthesis of the virtual character. By integrating deepfake technology, the facial expressions of the virtual character can be synthesized in real time, which not only enhances the realism of facial movements and expressions but also captures the subtle details that characterize human interaction.

A key aspect of the system is its AI-driven lip-synchronization module, which ensures that the virtual character's lip movements are synchronized with spoken utterances, whether pre-recorded or generated on the fly. This synchronization plays a vital role in the believability and naturalness of the character's speech.

The system is also equipped with a dynamic animation module that lets the virtual character perform specific movements and expressions, making interactions more engaging and realistic. Complementing these capabilities is a flexible user interface that handles various forms of input, including text, voice commands and touch commands, so the system can adapt to a wide range of interaction scenarios and user needs.

An innovative aspect of the system is its real-time interaction and response generation. By integrating speech-to-text (STT) technology, semantic analysis and keyword matching, the system can immediately process and understand user input, and the virtual character can respond appropriately by outputting interactive speech via text-to-speech (TTS) or by triggering relevant pre-recorded videos from an extensive database.

The system is versatile by design, making it suitable for numerous applications such as customer service, virtual assistants, educational tools, entertainment and interactive marketing. Its ability to deliver authentic, responsive interactions makes it a valuable tool in any field that benefits from enhanced user engagement. Its advantages include enhanced realism, thanks to advanced AI and deepfake technology, and increased interactivity, thanks to its ability to understand and respond instantly to a variety of user inputs. Its broad application scope and user-friendly interface further expand its potential markets and use cases. In summary, the interactive virtual portrait system represents a transformative step in the field of virtual characters, offering an unprecedented combination of realism, responsiveness and versatility, and setting new standards for digital interactive experiences.

Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of the interactive virtual portrait system 10 of this embodiment, and FIG. 2 is a perspective schematic diagram of the photographing device 13 communicatively connected to the server 12. The interactive virtual portrait system 10 of this embodiment includes a facial recognition module 131, an artificial intelligence facial modeling module 121, a deepfake module 122, an animation module 123, an AI-based lip synchronization module 124, a user interface 132, a video recording module 125, a database 126, a speech-to-text (STT) module 127, a semantic analysis module 128 and a text-to-speech (TTS) module 129.

The facial recognition module 131 is installed in a photographing device 13, which may be, for example, a smartphone or a tablet computer; this embodiment uses a smartphone as an example (see FIG. 2). The facial recognition module 131 acquires and processes one or more clear frontal face photos 131P through the photographing device 13.
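Purely as an illustration of this capture step, the sketch below shows one way a client-side facial recognition module might verify that a frame contains a single, reasonably sharp frontal face before sending it to the server. It relies on OpenCV's bundled Haar cascade; the sharpness threshold and the function name are assumptions made for the example, not details taken from the patent.

```python
# Minimal sketch (not from the patent): screen a captured frame for one clear frontal face.
import cv2

# Haar cascade for frontal faces shipped with OpenCV.
_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_clear_frontal_photo(image_path: str, sharpness_threshold: float = 100.0) -> bool:
    """Return True if the image contains exactly one frontal face that is not blurry."""
    image = cv2.imread(image_path)
    if image is None:
        return False
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    faces = _CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return False  # require exactly one face in the frame

    x, y, w, h = faces[0]
    face_region = gray[y:y + h, x:x + w]
    # Variance of the Laplacian is a common focus/sharpness heuristic.
    sharpness = cv2.Laplacian(face_region, cv2.CV_64F).var()
    return sharpness >= sharpness_threshold
```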
In addition, the artificial intelligence facial modeling module 121 is installed on a server 12, which as a whole has the form of a rectangular body. Notably, the photographing device 13 is communicatively connected to the server 12, so the artificial intelligence facial modeling module 121 can be connected to the facial recognition module 131 through the server 12 and the photographing device 13. As a result, the artificial intelligence facial modeling module 121 can form a trained facial model 121M from the frontal face photos 131P.

The deepfake module 122 is installed on the server 12 and connected to the artificial intelligence facial modeling module 121; it synthesizes the facial features 122F of a virtual character from the trained facial model 121M. The animation module 123 is also installed on the server 12 and connected to the deepfake module 122; it generates the specified actions and expressions 123E of the virtual character.

Referring to FIG. 1, the AI-based lip synchronization module 124 is likewise installed on the server 12 and connected to the animation module 123; it synchronizes the virtual character's lip movements with spoken utterances. The user interface 132 is provided on the photographing device 13 and allows its user to input a specific content 132C, selected from the group of text, voice commands and touch commands, for the virtual character to speak or perform.

The video recording module 125 is also installed on the server 12 and connected to the AI-based lip synchronization module 124. In this embodiment it captures and records the virtual character's lip movements and spoken utterances to form a recorded video 125V. The database 126 is likewise installed on the server 12 and connected to the video recording module 125; it mainly stores the recorded video 125V and a keyword 126K corresponding to that recorded video.

The speech-to-text (STT) module 127 is also installed on the server 12 and obtains the specific content 132C through the server 12 and the photographing device 13. When the specific content 132C is a voice command, the STT module 127 converts the voice command into a voice text 127W.

Referring again to FIG. 1, the semantic analysis module 128 is connected to the database 126 and the STT module 127; it processes the converted voice text 127W to understand the specific content 132C entered by the user. In this embodiment, the semantic analysis module 128 can also match the specific content 132C entered by the user against the keywords 126K stored in the database 126 and trigger a corresponding response 128R, which is usually a recorded video 125V stored in the database 126.

In addition, the text-to-speech (TTS) module 129 is connected to the database 126 and the semantic analysis module 128; it converts the matched keyword 126K into an interactive voice 129V. In that case, the response 128R triggered by the semantic analysis module 128 becomes the interactive voice 129V instead.
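To make the wiring between these server-side modules easier to follow, here is a minimal structural sketch under assumed names; the classes, methods and the end-to-end call order are illustrative, not an implementation disclosed by the patent.

```python
# Illustrative sketch (assumed names): how the server-side modules of system 10 might be composed.
from dataclasses import dataclass
from typing import Protocol


class FaceModeling(Protocol):        # module 121: builds the trained facial model 121M
    def build_model(self, photos: list[bytes]) -> object: ...

class Deepfake(Protocol):            # module 122: synthesizes the facial features 122F
    def synthesize(self, face_model: object) -> object: ...

class Animation(Protocol):           # module 123: generates actions/expressions 123E
    def animate(self, features: object, action: str) -> object: ...

class LipSync(Protocol):             # module 124: aligns lip movements with speech
    def sync(self, animation: object, audio: bytes) -> object: ...


@dataclass
class Server:                        # server 12 hosts the modules and talks to device 13
    face_modeling: FaceModeling
    deepfake: Deepfake
    animation: Animation
    lip_sync: LipSync

    def render_utterance(self, photos: list[bytes], action: str, audio: bytes):
        """End-to-end path from frontal photos 131P to a lip-synced animated character."""
        model = self.face_modeling.build_model(photos)    # trained facial model 121M
        features = self.deepfake.synthesize(model)        # facial features 122F
        clip = self.animation.animate(features, action)   # specified action/expression 123E
        return self.lip_sync.sync(clip, audio)            # synchronized output
```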
In this embodiment, once the semantic analysis module 128 triggers a response 128R, the server 12 transmits the triggered response 128R to the photographing device 13. Besides displaying the virtual character, the user interface 132 of the photographing device 13 also renders the response 128R in real time. In this way, the interactive virtual portrait system 10 of this embodiment produces a more realistic, more responsive and more engaging virtual character.

The following paragraphs describe the technical features of the interactive virtual portrait system 10 in more detail.

The interactive virtual portrait system 10 of this embodiment is a comprehensive system that integrates various artificial intelligence (AI) technologies to create realistic and responsive virtual characters. The system operates by acquiring and processing clear frontal face photos 131P, which serve as the base data for creating the virtual character's appearance. The quality and clarity of these images are extremely important because they directly affect the accuracy of the virtual character's facial features.

Once clear frontal face photos 131P have been obtained, the system uses AI techniques to train a facial model 121M from these images. This process may involve deep-learning techniques that learn and reproduce the facial features and expressions of the individual in the photos. The system then uses the trained facial model 121M to synthesize the virtual character's facial features 122F, using advanced algorithms and deepfake technology to create a dynamic and convincing replica of the human face.

The system can also generate specific animations or actions for the virtual character. This is achieved by combining predefined animation templates with AI-generated movements consistent with the trained facial model 121M. In addition, the system integrates AI lip-synchronization technology to align the virtual character's lip movements with spoken utterances, enhancing the realism of the interaction.

Users interact with the system by entering specific content 132C for the virtual character to speak or perform. The system is designed to accept various forms of specific content 132C, such as text, voice commands and touch commands. It can record videos of the virtual character speaking or performing actions and store them in the database for future use or streaming, and its user interface 132 presents the virtual character's responses in real time, creating an interactive and engaging experience.

Overall, the interactive virtual portrait system 10 represents a tight integration of AI technologies for creating realistic and responsive virtual characters. Its operation involves a series of interconnected steps and modules, each of which contributes to a lifelike virtual character capable of interacting with the user in real time.

The facial recognition module 131 plays a key role in the initial stage of operation. It is configured to acquire one or more clear frontal face photos 131P, which serve as the base data for creating the virtual character's appearance, and it processes these images in preparation for the subsequent steps. The quality and clarity of the acquired photos are crucial to the operation of the system.
High-resolution, clear images with distinct facial features are preferred because they provide a detailed and accurate representation of the face. The facial recognition module 131 processes these images, focusing on facial features such as the eyes, nose, mouth and other distinguishing characteristics, and may apply various image-processing techniques to improve image quality and isolate the facial features from the rest of the image.

The processed images are then used as the basis for creating the virtual character's facial features. The accuracy and realism of the virtual character's face are directly affected by the quality and clarity of the original photos. Clear, high-quality images allow the person's facial features to be reproduced more precisely, producing a virtual character that closely resembles the person, whereas lower-quality or blurry images may result in a less accurate representation and reduce the character's realism. The facial recognition module 131 therefore plays a foundational role in acquiring and processing the clear frontal face photos 131P, paving the way for the subsequent steps and influencing the accuracy and realism of the virtual character's appearance.

After the facial recognition module 131 has acquired and processed the face photos, the interactive virtual portrait system 10 uses the artificial intelligence facial modeling module 121, which is designed to produce the trained facial model 121M from the provided photos. The module uses deep-learning techniques to learn and reproduce the facial features and expressions of the individual in the photos. The training process involves feeding the processed photos into the system, which analyzes the images, learns the patterns and characteristics of the individual's facial features, and uses this learned information to create a detailed model of the face that captures the nuances of its features and expressions.

The capability of the artificial intelligence facial modeling module 121 is not limited to static facial features; it also extends to dynamic aspects such as facial expressions. By analyzing a series of photos showing the individual's different facial expressions, the module can learn and reproduce these expressions in the virtual character, allowing it to display a range of emotions and reactions similar to the individual's. The artificial intelligence facial modeling module 121 thus contributes to creating a virtual character that not only resembles the individual but also behaves like the individual, and this realism and responsiveness is a defining characteristic of the interactive virtual portrait system 10.
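The patent does not specify a network architecture for the trained facial model 121M. As one plausible sketch, the following trains a small convolutional autoencoder on aligned 64x64 face crops with PyTorch; the layer sizes and hyperparameters are assumptions chosen only to keep the example short.

```python
# Minimal sketch (assumed architecture): train a small convolutional autoencoder on aligned
# 64x64 face crops as one possible realization of the trained facial model 121M.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class FaceAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_face_model(face_crops: torch.Tensor, epochs: int = 50) -> FaceAutoencoder:
    """face_crops: float tensor of shape (N, 3, 64, 64) with values in [0, 1]."""
    model = FaceAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(face_crops), batch_size=16, shuffle=True)
    for _ in range(epochs):
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), batch)   # reconstruct the input face
            loss.backward()
            optimizer.step()
    return model
```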
After the artificial intelligence facial modeling module 121 completes the facial model training, the interactive virtual portrait system 10 uses the deepfake module 122, which is responsible for synthesizing the virtual character's facial features 122F from the trained facial model 121M. The deepfake module 122 uses a specific type of AI technique known as generative adversarial networks (GANs) to generate convincing facial movements and expressions.

Generative adversarial networks are a class of AI algorithms used in unsupervised machine learning, implemented as two neural networks competing against each other in a zero-sum game framework. The technique was proposed by Ian Goodfellow and his colleagues in 2014, and GANs can generate synthetic data that is nearly indistinguishable from real data. In the context of the deepfake module 122, GANs are used to create realistic and dynamic replicas of the individual's face based on the trained facial model 121M.

A GAN involves two neural networks: a generator and a discriminator. The generator creates synthetic data, in this case facial features and expressions, from the trained facial model 121M, while the discriminator evaluates how authentic the generated data looks compared with the original photos. The two networks continuously learn from each other: the generator strives to produce more realistic data, and the discriminator strives to distinguish synthetic data from real data more accurately.

Through this iterative process, the deepfake module 122 can generate facial features and expressions that closely resemble the individual in the photos, covering a range of facial movements from subtle muscle changes to more obvious expressions such as smiles or frowns. This capability lets the virtual character exhibit the dynamism and realism that enhance the interactive experience. In addition, the deepfake module 122 can adjust the virtual character's facial features 122F in real time in response to user input or changes in the virtual environment, which contributes to the interactivity of the system and lets the virtual character engage with users in a more lifelike and believable way. By using GANs to synthesize the facial features 122F from the trained facial model 121M, the deepfake module 122 contributes to a virtual character that not only looks like the individual but also moves and emotes like the individual.
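To make the generator/discriminator interplay concrete, here is a deliberately small GAN training step in PyTorch. The fully connected networks and hyperparameters are stand-ins for illustration; a production deepfake module would use far larger convolutional models.

```python
# Minimal sketch (assumed architecture): the adversarial training loop behind a GAN-style
# deepfake module -- a generator and a discriminator trained against each other.
import torch
import torch.nn as nn

latent_dim, image_dim = 100, 64 * 64 * 3   # flattened 64x64 RGB faces for brevity

generator = nn.Sequential(
    nn.Linear(latent_dim, 512), nn.ReLU(),
    nn.Linear(512, image_dim), nn.Tanh(),          # fake face in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),               # probability the input is real
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_faces: torch.Tensor):
    """real_faces: (batch, image_dim) tensor of real face crops scaled to [-1, 1]."""
    batch = real_faces.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the discriminator to tell real faces from generated ones.
    noise = torch.randn(batch, latent_dim)
    fake_faces = generator(noise).detach()          # freeze the generator for this step
    d_loss = bce(discriminator(real_faces), real_labels) + \
             bce(discriminator(fake_faces), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    noise = torch.randn(batch, latent_dim)
    g_loss = bce(discriminator(generator(noise)), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```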
After the deepfake module 122 has synthesized the virtual character's facial features 122F, the animation module 123 of the interactive virtual portrait system 10 generates the specified actions and expressions for the virtual character. The animation module 123 operates by using a set of predefined animation templates together with AI-generated movements consistent with the trained facial model 121M.

The predefined animation templates serve as the basis for the character's movements and expressions. They may include a variety of common actions and expressions, such as walking, running, waving, smiling and frowning, and they give the animation module 123 a starting point: a set of standard movements and expressions that can be applied to the virtual character. The animation module 123 does not rely solely on these templates, however; it also uses AI-generated movements to create more specific and detailed movements and expressions. These AI-generated movements are also based on the trained facial model 121M and are designed to be consistent with the individual's facial features and expressions. They are created using machine-learning techniques similar to those used in the artificial intelligence facial modeling module 121: the animation module 123 feeds the trained facial model into the machine-learning system, which then generates movements and expressions consistent with the individual's facial features and expressions, and these are applied to the virtual character to enhance its realism and responsiveness.

The animation module 123 is not limited to generating static movements and expressions; it can also adjust them in real time in response to user input or changes in the virtual environment. This dynamic responsiveness contributes to the interactivity of the system and lets the virtual character engage with users in a more lifelike and believable way. By combining predefined animation templates with AI-generated movements to produce the virtual character's specified actions and expressions 123E, the animation module 123 contributes to a virtual character that not only looks and moves like the individual but also behaves consistently with the individual's movements and expressions.

After the animation module 123 has generated the specified actions and expressions, the AI-based lip synchronization module 124 of the interactive virtual portrait system 10 synchronizes the virtual character's lip movements with the spoken utterances. The module operates by analyzing the audio input and adjusting the virtual character's mouth movements accordingly.

The AI-based lip synchronization module 124 uses advanced AI techniques to analyze the audio input, which may be pre-recorded speech or real-time speech input from the user. It breaks the audio down into phonemes, the smallest units of sound that distinguish one word from another in a given language, and then maps these phonemes to the corresponding mouth shapes (visemes), the visual counterparts of the phonemes.

The module then adjusts the virtual character's lip movements to match these mouth shapes, creating the impression that the virtual character is speaking. This process involves the complex interplay of various facial muscles, since the shape and position of the lips, jaw and tongue all contribute to forming different mouth shapes. The AI-based lip synchronization module 124 uses the trained facial model to reproduce these muscle movements accurately, producing realistic lip synchronization that enhances the believability of the virtual character's speech.

The AI-based lip synchronization module 124 is not limited to static lip synchronization; it can also adjust the lip movements in real time in response to changes in the audio input. This dynamic responsiveness contributes to the interactivity of the system and lets the virtual character engage with users in a more lifelike and believable way. For example, if a user enters a new piece of speech, the module can analyze the audio input and adjust the virtual character's lip movements accordingly, creating the impression that the virtual character is saying the new content.
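As a sketch of the phoneme-to-viseme step, the mapping and helper below turn a timed phoneme sequence into mouth-shape keyframes. The phoneme inventory, viseme names and the collapsing rule are assumptions made for the example, not the module's actual tables.

```python
# Minimal sketch (assumed mapping): reduce a phoneme sequence to viseme keyframes for lip sync.
PHONEME_TO_VISEME = {
    "AA": "open", "AE": "open", "AH": "open",
    "IY": "wide", "IH": "wide", "EH": "wide",
    "UW": "round", "OW": "round", "UH": "round",
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "SIL": "rest",
}

def phonemes_to_keyframes(phonemes):
    """Turn a list of (phoneme, duration_seconds) pairs into timed viseme keyframes."""
    keyframes, t = [], 0.0
    for phoneme, duration in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "rest")
        # Collapse consecutive identical visemes so the mouth does not flicker.
        if keyframes and keyframes[-1][1] == viseme:
            t += duration
            continue
        keyframes.append((round(t, 3), viseme))
        t += duration
    return keyframes

# Example: "mama" is roughly M AA M AA
print(phonemes_to_keyframes([("M", 0.1), ("AA", 0.15), ("M", 0.1), ("AA", 0.15)]))
# [(0.0, 'closed'), (0.1, 'open'), (0.25, 'closed'), (0.35, 'open')]
```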
The AI-based lip synchronization module 124 therefore plays a key role in the operation of the interactive virtual portrait system 10. By using advanced AI techniques to synchronize the virtual character's lip movements with spoken utterances, it contributes to a virtual character that not only looks, moves and behaves like the individual but also speaks in a manner consistent with the individual's voice.

After the AI-based lip synchronization module 124 synchronizes the virtual character's lip movements with the spoken utterances, the user interface 132 of the interactive virtual portrait system 10 lets the user of the photographing device 13 enter specific content 132C for the virtual character to speak or perform. The user interface 132 serves as the channel for user interaction, enabling the user to communicate with the virtual character and direct its movements and speech.

The user interface 132 is designed to accept various forms of input, giving the user flexibility and convenience. These input forms may include text, voice commands and touch commands, each offering a different interaction mode suited to different user preferences and usage scenarios.

Text input supplies content for the virtual character to speak or perform. This form of input is particularly suitable for scripted interactions in which the content of the speech or action is predetermined; the system processes the text input and then generates the corresponding speech or action for the virtual character.

Voice commands, on the other hand, offer a more dynamic and interactive input mode. Users can speak directly to the system, and their voice commands are converted into a voice text 127W by the system's speech-to-text (STT) module 127. The system then processes the converted voice text 127W to generate the corresponding speech or action for the virtual character. This form of input allows real-time interaction, letting users converse with the virtual character.

Touch commands provide a tactile input mode that lets users interact with the system through touch gestures. These gestures can be used to direct the virtual character's actions, such as pointing at an object or moving in a particular direction; the system processes the touch commands and then generates the corresponding actions for the virtual character.

The user interface 132 therefore plays a key role in the operation of the interactive virtual portrait system 10. By allowing the user to enter specific content 132C for the virtual character to speak or perform, it facilitates user interaction and enhances the interactivity of the system, and the variety of accepted input forms accommodates different user preferences and usage scenarios.
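A minimal sketch of how the three input forms of the specific content 132C might be routed is shown below; the class and method names, and the server object they call into, are hypothetical and introduced only for illustration.

```python
# Illustrative sketch (assumed names): routing the specific content 132C by input form.
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class TextInput:
    text: str

@dataclass
class VoiceInput:
    audio: bytes            # raw audio to be handed to the STT module 127

@dataclass
class TouchInput:
    gesture: Literal["point", "swipe_left", "swipe_right", "tap"]

SpecificContent = Union[TextInput, VoiceInput, TouchInput]   # specific content 132C

def handle_input(content: SpecificContent, server) -> None:
    """Dispatch one piece of user input to the appropriate server-side processing path."""
    if isinstance(content, TextInput):
        server.speak(content.text)                           # character speaks the scripted text
    elif isinstance(content, VoiceInput):
        voice_text = server.speech_to_text(content.audio)    # STT module 127 -> voice text 127W
        server.respond(voice_text)                           # semantic analysis -> response 128R
    elif isinstance(content, TouchInput):
        server.perform_gesture(content.gesture)              # animation module 123 action
```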
After the virtual character completes its actions and speech, the video recording module 125 of the interactive virtual portrait system 10 captures and records the character's movements and speech. The video recording module 125 operates by encoding the virtual character's movements and speech into a digital video format that can be stored and retrieved for later use. It captures the movements and speech in real time, ensuring that every movement and word is accurately recorded to form a recorded video 125V. The video recording module 125 may use various video-encoding techniques to compress the recorded data, reducing storage requirements while maintaining video quality, and it timestamps each recorded video 125V so that the virtual character's movements and speech can be tracked precisely.

The recorded video 125V is then stored in a database 126, which serves as a repository of the virtual character's movements and speech. The database 126 stores and retrieves these recorded videos 125V, providing a comprehensive record of the virtual character's interactions, and it may be organized in various ways, for example by date, by user or by interaction type, so that recorded videos 125V can be retrieved efficiently.

The database 126 is not a passive storage system. It is designed to be triggered by specific content 132C received through the user interface 132, which may include requests to retrieve, delete or play a particular recorded video 125V. Upon receiving such content, the database 126 retrieves the corresponding recorded video 125V and transmits it to the user interface 132 for processing. For example, if a user requests to view a particular recorded video 125V, the database 126 retrieves the video and passes it to the user interface 132 for playback; if a user requests deletion, the database 126 removes the video from its storage; and if a user requests playback, the database 126 retrieves the video and passes it to the playback module for processing. By capturing and recording the virtual character's movements and speech, and by storing and retrieving these recorded videos 125V, the video recording module 125 and the database 126 contribute to a comprehensive and interactive virtual portrait experience.
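One way the database 126 could associate each recorded video 125V with its keyword 126K and a timestamp is sketched below using Python's built-in sqlite3 module; the table layout and function names are assumptions made for illustration.

```python
# Illustrative sketch (assumed schema): a minimal store for recorded videos 125V and keywords 126K.
import sqlite3
from datetime import datetime, timezone

def open_store(path: str = "recordings.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS recorded_videos (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            keyword     TEXT NOT NULL,            -- keyword 126K that triggers this video
            file_path   TEXT NOT NULL,            -- location of the recorded video 125V
            recorded_at TEXT NOT NULL             -- ISO-8601 timestamp
        )""")
    conn.commit()
    return conn

def store_video(conn: sqlite3.Connection, keyword: str, file_path: str) -> None:
    conn.execute(
        "INSERT INTO recorded_videos (keyword, file_path, recorded_at) VALUES (?, ?, ?)",
        (keyword, file_path, datetime.now(timezone.utc).isoformat()))
    conn.commit()

def video_for_keyword(conn: sqlite3.Connection, keyword: str):
    row = conn.execute(
        "SELECT file_path FROM recorded_videos WHERE keyword = ? ORDER BY recorded_at DESC",
        (keyword,)).fetchone()
    return row[0] if row else None
```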
After user input is received through the user interface 132, the speech-to-text (STT) module 127 interprets the speech input, converting spoken language into text that the system can process. The STT module 127 serves as a bridge between the user and the system, enabling the system to understand and respond to voice commands. It operates by analyzing the audio input, which may be pre-recorded speech or real-time speech input from the user, breaking it down into phonemes, the smallest units of sound that distinguish one word from another in a given language, and converting these phonemes into the corresponding text to create a textual representation of the spoken language.

The STT module 127 uses advanced AI techniques to perform this conversion. It uses machine learning to learn the patterns and characteristics of spoken language so that it can convert speech to text accurately, and it may also incorporate natural language processing to understand the context and semantics of the speech, improving the accuracy of the conversion. The STT module 127 is not limited to static speech-to-text conversion; it can also adjust the conversion process in real time in response to changes in the audio input. This dynamic responsiveness contributes to the interactivity of the system, allowing it to understand and respond to user input promptly. For example, if a user enters a new voice command, the module can analyze the audio input and convert the speech into text so that the system can process the new command.

After the STT module 127 converts the speech into text, the interactive virtual portrait system 10 uses the semantic analysis module 128 to process the converted text and understand the user's input. The semantic analysis module 128 serves as a bridge between user input and system response, enabling the system to understand and respond to user commands.

The semantic analysis module 128 operates by analyzing the converted text, which represents the user's speech. It breaks the text down into words and phrases and then analyzes their semantics, that is, the meaning of the words and phrases in their specific context. By analyzing the semantics of the converted text, the module can understand the user's intent and the context of the command. It uses natural language processing to understand the context and semantics of the converted text, improving the accuracy of the analysis, and it may also incorporate machine learning to learn the patterns and characteristics of language so that it can interpret the user's intent accurately.

The semantic analysis module 128 is not limited to static semantic analysis; it can also match the user's input against the keywords 126K stored in the database 126. The keywords 126K represent specific actions or responses that the system can trigger, so by matching the user's input against them the module can determine the appropriate response to the user's command. For example, if a user enters a command for the virtual character to smile, the module can match the word "smile" against the keyword 126K in the database 126 that triggers the character's smiling action; likewise, if a user asks the virtual character to say a specific phrase, the module can match that phrase against the keyword 126K that triggers the corresponding speech action. By processing the converted text to understand user input and matching it against the stored keywords 126K, the semantic analysis module 128 facilitates user interaction and enhances the interactivity of the system.
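The keyword-matching step can be illustrated with the small sketch below, which normalizes the converted voice text 127W, looks for a stored keyword 126K, and returns either the associated recorded video 125V or a text to be spoken as the interactive voice 129V. The normalization scheme and the policy for choosing between video and TTS are assumptions made for the example.

```python
# Illustrative sketch (assumed policy): pick a response 128R for the converted voice text 127W.
import re

def normalize(text: str) -> set[str]:
    """Lowercase the text and split it into word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def choose_response(voice_text: str, keyword_table: dict[str, dict]):
    """keyword_table maps each keyword 126K to its stored material, e.g. a recorded video path."""
    tokens = normalize(voice_text)
    for keyword, entry in keyword_table.items():
        if normalize(keyword) <= tokens:              # every token of the keyword appears in the input
            if "video" in entry:                      # a recorded video 125V exists for this keyword
                return {"type": "video", "path": entry["video"]}
            return {"type": "tts", "text": keyword}   # otherwise speak it as interactive voice 129V
    return None                                       # no keyword matched; nothing is triggered

# Hypothetical usage:
table = {"smile": {"video": "clips/smile.mp4"}, "hello": {}}
print(choose_response("please smile for me", table))   # {'type': 'video', 'path': 'clips/smile.mp4'}
print(choose_response("well hello there", table))      # {'type': 'tts', 'text': 'hello'}
```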
After the semantic analysis module 128 has analyzed the user's input, the interactive virtual portrait system 10 uses a response 128R mechanism, which answers the user's specific content 132C in the form of text-to-speech (TTS) output or video content streamed from the database 126. This mechanism serves as a bridge between the system's understanding of the user's input and the presentation of an appropriate reply, allowing the system to interact with users in a dynamic and engaging way.

The response mechanism operates by analyzing the results of the semantic analysis module 128, which represent the user's intent and the context of the command, and determining an appropriate response based on that input. The response 128R may be a specific action or utterance of the virtual character, which the system generates and presents to the user.

The mechanism uses advanced AI techniques to generate these responses. It uses text-to-speech (TTS) technology to convert text into spoken language, creating the impression that the virtual character is speaking the user's input. TTS works by analyzing the text, breaking it down into phonemes and then generating the corresponding speech sounds to produce a spoken rendering of the text.

The response 128R mechanism is not limited to static TTS output and video streaming; it can also render the virtual character's response on the user interface 132 in real time. This dynamic presentation contributes to the interactivity of the system and lets the virtual character engage with users in a more lifelike and believable way. For example, if a user asks the virtual character to say a specific phrase, the mechanism can generate the corresponding TTS output and present it immediately on the user interface 132, creating the impression that the virtual character is speaking the phrase. By providing TTS output and video content streamed from the database 126, and by rendering the virtual character's responses on the user interface 132 in real time, the response mechanism facilitates user interaction and enhances the interactivity of the system.
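As a final illustration, the snippet below turns a text response into audible speech with pyttsx3, an off-the-shelf TTS engine chosen only for this example; the patent does not name a particular TTS implementation, so the library and settings here are assumptions.

```python
# Illustrative sketch: rendering a text response as speech (the interactive voice 129V)
# with an off-the-shelf TTS engine; pyttsx3 is used purely as an example.
import pyttsx3

def speak_response(text: str, rate: int = 160) -> None:
    """Synthesize and play the given text as speech."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)   # speaking speed in words per minute
    engine.say(text)
    engine.runAndWait()                # blocks until playback finishes

speak_response("Hello, how can I help you today?")
```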
The interactive virtual portrait system 10 can be used in a variety of applications. One such application is virtual customer service, where the system can create virtual customer-service representatives that interact with customers in real time, answering inquiries, providing information about products or services, and guiding customers through various processes. The system's ability to understand and respond to user input instantly, combined with the realism and dynamism of the virtual character, can make the customer-service experience more engaging and efficient.

The system can also be used in education, where it can create virtual tutors or teachers that interact with students in real time, providing explanations, answering questions and guiding students through learning material. Because the system can immediately understand and respond to user input and present the virtual character's response 128R on the user interface 132 in real time, it can make the learning experience more interactive and engaging.

In summary, through advanced AI facial modeling, deepfake technology for real-time facial synthesis and sophisticated lip-synchronization algorithms, the interactive virtual portrait system 10 of the present utility model can produce a more realistic, more responsive and more engaging virtual character.

Although the present utility model has been disclosed above in terms of preferred embodiments, these are not intended to limit it. Anyone with ordinary skill in the art may make slight changes and modifications without departing from the spirit and scope of the present utility model; the scope of protection shall therefore be defined by the appended claims.

10: interactive virtual portrait system
12: server
121: artificial intelligence facial modeling module
121M: trained facial model
122: deepfake module
122F: facial features of the virtual character
123: animation module
123E: specified actions and expressions of the virtual character
124: AI-based lip synchronization module
125: video recording module
125V: recorded video
126: database
126K: keyword
127: speech-to-text (STT) module
127W: voice text
128: semantic analysis module
128R: response
129: text-to-speech (TTS) module
129V: interactive voice
13: photographing device
131: facial recognition module
131P: frontal face photo
132: user interface
132C: specific content

FIG. 1 is a schematic diagram of the interactive virtual portrait system 10 of this embodiment. FIG. 2 is a perspective schematic diagram of the photographing device 13 communicatively connected to the server 12.


Claims (9)

1. An interactive virtual portrait system, comprising: a facial recognition module, installed in a photographing device, the facial recognition module acquiring and processing one or more clear frontal face photos through the photographing device; an artificial intelligence facial modeling module, installed on at least one server and designed to form a trained facial model from the frontal face photos; a deepfake module, installed on the server and connected to the artificial intelligence facial modeling module, the deepfake module synthesizing the facial features of a virtual character from the trained facial model; and an animation module, installed on the server and connected to the deepfake module, the animation module being used to generate specified actions and expressions of the virtual character; wherein the photographing device is communicatively connected to the server.

2. The interactive virtual portrait system of claim 1, further comprising: an AI-based lip synchronization module, installed on the server and connected to the animation module, the AI-based lip synchronization module being used to synchronize the virtual character's lip movements with spoken utterances; and a user interface, provided on the photographing device, the user interface allowing a user to input a specific content for the virtual character to speak or perform.

3. The interactive virtual portrait system of claim 2, wherein the specific content is selected from the group consisting of text, voice commands and touch commands.

4. The interactive virtual portrait system of claim 3, further comprising: a video recording module, installed on the server, the video recording module being used to capture and record the virtual character's lip movements and spoken utterances to form a recorded video; and a database, installed on the server and connected to the video recording module, the database being used to store the recorded video and a keyword corresponding to the recorded video.

5. The interactive virtual portrait system of claim 4, further comprising: a speech-to-text (STT) module, installed on the server, the STT module being used to convert the voice command into a voice text; and a semantic analysis module, connected to the STT module, the semantic analysis module processing the converted voice text to understand the specific content input by the user.

6. The interactive virtual portrait system of claim 5, wherein the semantic analysis module is configured to match the specific content input by the user against the keyword stored in the database and to trigger a corresponding response.

7. The interactive virtual portrait system of claim 6, wherein the response is the recorded video in the database.

8. The interactive virtual portrait system of claim 6, further comprising a text-to-speech (TTS) module connected to the database and the semantic analysis module, the TTS module being used to convert the keyword into an interactive voice, wherein the response is the interactive voice.

9. The interactive virtual portrait system of claim 7 or claim 8, wherein the user interface is further used to display the virtual character and to present the response in real time.
TW112213295U 2023-12-05 2023-12-05 Interactive virtual portrait system TWM652806U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112213295U TWM652806U (en) 2023-12-05 2023-12-05 Interactive virtual portrait system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112213295U TWM652806U (en) 2023-12-05 2023-12-05 Interactive virtual portrait system

Publications (1)

Publication Number Publication Date
TWM652806U true TWM652806U (en) 2024-03-11

Family

ID=91269007

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112213295U TWM652806U (en) 2023-12-05 2023-12-05 Interactive virtual portrait system

Country Status (1)

Country Link
TW (1) TWM652806U (en)

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
JP6019108B2 (en) Video generation based on text
US11514634B2 (en) Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses
Mattheyses et al. Audiovisual speech synthesis: An overview of the state-of-the-art
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
Cosatto et al. Lifelike talking faces for interactive services
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
JP2014519082A5 (en)
WO2022106654A2 (en) Methods and systems for video translation
US20030163315A1 (en) Method and system for generating caricaturized talking heads
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN116311456A (en) Personalized virtual human expression generating method based on multi-mode interaction information
WO2021232877A1 (en) Method and apparatus for driving virtual human in real time, and electronic device, and medium
TWM652806U (en) Interactive virtual portrait system
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Wolfe et al. Exploring localization for mouthings in sign language avatars
Verma et al. Animating expressive faces across languages
Lin et al. A speech driven talking head system based on a single face image
Luerssen et al. Head x: Customizable audiovisual synthesis for a multi-purpose virtual head
CN110166844A (en) A kind of data processing method and device, a kind of device for data processing
Fernandes et al. A Comprehensive Review and Taxonomy of Audio-Visual Synchronization Techniques for Realistic Speech Animation
Mažonavičiūtė et al. English talking head adaptation for Lithuanian speech animation