TW201120834A

TW201120834A - Audio-visual synthesis interaction system, its method, and its computer program product.

Info

Publication number: TW201120834A
Application number: TW98142564A
Authority: TW
Inventors: Ming-Chang Wu
Original assignee: Ming-Chang Wu
Priority date: 2009-12-11
Filing date: 2009-12-11
Publication date: 2011-06-16

Abstract

The present invention discloses an audio-visual synthesis interaction system, its method, and its computer program product, which is suitable for use in the audio-visual synthesis interaction system of electronic paper, story machine, and network, and whose synthesis method includes: at first, receiving and transforming the printed content such as the book or magazine through the receiving end into the figure content recognizable by the disclosed system; subsequently, loading from the database the audio-visual data relative to the above-mentioned figure content, in which the audio-visual data can comprise the audio data, the visual data, and the data for controlling the physical character; finally, combining the visual data and the audio data loaded from the database into the multimedia audio-visual module, and then outputting to the output end to carry out the action of multimedia broadcasting or controlling of the physical teaching material.

Description

201120834 六、發明說明：【發明所屬之技術領域】右發月疋於一種影音合成互動之技術’特別是 =於m合情境别音效、角色別音效、調整音效與時間別視訊、場景別視訊、角色別視訊或—特殊效= 視訊等各種音效及视訊#料的影音合成互動系統、其方法及其電腦程式產品。【先前技術】語音合成是指以人工的方式產生人類語音，語音合成可以用軟體或硬體來加以實現。將文字轉為語音之相關技術，是新一代人機介面中重要的工具之一’可以廣泛應用於視覺障礙者不想麻煩他人，閱讀非點字書籍、車站或機場錄製廣播訊息、公司製作職員研修用資料、或者是錄製電子字典發音資料等不同情境，而其中最常使用的技術應是文字轉語音（text-to-speech，TTS)技術。對人類而言，語言是用於交談、分享和討論非常重要的溝通工具，研究指出（Jensen，1998)就算是在強褓中的嬰兒也可以傾聽語言’即使他們仍無法開口說話，所有這些語言無論有無理解，皆有助於日後學習語法、字彙和意義的發展。習知技藝係有如中華民國新型專利第590288號所揭露的一種「電子有聲書之結構」’此結構係改變傳統書本提供靜態閱讀方式’使其結合硬體與電子電路而具備點選目 201120834 標物產生影音播放之功能，其結構係包括一殼體、一内置於殼體適位之電路模組及天線以及一與電路模組相連結之指標器所構成，藉由指標器點選書本内頁之目標物，該目標物則對應殼體内天線之座標值，進而獲致電路模組擷取記憶體對應座標值之資料，藉由天線產生一射頻訊號，經指標器擷取該訊號後，將訊號回傳送至殼體内部之電路模組，而產生遙控播放影像或聲音播放之功能，俾於使用者點選書本内頁之内容即可同時收聽該内容之影音聲效者。但是，習知的文字轉語音技術大多僅將文字資料轉換為語音效號，其通常以非真人之機械音輸出，即使再根據語意分析以調整語音效號之語調，聽來仍較為呆板且不自然。為了解決輸出呆板且不自然之問題，進而發展出一種真人發音的技術，此技術大多是將存在資料庫内的許多已錄好的真人語音連接起來，其輸出之效果與一般人所發出之語音較為接近，因此較為一般人所能接受。然而，即便曰常生活中，已有許多具有真人發音產品可供使用，但人類於本質上乃為群體動物，與他人相互交談、溝通以及共同進行各種學習活動實為人性本質國内外許多研究指出，親子共讀可增加孩子語言、學習認知能力。透過為孩子講述繪本故事，可以增進親子關係、語言以及學習認知能力。靜態的閱讀、圖形辨認等方式，讓孩子比較專注，比流行的動晝、影片，更能培養孩子認知及注意力發展，透過朗讀的聲音，能適 201120834 當產生共鳴，可良性刺激身心、大腦思緒發展。相關研究亦指出，孩子要成為一個能學且學得好的人，全看能不能讀得好。學習閱讀要花很多時間和精力，每天花一點時間共讀，對孩子幫助很大。然而，很多家長在孩子還是嬰兒時，抱著孩子要親子共讀，嬰兒的反應常常是推開書本、啃書，甚至是不耐大哭，父母帶著共讀的熱情也大減。兒童的注意力不足，沒有辦法將注意力維持在同一件事物上太久，很容易分心、好動，父母單調、沒有變化念書，無法讓孩子把心思放在書本上。而故事中通常具有某些具有聲音特性之字串。習知的技術中，無論採用文字轉語音技術或是真人發音的技術，並沒有將此種聲音特性考慮進來，使得轉換後之語音並不生動，無法有效的引起孩子的注意0 事實上，閱讀書籍的過程中，其實就是一種訊息的傳遞。訊息的傳遞通常可以靠語言及非語言的兩種方式進行，一般而言，我們或比較注意言語的訊息，因為他們直接、易於表達、思考及分析。此外，家長若能善加運用肢體語言，就能讓兒童專心，例如面部表情、手勢、眼神接觸、身體接近與姿勢等非語言行為也不容忽視。因此，家長可透過相關語音轉換技術，將一段故事轉為語音，以降低父母同時需進行語言及非語言所造成的負擔，替父母分擔語言傳遞的部份。因此，有鑒於上述問題，有必要提出一種影音合成 201120834 互動系統、其方法及其電腦程式產品，可自書籍、雜誌或是網路中讀取印刷内容或文件電子檔之文字内容，取出相對應之音效及視訊，轉換成身歷其境之多媒體影音模組，以增加單純閱讀書籍之生動性。【發明内容】有鑑於上述先前技術之問題，本發明之其中一目的就是在提供一種影音合成互動系統、其方法及其電腦程式 φ 產品，透過音效與視訊的合成，所產生之數位内容並可適時搭配實體教材，而得以更生動的方式表現合成後之多媒體影音模組及實體角色的構成。根據本發明之目的，提出一種影音合成互動方法，此合成方法包含，首先，將書籍、雜誌或網頁等之印刷内容或電子檔文件，經由接收端接收轉換為可辨識之文字内容。接下來，自資料庫中載入相對於上述文字内容之多影音資料，此影音資料可包含聲音資料及視訊資 • 料。最後，結合自資料庫載入之視訊資料及聲音資料為多媒體影音模組後輸出至輸出端，進行即時播放或儲存等動作。其中，自接收端接收文字内容時，更可以自接收端接收一互動控制訊號，以依據此互動控制訊號，生成一實體角色控制資料輸出至輸出端，藉以控制至少一實體角色。其中，影音資料中之聲音資料是依據現階段所接收 201120834 之文子内容t ’摘取出關鍵字參數數等參數，來合成與此q 角色參數或調整參色別音效及調整音效其情境別音效、角模組可㈣已合成其且錢體影音慢、語調高低或抑揚頓挫、曰:i大調2出時之速度快調整音效之調敕夾私t曰里大小等音效效果。合成號、嘆詞或助°以是文字内容中出現的標點符其中201120834 VI. Description of the invention: [Technical field of invention] The right-wing moon is used in a technology of interactive audio-visual interaction. In particular, it is a sound effect, a sound effect, a sound effect, a time-sensitive video, and a scene video. The role of video or - special effects = video and other audio and video and audio and video synthesis system, its methods and computer program products. [Prior Art] Speech synthesis refers to the artificial generation of human speech, and speech synthesis can be realized by software or hardware. The technology of converting text into speech is one of the important tools in the new generation of human-machine interface. It can be widely used in visually impaired people who do not want to trouble others, read non-word-making books, record broadcast information at stations or airports, and train staff in company training. Different situations, such as data or audio dictionary pronunciation data, and the most commonly used technology should be text-to-speech (TTS) technology. For humans, language is a very important communication tool for talking, sharing, and discussing. Research indicates (Jensen, 1998) that even infants who are in strong can listen to the language even if they still can't speak, all these languages Whether it is understood or not, it will help to develop grammar, vocabulary and meaning in the future. The Department of Applied Arts has a "structure of electronic audio books" as disclosed in the Republic of China New Patent No. 590288. This structure changes the traditional book to provide a static reading method, which combines hardware and electronic circuits with a point selection 201120834. The standard produces a function of video and audio playback, and the structure comprises a casing, a circuit module and an antenna embedded in the casing, and an index device connected with the circuit module, and the book is selected by the indexer. The object of the inner page, the object corresponds to the coordinate value of the antenna in the housing, and the circuit module obtains the data of the corresponding coordinate value of the memory, and generates an RF signal through the antenna, and the signal is captured by the indicator. The signal is sent back to the circuit module inside the casing, and the function of remotely playing the image or sound playing is generated, so that the user can select the content of the inner page of the book to listen to the audio and sound effect of the content at the same time. However, most of the conventional text-to-speech technology only converts text data into voice ticks, which are usually output by non-real human mechanical sounds. Even if the semantics are adjusted according to semantic analysis, it still sounds dull and not natural. In order to solve the problem of dull and unnatural output, and then develop a real human voice technology, this technology mostly connects many recorded live human voices in the database, and the output effect is compared with that of ordinary people. Close, so more acceptable to the average person. However, even in the ordinary life, there are many products that can be used for human beings, but human beings are essentially group animals. Talking with others, communicating with each other and conducting various learning activities together is a human nature. Many studies at home and abroad point out Parent-child reading can increase children's language and learn cognitive ability. By telling children about storybooks, you can improve parent-child relationships, language, and cognitive learning. Static reading, graphic recognition and other methods, so that children are more focused, more popular than the popular animation, film, can cultivate children's cognitive and attention development, through the sound of reading, can adapt to 201120834 when resonating, can be benign to stimulate the body and mind, brain Thought development. Relevant research also pointed out that children must be able to learn well and learn well. It takes a lot of time and energy to learn to read. It takes a little time to read a day and it helps a lot. However, many parents hold their children to read together when they are still babies. The baby's reaction is often to open books, swearing, and even crying. The enthusiasm of parents to read together is greatly reduced. Children's attention is not enough. There is no way to keep their attention on the same thing for too long. It is easy to be distracted and active. Parents are monotonous and have no change in their studies. They cannot let their children put their thoughts on books. The story usually has some strings with sound characteristics. In the conventional technology, no matter whether the text-to-speech technology or the human voice is used, the sound characteristics are not taken into consideration, so that the converted voice is not vivid and cannot effectively attract the attention of the child. In the process of books, it is actually a kind of message transmission. The transmission of messages can usually be done in both linguistic and non-verbal ways. In general, we may pay more attention to the message of speech because they are direct, easy to express, think and analyze. In addition, if parents can use body language, they can concentrate on children's attention. Non-verbal behaviors such as facial expressions, gestures, eye contact, body proximity and posture can not be ignored. Therefore, parents can use a voice-conversion technology to turn a story into a voice, so that parents can reduce the burden of language and non-language, and share the language transfer for parents. Therefore, in view of the above problems, it is necessary to propose an audio-visual synthesis 201120834 interactive system, its method and its computer program product, which can read the text content of the printed content or the electronic file of the file from a book, a magazine or a network, and take out the corresponding content. The sound effects and video are converted into immersive multimedia audio and video modules to increase the vividness of reading books. SUMMARY OF THE INVENTION In view of the above problems of the prior art, one of the objects of the present invention is to provide an audio-visual synthesis interactive system, a method thereof, and a computer program φ product, which are digital content generated through the synthesis of sound effects and video. With the appropriate textbooks in a timely manner, the composition of the synthesized multimedia modules and physical characters can be expressed in a more vivid way. According to the object of the present invention, an interactive method for video and audio synthesis is proposed. The method comprises the following steps: first, converting a printed content or an electronic file of a book, a magazine or a web page into a recognizable text content via receiving at the receiving end. Next, the multi-audio data relative to the above text content is loaded from the database, and the audio-visual material may include sound data and video materials. Finally, the video data and sound data loaded from the database are output to the output end of the multimedia audio and video module for instant playback or storage. When receiving the text content from the receiving end, an interactive control signal may be received from the receiving end to generate an entity role control data output to the output terminal according to the interactive control signal, so as to control at least one entity role. Among them, the sound data in the audio-visual data is based on the parameters of the text of the 201120834 received at the current stage t 'extracting the keyword parameters, etc., to synthesize the q character parameters or adjust the color sound effects and adjust the sound effect and the context sound effect. The angle module can be (4) synthesized and the money is slow, the tone is high or low, or the swaying is stunned, 曰: the speed of the i major 2 is quickly adjusted to adjust the sound effect, and the sound effect is the same. Synthetic number, interjection, or help is the punctuation character that appears in the text content.

收之中之視訊資料則是依據現階段所去教 * ’ 取4時間參數、場景參數、角色4 二=效果ΐ數等參數，來合成與此些參數相對應; “之!別視訊、角色別視訊及特殊效果視窗、且13。且多媒體影音模組可依據已合成之寺殊效果視訊，調整於輸出時之前景遠景配置、淡出、或速度快慢等視訊效果。 ~當接收端欲接收來自書籍或雜諸等來自實體的印刷内容時’更可於自接收端接收文字内容輸人之步驟中，加入將印仙容朗出其内容中包含之關鍵字參數整參數、時間參數、場景參數、肖色參數或特殊效果參· 數等可辨識之文字内容。此外，當所接收之文字内容未包含時間參數、場景參數、角色參數或特殊效果參數等參數時，本發明可使用一預设的視訊資料結合已合成之聲音資料成為多媒體影音模組後輸出至輸出端。此預設之視訊資料可以是靜止之風景圖片、相片、動態之動晝或者是實體的視覺效 8 201120834 果。為使結合聲音資料與視訊資料所產生之多媒體影音模組可充分利用，更可於多媒體影音模組輸出至輸出端後，予以儲存以便於日後觀賞、編輯等。根據本發明之目的，又提出一種影音合成互動電腦程式產品，當此產品的影音合成互動電腦程式被載入電腦並執行後，可執行本發明之影音合成互動的方法。根據本發明之目的，另提出一種影音合成互動系統， • 包含接收端、資料庫及影音合成互動裝置。接收端用以接收文字内容或互動控制訊號或其組合之輸入。資料庫，電性連接於接收端，儲存有相對於文字内容之影音資料及相對於互動控制訊號之實體角色控制資料。上述影音資料可包含複數個視訊資料及複數個聲音資料。影音合成互動裝置，電性連接於資料庫，自資料庫載入影音資料後，結合視訊資料及聲音資料為多媒體影音模組後，將多媒體影音模組輸出至輸出端，並且可依據該實體角色控制資料，控 • 制至少一個的實體角色。其中，為讀取實體之印刷内容，接收端更可包含光學讀取單元及文字内容識別部。光學讀取單元用以接收印刷内容輸入；文字内容識別部，則是將所接收之印刷内容識別成本系統可識別之文字内容，並擷取出關鍵字參數、調整參數、時間參數、場景參數、角色參數或特殊效果參數等參數。其中，資料庫中儲存有聲音資料與視訊資料，聲音 201120834 為料至少包含相對應於關鍵字參數或調整參數之情境別音效、角色別音效或調整音效；視訊資料則包含相掛於時間參數、場景參數、角色參數或特殊效果間別視訊、場景別視訊、角色別視訊或特殊效果視訊。其中’於接收端與資料庫間更包含一個影音編輯器，主要用以接收文字内容後，依據所接收之文字内容，輸出一基礎影音資料至資料庫内暫存。於此影音編輯器内，更具有一控制器，主要用於接收互動控制訊號，以生成實體角色控制資料後儲存至資料庫。 f中，影音合成互動裝置係包含，記憶單元、音效< 合成單元、視訊合成單元、音效產生單元、視訊產生單元及輸出單元。記憶單元，可暫時儲存自資料庫載入的，訊資料、該聲音資料及實體角色控制資料。音效合成單元與視訊合成單元，各自與記憶單元電性連接，用以合成聲音資料與視訊資料。音效產生單元與視訊產生單元77另】與0效合成單元與視訊合成單元電性連接，用 =產生合，完成之聲音資料與視訊資料。輸出單元，與音效產生，7L及視訊產生單元電性連接，用以輸出結合· 元與視訊產生單元所產生之聲音資料及視訊貝料之多媒體影音模組，且亦可輸出實體角色控制資料 /、中，為即時播放多媒體影音模組，輸裝置。此外輸出端亦可包含儲存裝置，用以儲存夕媒體影音模組，以便日後播放、編輯或複製時 201120834 使用。承上所述’依本發明之影音合成互動线、其方法及其電腦程式產品，其可具有一或多個下述優點： ⑴藉由可辨識之文字内容中掘取出之關鍵數、調整參數、時間參數、場景參數 =綱等產生相對應之包含音效及視訊之多媒= 棋組。 (2)透過音效與視訊的結合，可以更生動的方，原士實體的書籍或雜諸無法表達之臨場效果及說服 L °+成後之媒體影音模組可令使用者更為容易融入文境中，且可解決閱讀純文字，易生枯燥感之 W，可活用於幼兒或兒童之學習可藉以提高學習意願。 > (3)透過快速完成將文字内容合成為包含音效 :的：媒體影，組，於商業應用上，適用於劇場之排 Γ制ί站或機場等廣播訊息之錄製、公司職負或學校學 ^作研修報告資料時；亦或是於個人生活面上，可活閱dir說故事時間、忙碌主婦想要邊進行家事邊閲咳網頁新知，可以充分有效利用時間。 (4)透過快速完成將文字内容合成為包含 =多^影音模組，以及接收互動控制訊號以生成實 :色控制資料藉以控制實體角色，於產業利用上，可匕於輔助舞台監督用以模擬排演效果，藉以節省廣所需成本，以及減短作品製作所需時間。 201120834 【實施方式】请參閱第1圖，其係為本發明之影音合成互動方法之流程圖。圖中’影音合成互動方法包含下列步驟：首先，於步驟S110中，自接收端210接收文字内容 f互=控制訊號，此文字内容可包括實質上具有印刷文子之曰籍或雜誌、手機或網頁上的短篇文章、下載至電子書的小說、亦或是儲存於電腦等電子裝置上之電子播文，等。此互動控制訊號可依據實體角色所採用之控^ 方.而可為射頻識別訊號，紅外線控制訊號或藍芽，號等。其中，接收端210之文字内容可以包含文字二數字、標點符料，且文字内容可以是由單-文字或複數個^字所組成，亦可以固定單位大小的—段文章，例如是單位為20KB的text形式的電子檔，但不以此為限。可以依據實施型態不同’於—次接收整篇小說後，亦或 ί以：態非固定單位大小分段方式，來接收整篇小說後，再進行下一步驟。，下來’於步驟S12G中’自資料庫23G中載入相對於文字内容之影音資料。為因應文字内容，此資料庫230 内儲存有複數個聲音資料232、複數個視訊資料242及複數個實體角色控制資料238。具體實施時，亦可於步驟S110至步驟S120増加下列步驟：步驟S112:判斷接收端21〇所接收之文字内容是否具有關鍵字參數、角色參數或調整參數；步驟S114.取出與關鍵字參數、角色參數或調整參 12 201120834 數相應之情境別音效2321、角色別音效2323或調整音效2325等；步驟S116:進一步判斷接收端210所接收之文字内容是否具有時間參數、場景參數、角色參數或特殊效果參數；步驟S118:取出與時間參數、場景參數、角色參數或特殊效果參數相應之時間別視訊2421、場景別視訊 2423、角色別視訊2425及特殊效果視訊等。 • 在上述實施例中，雖然是先執行判斷是否具有與音效相關參數後，再執行判斷是否具有與視訊相關參數，但不以此為限，可以先執行判斷是否具有視訊相關參數後再執行關於音效相關參數的判斷，亦可以同時間判斷視訊與音效相關參數。於執行完步驟S120後，再於步驟S130中，結合自資料庫230載入之視訊資料242及聲音資料232為多媒體影音模組。為結合自資料庫230載入之複數個聲音資 • 料232及複數個視訊資料242。具體實施時，亦可於步驟S120至步驟S130增加下列步驟：步驟S122 :儲存自資料庫230取出之各種音效，如情境別音效2321、角色別音效2323或調整音效2325 等，以及各種視訊，如時間別視訊2421、場景別視訊 2423、角色別視訊2425及特殊效果視訊2427等；步驟S124 :對所儲存之各種音效，如情境別音效 2321、角色別音效2323或調整音效2325等進行合成； 13 201120834 步驟S126 :對所儲存之各種視訊，如時間別視訊 2421、場景別視訊2423、角色別視訊2425及特殊效果視訊2427等進行合成；步驟S128:將進行過合成之各種音效與各種視訊結合後輸出。於執行完步驟S130後，再於步驟S140中，將多媒體影音模組及實體角色控制資料238輸出至輸出端270。具體實施時，更可於步驟S140後，增加將所產生之多媒體影音模組儲存至輸出端270,上述所例示之影音合成互動方法，雖以接收文字内容的方式來實現，但也可以經由適當的置換接收端210來接收使用者自己本身事先錄製的聲音檔或聲音影像檔的方式來實現，此實施方式可以修飾事先錄製的聲音檔或聲音影像檔中，不完美的聲音大小、品質及效果等缺點，於商業應用面，可於小成本的動晝或宣導短片的後製時，有效地使用本影音合成互動方法，藉以達到節省製作時間、成本及提高所製作之多媒體影音品質，由於本影音合成互動方法更可配合多媒體影音品質與實體角色互動，可模擬舞台效果藉以減少舞台監督所需之排演時間，可使舞台監督之作業更有效率。於上述實施例中，雖然實體角色的控制可由輸出的實體角色控制資料238來加以控制，但是在其他實施方式中，使用者亦可依自己需求，而使用他種控制方法以控制實體角色，例如，當實體角色係為貼付有無線射頻標籤的實體教材時，則使用者可使用無線射頻控制器來另外控制實 201120834 體角色。此影音合成互動方法，更可具體實施為影音合成互動電腦程式產品，當電腦載入該電腦程式並執行後，可於網路系統、單獨的電腦糸统或其他適當的電子裝置上，執行上述影音合成互動方法所述之各步驟。舉例來說，當於網路系統上執行時，可以藉由下載網路社群彼此分享的電子文件檔，經執行下載成可於個人專屬的播放裝置上進行聆聽播放的格式。此外，當於單獨電腦系統上執行時，則可參將手邊的書籍，加以製作成方便攜帶的電腦可讀取媒體的形式，具體來說可以製作成光碟或是以MP3格式儲存在記憶體上，方便隨時想聆聽觀賞時使用。 β 其中，適用於本發明之影音合成互動方法的將文字資料轉換為聲音資料232的技術，可透過所謂的文字轉語土 (text-t〇-speech，TTS)技術、真人發音技術或其他可將二字轉為語音之技術，進行轉換◊且可根據實際上不同的需求於聲θ >料232中擷取出各式各樣不同音效，以角色 •別音效2323為例，即可包含有男聲、女聲、老人聲、穿土聲和卡通人物聲等。里請參閱第2圖，其係為本發明之影音合成互動系方塊示意圖。圖中，影音合成互動系統2〇〇包含，接 210、影音編輯器22〇、資料庫23〇、影音 = 及輸出端270。 ^250 其中，接收端210主要用來接收文字内容與互動制訊號的輸人’於本實施例中，是以光學讀取單元^ 201120834 與文字内容識別部214來實現接收端210，經由光學讀取單元212將印刷文字專換為電性訊號後，再經由文字内容識別部214達到可將書本202上之印刷内容辨識成為包含關鍵字參數、調整參數、時間參數、場景參數、角色參數或特殊效果參數其中之一或其組合之文字内容的目的，但不以此為限，例如，光學讀取單元212可為無線射頻辨識標籤讀取器，用以讀取電子書上的内容，此時，文字内容識別部214更可包含具有解碼功能的無線射頻接收器，以便將儲存於無線射頻辨識標籤中之資料解調成為可令資料庫230理解之具有各種參數的文字内容。因所使用的光學讀取單元212 —次所能讀取的文字内容大小相異，所以接收端210所接收之文字内容，可以依照固定單位大小另外也可以是一次依照使用者期望大小的文字内容。其中，影音編輯器220，設置於接收端210與資料庫 230間，適用以接收文字内容，據以輸出一基礎影音資料至資料庫230。於此影音編輯器220中，更具有控制器 22卜此控制器221可接收互動控制訊號，以生成實體角色控制資料，用以控制實體角色280，此實體角色280 可以是可利用射頻、紅外線或藍芽控制之布偶、塑膠公仔或是紙偶等。 ^ 其中，資料庫230中儲存有複數個聲音資料232與複數個視訊資料242。聲音資料232中包含有情境別音效2321、角色別音 201120834 效2323以及調整音效2325等，但不以此為限，如同電影之所以可以令人感動’乃是因為它所提供的除二度空間的影像外，更以聲音，包含人聲及配樂來製造出三度空間的立體感’所以為使聲音資料232具有更強的空間表現力’可包含情境別音效2321、角色別音效2323、調整曰效2325以及使用者設定之音樂作為配樂以使合成之多媒體影音模組更具影響力。 ^ 舉例來說，當接收端210所輸入之文字内容擷取出「下雨」、「雷聲」、「敲門」、「滴滴答答」和「腳步聲」等關鍵字參數時，可以自資料庫23〇中查詢出與「下雨」苗聲」、敲門」、「滴滴答答」、和「腳步聲」等情境相對應之情境別音效2321。 ¥接收端210所輸入之文字内容擷取出「貓狗」、「蜜蜂」和「老爺爺」等角色參庙士士从•丨.人 r . · - J Θ 貝書苗」、「狗」、「蜜蜂」和「老爺爺料庫230中查詢出與等出現角色相稱的角色別音效2323 調整音效2325主要用以調整达立慢、扭纲古你 „ 门金°口《效旒之速度供凋同低、抑揚頓挫或音量大小等，告接此她所輸入之文字内容擷取屮令宝田接收知2Η 如：「丨」二t Λ 中出現之標點符號例 :呀」、「喔」、「喊」、「嗔」、「嘴」和、;Ί例厂啊」、常用於句尾助詞，例如：「啊」、「呀、「、」等，以及通「喲」、「了」和「嗎」等調整參數時，二哪」、「啦」、音效加上變換速度快慢、注調古乂會將現正播放之慢°。調间低、抑揚頓挫或音量大 17 201120834 小等調整，具體舉例來說當為字内容出現，「啊^下了！」’由於-段短句中’接連出現兩個表現⑼的標點符號，-隸於句首的嘆詞，因此可以將此段文字以較大音量、較高語調及較快速度的方式來呈現。乂另，視訊資料242中包含有時間別視訊2421、場景別視訊2423、角色別視訊2425以及特殊效果視訊Μ” 等，但不以此為限。時間別視訊2421，主要用以呈現不同時間之視訊，，如，當1字内容出現，「早上」、「黃昏」、「凌晨3點」、「二更半夜」、「春」、「夏」、「秋」、「冬」等時間參數時，則可將自資料庫230擷取出相對於此些時間參數之當的燈時間視訊，具體來說當文字出現「春」時，則可使用較明亮耀眼效果的時間視訊，使多媒體影音模組呈現出春天的氣息，相反地當出現「三更半夜」時，則使用較灰暗不明效果的時間視訊，表現出夜色昏暗的氣氛。場景別視訊2423，主要用以表現不同場景之視訊，例如，當自内容擷取出「皇宮」、「外太空」、「森林」、「雪國」、「沙漠」等場景參數時’則可將接下來出現的視訊加上適合此些時間參數之燈光效果，具體來說當文字出現「皇宮」時，則可使用金碧輝煌的宮殿的場景別視訊 2423。角色別視訊2425可與角色別音效2323相配合，當文字内容擷取出「貓」、「狗」、「蜜蜂」和「老爺爺」等角色參數時，則可自資料庫230中查詢出與「貓」、「狗」、 201120834 「蜜蜂」和「老爺爺」等出現角色相稱的角色別視訊 2425。特殊效果視訊2427可以依使用者因需求於文字内容中設定的特殊效果參數，對視訊資料242進行不同處理，包含前景遠景配置、淡入淡出或速度快慢之調整，例如，當文字内容為「很久很久以前，在一座森林中的深處有一座小木屋，……」，可依據使用者所設定之特殊效果參數於森林的場景別視訊2423中，將小木屋的場景別視訊2423處理成遠景。當以卡通角色刷牙劇情來製作保健宣傳短片時，則可於進行刷牙動作時，將卡通角色除嘴巴之外其他五官淡出，藉以突顯嘴巴中原本骯髒的牙齒，經由刷牙的動作，變成乾淨的牙齒的視訊。此外，當欲表現火車快飛的情境時，則可藉由將周圍場景別視訊2423以較快速度播放來表現火車快速移動的感覺。影音合成互動裝置250與資料庫230電性連接，主要可於載入影音資料（視訊資料242及聲音資料232)後，將視訊資料242及聲音資料232結合成為多媒體影音模組後，將其輸出至輸出端270。於本實施例中影音合成互動裝置250可以包含記憶單元252、音效合成單元254、音效產生單元256、視訊合成單元264、視訊產生單元266及輸出單元258等以達成上述機能，但不以此為限。其中，記憶單元252使用以暫時儲存自資料庫230載入的至少一個視訊資料 242、至少一個聲音資料232及至少一個實體角色控制資 201120834 料238使用。接下來當記憶單元252載入至少一個聲音資料232後，即可送至與其電性連接的音效合成單元254 進行合成此些聲音資料232所具有的情境別音效2321、角色別音效2323或調整音效2325其中之一或其組合的合成步驟。另一方面，當記憶單元252載入至少一個視訊資料242後，即可送至與其電性連接的視訊合成單元 264進行此些視訊資料242所具有的時間別視訊2421、場景別視訊2423、角色別視訊2425或特殊效果視訊2427 其中之一或其組合的合成步驟，藉由上述步驟分別將聲音資料232與視訊資料242進行合成，達到使用者所期籲望的效果。於此實施例中，雖然是以分別合成聲音資料 232與視訊資料242的音效合成單元254與視訊合成單元264來實現本發明，但是亦可使用可同時實現音效合成與視訊合成的影音合成互動單元來實現本發明。接下來，於本實施例中，經過合成步驟的聲音資料 232會送至音效產生單元256,以便將經過合成步驟的聲音資料232由數位形式轉變為類比資料的形式，等待輸出。同樣的，經過合成步驟的視訊資料242則會送至視 · 訊產生單元266，以便將經過合成步驟的視訊資料242 由數位形式轉變為類比資料的形式，準備輸出。最後，轉換為類比資料形式的聲音資料232與視訊資料242送至輸出單元258，輸出單元258會依照上述聲音資料232與視訊資料242上所載有的時脈訊號，來決定結合具有同時脈訊號的聲音資料232與視訊資料 242成多媒體影音模組後輸出，且亦可於輸出多媒體影 20 201120834 音模組時’同時搭配自記憶單元252取得至輸出端27〇。再其他實施形態中，; 料V成 z 相_二：後再輸出至輸出端27〇, 相ο 4 ’ $僅具有視訊資料242時設的聲音資料232當作背景音樂似的合成為=2 出端270,如此一來，影音合成互動裝置250的目的就此達成。The video data received is based on the parameters of the current stage, such as 4 time parameters, scene parameters, role 4 2 = effect parameters, etc. to synthesize and correspond to these parameters; "It! Do not video, role Do not video and special effects window, and 13. And the multimedia audio and video module can be adjusted according to the synthesized temple special effect video, and the video effect such as the front view configuration, fade out, or speed can be adjusted when the output is received. When books or miscellaneous printed content from the entity is used, the step of receiving the text content from the receiving end can be added to the key parameters, time parameters, and scene parameters included in the content. The identifiable text content, such as a black color parameter or a special effect parameter, etc. In addition, when the received text content does not include parameters such as a time parameter, a scene parameter, a character parameter, or a special effect parameter, the present invention can use a preset. The video data is combined with the synthesized sound data to be output to the output end of the multimedia audio and video module. The preset video data can be a still landscape. , photos, dynamics, or physical visual effects 8 201120834. In order to make full use of the multimedia audio and video modules generated by combining sound data and video data, it can be output to the output end of the multimedia audio and video module. Storing for later viewing, editing, etc. According to the purpose of the present invention, an interactive audio-visual computer program product is provided, and when the audio-visual composite interactive computer program of the product is loaded into a computer and executed, the audio-visual synthesis interaction of the present invention can be performed. According to the purpose of the present invention, an audio-visual synthesis interactive system is further provided, which comprises: a receiving end, a data base and an audio-video synthesis interactive device. The receiving end is used for receiving input of text content or interactive control signals or a combination thereof. Electrically connected to the receiving end, storing audio-visual data relative to the text content and entity character control data relative to the interactive control signal. The audio-visual material may include a plurality of video data and a plurality of sound data. Connected to the database, after loading video and audio data from the database After combining the video data and the sound data into the multimedia audio and video module, the multimedia audio and video module is output to the output end, and the at least one physical character can be controlled according to the entity role control data. Among them, the printing of the reading entity The content receiving end may further include an optical reading unit and a text content identifying unit. The optical reading unit is configured to receive the printing content input, and the text content identifying unit is configured to identify the received printing content into a text content recognizable by the system. And take out parameters such as keyword parameters, adjustment parameters, time parameters, scene parameters, role parameters or special effect parameters. Among them, the sound data and video data are stored in the database, and the sound 201120834 is at least corresponding to the keyword parameters. Or adjust the parameter's situational sound effect, character sound effect or adjust the sound effect; the video data includes the time parameter, the scene parameter, the character parameter or the special effect video, the scene video, the character video or the special effect video. The device includes a video editor between the receiving end and the database, and is mainly used for receiving the text content, and outputting a basic audio and video data to the database for temporary storage according to the received text content. In the video editor, there is a controller, which is mainly used for receiving the interactive control signal, and generating the entity character control data and storing it in the database. In f, the video and audio synthesis interactive device includes a memory unit, a sound effect < synthesis unit, a video synthesis unit, a sound effect generation unit, a video generation unit, and an output unit. The memory unit temporarily stores the information, the sound data and the entity role control data loaded from the database. The sound synthesis unit and the video synthesis unit are each electrically connected to the memory unit for synthesizing sound data and video data. The sound effect generating unit and the video generating unit 77 are electrically connected to the 0-effect combining unit and the video synthesizing unit, and use the = to generate the combined sound data and the video data. The output unit is electrically connected to the sound generation, the 7L and the video generating unit, and is configured to output the multimedia audio and video module of the sound data and the video material generated by the combined video and video generating unit, and can also output the entity character control data/ , in the middle, for the instant playback of multimedia audio and video modules, transmission devices. In addition, the output terminal may also include a storage device for storing the media media module for later use, playing, editing or copying 201120834. The invention relates to the audio-visual composite interactive line, the method thereof and the computer program product thereof, which may have one or more of the following advantages: (1) key numbers and adjustment parameters extracted by recognizable text content , time parameter, scene parameter = class, etc. corresponding to the multi-media = chess group containing sound effects and video. (2) Through the combination of sound effects and video, it can be more vivid, the books of the original entity or the inconspicuous effects of the original and the media audio and video module that persuaded L °+ can make the user easier to integrate into the text. In the environment, it can solve the problem of reading pure text, easy to be dull, and can be used for learning of children or children to improve the willingness to learn. > (3) Synthesize text content into sound effects by quick completion: media shadow, group, for commercial applications, for theater recording, recording of broadcast messages such as airports or airports, company duties or schools When you study the materials for training, or on your personal life, you can read the dir story time, and the busy housewife wants to read the coughing page while doing family affairs, and can make full use of time. (4) By synthesizing the text content into a containing/multi-audio module, and receiving the interactive control signal to generate real: color control data to control the entity role, in the industrial utilization, can be used to assist the stage supervision to simulate The rehearsal effect, in order to save a wide range of costs, and to reduce the time required for the production of the work. [2011] [Embodiment] Please refer to Fig. 1, which is a flow chart of the interactive method for video and audio synthesis of the present invention. In the figure, the audio-visual synthesis interaction method includes the following steps: First, in step S110, the text content f is received from the receiving end 210: the control signal, the text content may include a book or a magazine, a mobile phone or a webpage substantially having a printed text. Short articles, downloads to e-book novels, or electronic broadcasts stored on electronic devices such as computers. The interactive control signal can be based on the control function of the entity role, and can be a radio frequency identification signal, an infrared control signal or a Bluetooth, and the like. The text content of the receiving end 210 may include two digits of characters and punctuation, and the text content may be composed of a single-word or a plurality of characters, or may be a fixed-unit-segment article, for example, the unit is 20 KB. The electronic file in the form of text, but not limited to this. Depending on the implementation type, after receiving the whole novel, or after receiving the whole novel in a non-fixed unit size segmentation mode, the next step is performed. And, in the step S12G, the video material relative to the text content is loaded from the database 23G. In response to the text, the database 230 stores a plurality of voice data 232, a plurality of video materials 242, and a plurality of entity role control materials 238. In the specific implementation, the following steps may be added to the step S110 to the step S120: Step S112: determining whether the text content received by the receiving end 21〇 has a keyword parameter, a role parameter or an adjustment parameter; Step S114. Extracting the keyword parameter, The role parameter or adjustment parameter 12201120834 corresponding situational sound effect 2321, character-specific sound effect 2323 or adjustment sound effect 2325, etc.; Step S116: further determining whether the text content received by the receiving end 210 has a time parameter, a scene parameter, a character parameter or a special The effect parameter; Step S118: Take out the time-series video 2421, the scene-specific video 2423, the character-specific video 2425, and the special effect video corresponding to the time parameter, the scene parameter, the character parameter or the special effect parameter. In the above embodiment, although it is first performed to determine whether there is a parameter related to the sound effect, and then whether the determination has a video-related parameter, but not limited thereto, it may be performed to determine whether the video-related parameter is performed before executing the The judgment of the sound effect related parameters can also determine the parameters related to the video and sound effects at the same time. After the step S120 is performed, in step S130, the video data 242 and the sound data 232 loaded from the data library 230 are combined into a multimedia audio and video module. In order to combine the plurality of voice resources 232 and a plurality of video materials 242 loaded from the database 230. In the specific implementation, the following steps may be added to the step S120 to the step S130: Step S122: storing various sound effects extracted from the data library 230, such as a situational sound effect 2321, a character-specific sound effect 2323 or an adjustment sound effect 2325, and various video messages, such as Time video 2421, scene video 2423, character video 2425 and special effect video 2427; step S124: synthesize various stored sound effects, such as situational sound effect 2321, character sound effect 2323 or adjustment sound effect 2325; 201120834 Step S126: synthesizing various stored video messages, such as time-in-time video 2421, scene-by-video 2423, character video 2425, and special effect video 2427; step S128: combining various synthesized audio effects with various video signals Output. After the step S130 is performed, the multimedia video module and the entity character control data 238 are output to the output terminal 270 in step S140. In a specific implementation, after the step S140, the generated multimedia audio and video module is added to the output terminal 270. The above-mentioned illustrated video and audio synthesis interaction method is implemented by receiving text content, but may also be appropriately The replacement receiving end 210 is implemented by receiving the sound file or the sound image file recorded by the user itself in advance. This embodiment can modify the sound quality, quality and effect in the previously recorded sound file or sound image file. And other shortcomings, in the commercial application, can effectively use the video and audio synthesis interactive method in the small cost of the dynamic or the post-production of the short film, so as to save production time, cost and improve the quality of the multimedia audio and video produced, due to This video and audio synthesis interactive method can interact with the multimedia audio quality and the physical character, and can simulate the stage effect to reduce the rehearsal time required for the stage supervision, which can make the stage supervision work more efficient. In the above embodiment, although the control of the entity role can be controlled by the output entity role control data 238, in other embodiments, the user can also use other control methods to control the entity role according to his own needs, for example, When the entity role is a physical teaching material with a radio frequency tag attached, the user can use the radio frequency controller to additionally control the real role of the 201120834 body. The video and audio synthesis interactive method can be embodied as an audio-visual composite interactive computer program product. After the computer is loaded into the computer program and executed, the above-mentioned computer system, a separate computer system or other appropriate electronic device can be executed. The steps described in the video synthesis interaction method. For example, when executed on a network system, the electronic file file shared by the online community can be downloaded and downloaded into a format that can be played on a personal playback device. In addition, when executed on a separate computer system, you can refer to the book at hand and make it into a portable computer-readable medium. Specifically, it can be made into a disc or stored in MP3 format on the memory. It is convenient to use whenever you want to listen. β, wherein the technique for converting text data into sound data 232 suitable for the interactive method of video and audio synthesis of the present invention can be performed by a so-called text-t〇-speech (TTS) technique, a human voice technique or the like. Converting the word into a voice technology, converting it, and extracting various sound effects in the sound θ > 232 according to the actual different needs, the character • the sound effect 2323 as an example, may include Male voices, female voices, old voices, earth sounds and cartoon characters. Please refer to FIG. 2, which is a block diagram of the video and audio synthesis interaction system of the present invention. In the figure, the video and audio synthesis interactive system 2 includes, 210, video editor 22, database 23, video = and output 270. ^250, the receiving end 210 is mainly used to receive the input of the text content and the interactive signal. In this embodiment, the optical reading unit ^ 201120834 and the text content identifying unit 214 are used to implement the receiving end 210, via optical reading. The capturing unit 212 converts the printed characters into electrical signals, and then the text content identifying unit 214 can recognize the printed content on the book 202 to include keyword parameters, adjustment parameters, time parameters, scene parameters, role parameters, or The purpose of the text content of one of the special effect parameters or a combination thereof is not limited thereto. For example, the optical reading unit 212 may be a radio frequency identification tag reader for reading the content on the e-book. The text content identification unit 214 may further include a wireless radio frequency receiver having a decoding function to demodulate the data stored in the radio frequency identification tag into text content having various parameters that the database 230 can understand. Because the size of the text content that can be read by the optical reading unit 212 is different, the text content received by the receiving end 210 can be in accordance with the fixed unit size or the text content according to the user's desired size. . The video editor 220 is disposed between the receiving end 210 and the database 230, and is adapted to receive text content, thereby outputting a basic video material to the data library 230. In this video editor 220, there is further provided a controller 22, wherein the controller 221 can receive an interactive control signal to generate an entity role control data for controlling the entity role 280, which can be radio frequency, infrared or Blue-controlled puppets, plastic dolls or paper dolls. ^ The database 230 stores a plurality of voice data 232 and a plurality of video materials 242. The sound data 232 includes the situational sound effect 2321, the character sound 21720834 effect 2323, and the adjustment sound effect 2325, etc., but not limited thereto, just as the movie can be moved 'because it is provided by the second space In addition to the sound, the sound, including the vocals and soundtracks to create a three-dimensional sense of space three 'so that the sound data 232 has a stronger spatial expression' can include contextual sound effects 2321, character sound effects 2323, adjustment 曰Effect 2325 and user-set music as a soundtrack to make the synthesized multimedia module more influential. ^ For example, when the text content input by the receiving end 210 extracts keyword parameters such as "rain", "thunder", "knocking", "tick answer" and "footstep", the data can be self-made. In the 23rd section of the library, the situational sound effect 2321 corresponding to the situation of "raining", "knocking", "drip answer", and "footstep" is found. ¥The text content input by the receiving end 210, and the characters such as "cats and dogs", "bees" and "grandfathers" are taken from the temples, including "丨.人r. · - J Θ Bei Shumiao", "Dog", "Bees" and "Grandfather's Library 230" query the role of the character that is commensurate with the role of the character. 2323 Adjusting the sound effect 2325 is mainly used to adjust the speed of the vertical, the twist of the ancient you „ 门金°口“The speed of the effect With the same low, swaying or volume, etc., the content of the text that she entered is used to obtain the knowledge of Baotian. 2: For example, the punctuation marks appearing in "二" 2t Λ: "Yes", "喔", "Shouting """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" When adjusting the parameters, the two, ",", the sound effect plus the speed of the conversion, the note will be played slowly. Low adjustment, swaying or loud volume 17 201120834 minor adjustment, for example, when the word content appears, "Ah ^ is down!" 'Because the - paragraph phrase in the 'two consecutive punctuation (9) punctuation marks, - The interjection at the beginning of the sentence, so this paragraph can be presented in a larger volume, higher tone and faster. In addition, the video data 242 includes time video 2421, scene video 2423, character video 2425, and special effects video, etc., but not limited thereto. Time video 2421 is mainly used to present different time. Video, if, when a word appears, time parameters such as "morning", "dusk", "3 am", "two midnight", "spring", "summer", "autumn", "winter", etc. Then, the light time video corresponding to the time parameters can be taken out from the database 230. Specifically, when the text appears "spring", the time video of the brighter and brighter effect can be used to make the multimedia video module. It shows the breath of spring. On the contrary, when there is "three nights", the time video with darker and unclear effect is used to show the dark atmosphere of night. Scene Scene 2423 is mainly used to display video of different scenes. For example, when the scene parameters such as "Imperial Palace", "Outer Space", "Forest", "Snow Country" and "Desert" are taken out from the content, then The video that appears next adds the lighting effects that are appropriate for these time parameters. Specifically, when the text appears in the "Imperial Palace", you can use the scene of the magnificent palace 2423. The character video 2425 can be matched with the character sound effect 2323. When the character content is taken out, the character parameters such as "cat", "dog", "bee" and "grandfather" can be queried from the database 230. Cats, "dogs", 201120834 "bees" and "grandfathers" and other roles commensurate with the role of video 2425. The special effect video 2427 can perform different processing on the video data 242 according to the special effect parameters set by the user in the text content, including the foreground vision configuration, fade or fade speed adjustment, for example, when the text content is "long and long time" In the past, there was a log cabin in the depths of a forest, ...", according to the special effect parameters set by the user in the forest scenes of the video 2423, the cabin scenes video 2423 into a distant view. When a cartoon character is used to create a health promotion video, the cartoon character can be faded out of the mouth except for the mouth, so as to highlight the original dirty teeth in the mouth and become clean through the brushing action. Video of the teeth. In addition, when the situation of the train flying fast is to be expressed, the feeling of the train moving quickly can be expressed by playing the surrounding scenes video 2423 at a faster speed. The video-synthesis interactive device 250 is electrically connected to the data library 230, and is mainly configured to load the video data 242 and the sound data 232 into a multimedia audio-visual module after loading the audio-visual data (the video data 242 and the sound data 232), and output the audio-visual data device 232. To the output 270. In this embodiment, the video and audio synthesis interaction device 250 may include a memory unit 252, a sound effect synthesis unit 254, a sound effect generation unit 256, a video synthesis unit 264, a video generation unit 266, and an output unit 258, etc., to achieve the above functions, but not limit. The memory unit 252 is configured to temporarily store at least one video material 242 loaded from the database 230, at least one voice material 232, and at least one entity role control element 238. Then, when the memory unit 252 loads the at least one sound material 232, it can be sent to the sound effect synthesizing unit 254 electrically connected thereto to synthesize the situational sound effect 2321, the character sound effect 2323 or the adjustment sound effect of the sound data 232. A synthetic step of one of 2325 or a combination thereof. On the other hand, when the memory unit 252 loads the at least one video material 242, it can be sent to the video synthesizing unit 264 electrically connected thereto to perform the time-series video 2421, the scene video 2423, and the role of the video data 242. The synthesizing step of one of the video 2425 or the special effect video 2427 or a combination thereof combines the sound data 232 and the video data 242 by the above steps to achieve the effect expected by the user. In this embodiment, although the sound synthesis unit 254 and the video synthesis unit 264 respectively synthesize the sound data 232 and the video material 242 to implement the present invention, an audio-video synthesis interaction unit capable of simultaneously realizing sound synthesis and video synthesis may be used. To implement the invention. Next, in the present embodiment, the sound data 232 subjected to the synthesizing step is sent to the sound effect generating unit 256 to convert the sound material 232 subjected to the synthesizing step into a form of analog data from the digital form, and wait for output. Similarly, the video data 242 subjected to the synthesizing step is sent to the video generating unit 266 to convert the video data 242 subjected to the synthesizing step into a form of analog data from the digital form to prepare for output. Finally, the sound data 232 and the video data 242 converted to the analog data form are sent to the output unit 258, and the output unit 258 determines the combined pulse signal according to the sound signal 232 and the clock signal carried on the video data 242. The sound data 232 and the video data 242 are outputted as a multimedia audio and video module, and can also be obtained from the memory unit 252 to the output terminal 27 when the multimedia image 20 201120834 sound module is output. In other embodiments, the material V is z phase _ two: and then output to the output terminal 27 〇, the phase ο 4 ' $ only has the sound data 232 when the video data 242 is set as the background music is synthesized as = 2 The destination 270, as such, the purpose of the video-synthesis interactive device 250 is achieved.

此外’虽應使用情境需求，而須使用多個實體 :為實體教材使用時’使用者亦可選用其他種控制方式來控制除實體角色28〇以外的實體角色（未圖示），可利 2紅外線、藍芽或無線射頻等控制方式來控制其他實體角色。藉由採用不同控制方式的實體角色，並搭配使用本發明’而得以更為滿足使用者需求。此外，輸出端270可包含顯示裴置274及音效裝置 272分別用以播放影音合成互動裝置25〇所輸出的^媒體影音模組中的聲音資料232與視訊資料242。適用於本實施例的顯示裝置274可以是液晶顯示器、電漿顯示器，電，紙等但不以此為限。適用的音效裝置272則可以是揚聲器或耳機等但不以此為限。此外，於本實施例中，輸出端270更可包含儲存裝置276，此儲存裝置276可以是記憶體、隨身碟、硬^ 或光碟其中之一或其組合，但不以此為限，主要是用以將影音合成互動裝置250所輸出的多媒體影音模組加以In addition, 'While contextual needs should be used, multiple entities must be used: when used as physical textbooks', users can also use other control methods to control entity roles other than physical characters (not shown). Controls such as infrared, Bluetooth, or radio to control other entity roles. The user's needs are more satisfied by using the physical characters of different control modes and using the present invention. In addition, the output terminal 270 can include a display device 274 and a sound effect device 272 for playing the sound data 232 and the video data 242 in the media audio and video module output by the video synthesis interaction device 25, respectively. The display device 274 suitable for the present embodiment may be a liquid crystal display, a plasma display, electricity, paper, etc., but not limited thereto. The applicable sound device 272 can be a speaker or a headphone, etc., but is not limited thereto. In addition, in this embodiment, the output terminal 270 may further include a storage device 276, which may be one of a memory, a flash drive, a hard disk, or a compact disk, or a combination thereof, but not limited thereto, mainly The multimedia audio and video module output by the video and audio synthesis interaction device 250 is used.

21 201120834 儲存，以便於日後剪接、播放、存檔或製作成品時使用。以下，請配合參閱第2圖與第3圖以方便理解下述說明，其中，第3圖為實際實施本發明之影音合成互動系統及其方法之態樣圖。將舉一實際上實施影音合成互動系統200中，音效合成的態樣，以進一步詳細說明本發明之影音合成互動系統200與影音合成互動方法。假設有-段故事A的文字資料為：「外頭正下著大雨，屋頂上滴滴答答地響個不停。突㈣，雷聲大作丨小猶被如其來的聲響嚇了-大跳丨獨自瑟縮在牆說明，以下以，，故事A”簡稱之），若透 *更成互動純，料將單辦板的故事/之=合轉換^含具有各種音效之聲音資料232 :貝^+座謂中預先儲存有各式各樣的與文^ 、貝枓庫21 201120834 Stored for later editing, playback, archiving or production of finished products. Hereinafter, please refer to FIG. 2 and FIG. 3 for convenience of understanding the following description, wherein FIG. 3 is a view showing an aspect of the audio-visual synthesis interactive system and the method thereof for actually implementing the present invention. A mode of sound synthesis in the video-synthesis interactive system 200 will be actually implemented to further explain the interactive method of the audio-video synthesis interactive system 200 and the audio-video synthesis of the present invention. Suppose there is a paragraph-story A's text: "There is heavy rain outside, and the roof is ringing and ticking. The sudden (4), thunder and loudness are scared by the sound of the sound - big flea alone Shrinking in the wall, the following,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, Pre-stored a variety of texts and texts

參數或調整參數相對應的 =字J 色別音效2323或調整音效2325，如文2321、角便敛述，將以<X>表示X的音效，例如：<〇不（為了方音效，即代表風的聲音）^ 風 >代表風之 1參數和對應之音效參數 —1 一 __ 音挝 --------- 風 <風> 雷 <雷> 雨 <雨> 腳步 <腳步> 敲門 <敲門> 22 201120834The parameter or adjustment parameter corresponding to the = word J color sound effect 2323 or the adjustment sound effect 2325, as described in the text 2321, the angle will be condensed, will represent the sound effect of X with <X>, for example: < 〇 not (for square sound, That is, the sound of the wind) ^ Wind > represents the wind 1 parameter and the corresponding sound effect parameters - 1 a __ sounds --------- wind &wind; wind &th; thunder > rain <;rain>footsteps<steps> knocking on the door<knocking> 22 201120834

開門 <開門> 滴滴答答 <滴滴答答> 咬咬噇噇 <咬吱咳喧> 描 <貓> 狗 <狗> 蜜蜂 <蜜蜂> 爺爺 <爺爺> 牛 <牛> 羊 <羊> | <!> ? <?> 啊 <啊> 呀 <呀> 喔 <喔> 首先，接收端210接收故事A的文字資料。文字内容識別部214故事A的文字資料所含有的各種參數以和貪料庫230中的各種音效進行比對後，擷取出的有效參數^下：「雨」、「滴滴答答」、「雷」、「丨」、「貓」和「丨」c 接著，根據參數的擷取結果，由資料庫23〇中杳之聲音資料扣中的有效音效如下 :雷二<!〉、<書苗〉和◊，並將此些音效載入影音合成互早7G 250合成為聲音資料232。影音合成互動單元乃( 私合成單疋254可依據故事A的文字資料藉由真人發曰技術所產生聲音資料331與使用者需求的音效效果 23 201120834 的聲音資料332合成為一聲音資料333後，於音效產生單元256將聲音資料333轉換為類比形式的聲音資料 333後，送至輸出單元258。最後，可將合成後的聲音資料333透過輸出單元258進行輸出，此時亦可依使用者需求使用預設的視訊資料242加以組合成多媒體影音模組後輸出，亦可僅輸出僅包含聲音資料232的多媒體影音模組，且更可於輸出包含多媒體影音模組時，同時搭配自記憶單元252取得的實體角色控制資料238輸出至輸出端270。以下，請配合參閱第2圖與第4圖以方便理解下述說明，其中，第4圖為實際實施本發明之影音合成互動系統200及其方法之另一態樣圖。將舉一實際上實施影音合成互動系統200中，視訊合成的態樣，以進一步詳細說明本發明之影音合成互動系統200與影音合成互動方法。將舉一實際上實施影音合成互動系統中，視訊合成的態樣，以進一步詳細說明本發明之影音合成互動系統 200與影音合成互動方法。假設有一段故事B的文字資料為：「你看到了什麼？」、「是一頂帽子」、「或著是一條吞了大象的大蟒蛇」（為方便說明，以下以”故事B”簡稱之），若透過本發明之影音合成互動系統200,則可將單調平板的故事B的文字資料，轉換為蘊含具有各種視訊之視訊資料242，當然資料庫230中預先儲存有各式各樣的與文字資料中的時間參數、場景參數、角色參數或一特殊效果參數相對應之時間別視訊2421、場景別視訊 24 201120834 角色別視訊2425及特殊效果視訊2427並一 =合二第4圖所示，當聲音資料出現了？你看到了什麼·」’則視訊資料242可以出現第—贿示Open <Open the door> Drip &Answer;Trip &Drop> Bite Biting &<Bite Cough &Cough> Description <Cat> Dog &Dog; Dog &Bee; Bee > Grandpa & Grandfather > cattle <cow> sheep <sheep> | <!> ? <?> Ah <ah> Yeah &Yeah 喔<喔> First, the receiving end 210 receives The text of story A. The text content recognition unit 214 compares the various parameters included in the text data of the story A with the various sound effects in the greedy library 230, and then extracts the valid parameters: "Rain", "Drip Answer", "Ray" "," "丨", "猫", and "丨" c Next, according to the result of the parameter extraction, the effective sound effect in the sound data buckle of the database 23 is as follows: Ray II <!〉, <Book Miao> and ◊, and these sound effects into the video synthesis early 7G 250 synthesized into sound data 232. The audio-synthesis interactive unit is (the private synthesis unit 254 can synthesize the sound data 331 generated by the real hairpin technology and the sound data 332 of the user demand 23 201120834 into a sound data 333 according to the text data of the story A, The sound effect generating unit 256 converts the sound data 333 into the audio data 333 of the analog form, and then sends the sound data 333 to the output unit 258. Finally, the synthesized sound data 333 can be output through the output unit 258, and can also be output according to the user's needs. The preset video data 242 is combined into a multimedia audio and video module for output, and only the multimedia audio and video module including only the sound data 232 can be output, and the multimedia video module can be combined with the self-memory unit 252 when the output includes the multimedia audio and video module. The acquired entity role control data 238 is output to the output terminal 270. Hereinafter, please refer to FIG. 2 and FIG. 4 for convenience of understanding the following description, wherein FIG. 4 is an actual implementation of the video-synthesis interactive system 200 of the present invention and Another aspect of the method will be described in the actual implementation of the video-synthesis interactive system 200, the aspect of video synthesis, The method for interacting with the audio-visual synthesis system of the present invention will be described in detail in one step. The method of video synthesis in the video-synthesis interactive system will be implemented in detail to further explain the video-synthesis interactive system 200 and video-to-speech synthesis of the present invention. Interactive method. Suppose there is a story B's text: "What did you see?", "It's a hat", "Or a big python that swallows an elephant" (for convenience, the following story) B), if the audio-visual interactive system 200 of the present invention is used, the text data of the story B of the monotonous tablet can be converted into video data 242 containing various video, and of course, the data library 230 is pre-stored in various forms. All kinds of time parameters corresponding to time parameters, scene parameters, role parameters or a special effect parameter in the text data 2421, scene video 24 201120834 character video 2425 and special effect video 2427 one and two = 4 As shown in the figure, when the sound data appears? What do you see?", then the video material 242 can appear the first - bribery

=晝音資料配合結合成多媒體影音模組。接下U 二St進行到「是一頂帽子」時，則視訊資料 :巧切換到第二列的帽子的畫面來與聲配象的夫磁π \ 订到或著是一條吞了大象的大蛑蛇」時，則視訊資料242 =子内包含有-頭大象的晝面來與聲音資=! ^多^體影f模組。上述實施例礙於僅能: 動態之動畫。頁際貫施時，可以使用 ^要特別注意的是，上述實施例為方便理解，僅以中文為例’但本發明之寻彡立人士、^ 及其電腦程犬產口 * 動系統2〇〇、其方法二電細程式產，並不以此為限，亦適用於如央文或日文等文字内容’亦可透過本發明之 =動系統、其方法及其電腦程式產品，製作出具；：：境的影音效果之多媒體影音模組。〃身歷其以上所述僅為舉例性，而非為限制性 :本發明之精神與範脅’而對其進 = 更，均應包含於後附之申請專利範圍中。改或良【圖式簡單說明】第1圖係為本發明之影音合成互動方法之流程圖; 25 201120834 第2圖係為本發明之影音合成互動系統之方塊示意圖；以及第3圖係為實際實施本發明之影音合成互動系統及其方法之態樣圖。第4圖係為實際實施本發明之影音合成互動系統及其方法之另一態樣圖。【主要元件符號說明】 S110-S140 :各流程步驟； 200 :影音合成互動系統； 202 :書本； 210 :接收端； 212 :光學讀取單元； 214 :文字内容識別部； 220 :影音編輯器； 221 :控制器； 230 :資料庫； 232、33卜332、333 :聲音資料； 2321 :情境別音效； 2323 :角色別音效； 2325 :調整音效； 238 :實體角色控制資料； 242 :視訊資料； 2421 :時間別視訊； 26 201120834 2423 :場景別視訊； 2425 :角色別視訊； 2427 :特殊效果視訊； 250 :影音合成互動裝置； 252 :記憶單元； 254 :音效合成單元； 256 :音效產生單元； 258 :輸出單元； • 264 :視訊合成單元； 266 :視訊產生單元； 270 :輸出端； 272 :音效裝置； 274 :顯示裝置； 276 :儲存裝置；以及 280 :實體角色。= Voice data is combined into a multimedia audio and video module. When U 2 St is carried out to "is a hat", then the video data: the screen of the hat of the second column is switched to the sound of the sound image, or it is an elephant swallowing. In the case of the big python, the video data 242 = the inside of the child contains the head of the elephant and the voice =! ^ more ^ body shadow f module. The above embodiment is only capable of: dynamic animation. In the case of inter-page, you can use it. It should be noted that the above embodiment is for convenience of understanding. Only Chinese is used as an example. However, the finder of the present invention, ^ and its computer program dog mouth* system 2〇〇其其其其其其其其其其其其其其其其其其其其其其其其其其电电电电电电电电电电电电电电电电电电电电电电:: Multimedia audio and video modules for audio and video effects. The above description is intended to be illustrative only and not limiting, and the spirit and scope of the invention may be included in the scope of the appended claims. Change or good [schematic description] Fig. 1 is a flow chart of the interactive method of audio and video synthesis of the present invention; 25 201120834 Fig. 2 is a block diagram of the interactive system for video and audio synthesis of the present invention; and Fig. 3 is actual An aspect map of an audio-visual synthesis interactive system and method thereof embodying the present invention. Fig. 4 is a view showing another aspect of the interactive system for video and audio synthesis and the method for actually implementing the present invention. [Main component symbol description] S110-S140: each process step; 200: video and audio synthesis interactive system; 202: book; 210: receiving end; 212: optical reading unit; 214: text content recognition unit; 220: video editing device 221: Controller; 230: Database; 232, 33 Bu 332, 333: Sound data; 2321: Situational sound effects; 2323: Character sound effects; 2325: Adjust sound effects; 238: Physical character control data; 242: Video data 2421: Time Video; 26 201120834 2423: Scene Video; 2425: Character Video; 2427: Special Effects Video; 250: Video and Audio Interactive Device; 252: Memory Unit; 254: Sound Synthesis Unit; 256: Sound Generation Unit 258: output unit; • 264: video synthesizing unit; 266: video generating unit; 270: output; 272: audio device; 274: display device; 276: storage device; and 280: physical character.

2727

Claims

201120834 VII. Patent application scope: 1. An interactive method for video and audio synthesis, which comprises the following steps: receiving a text content from a receiving end; loading a video material relative to one of the text contents from a database and the video material is at least One or a combination of the audio data and the video data; the video data and the sound data are combined into a multimedia audio and video module; and the multimedia audio and video module is output to an output. 2. The video and audio synthesis interaction method as described in item i of the patent application scope = the receiving end receives the text material, and further comprises receiving an interactive control signal from the terminal. 3. The video synthesis according to item 2 of the patent application scope further includes according to the interactive control signal, and the data is rotated to the output end to control to the real:: color role control 4 = please 2 range The audio-synthesis interaction described in item 1 or item 2 = the sound data of the audio-visual material is included in the content of the text - the keyword parameter, the parameter is adjusted, and the corresponding situational sound is synthesized. B adjusts and adjusts the sound effect by one of ten or a combination thereof. Xiao color, "Efficacy 5. As stated in the fourth paragraph of the patent application scope, the application for the application of the 申效效效 r ^ ^ ^ ^ ^ ^ ^ 成成成成成 ' ' ' I I I I I 依据依据依据依据依据依据依据', the time the sound data is speed, '古: 曰,,, and the output is adjusted back low, swaying 28 201120834 or volume adjustment. 6. The method for interactive video and audio as described in claim 4, wherein the adjustment parameter comprises punctuation, an interjection or a helper. A. The method for interactive video and audio according to claim 1 or 2, wherein the video data further includes the time parameter, a scene parameter, and a role included in the text content. A parameter or a special effect parameter, which is one of or a combination of one of the scene video, a character video, and a special effect video. 8. The method for interactive video and audio according to item 7 of the patent application scope, which further comprises: configuring, visualizing, fading, or slowing the foreground video data of the video data according to the special effect video. Adjustment. 9. In the video synthesis method as described in item i or item 2 of the patent application, wherein the step of receiving the text content from the receiving end is in the step of inputting

A printed content is identified as containing the keyword content, a = number one time parameter, a scene parameter, a character parameter, or a special effect parameter. The application of the patent range Scope or item 2 of the video = material: the c-body audio and video material is the multimedia audio-visual module β α video Beko and the sound please (4) range i or 2 The audio and video method 'outputs the multimedia audio and video module to the output step 29 201120834 includes storing the multimedia audio and video module to the output terminal. 12 people into an interactive computer program product, when the video and audio of the product; = computer program is loaded into the computer and executed, can complete the application for patents to the first to the πth item - [13]; - an audio-visual synthesis interactive system, comprising: a text content, an interactive control signal-receiving end, is received by receiving or a combination thereof; 'the system is electrically connected to the receiving end, and stores relative to the audio-visual data, The audio-visual material comprises a plurality of video factories, a plurality of sound data and a plurality of physical materials; and (4) a Bayi video and audio synthesis interactive device, which is electrically connected to the database, and the θ > material is combined to combine the video The data and the sound data are a multimedia audio and video module that outputs the multimedia video module to an output, and controls at least one physical character according to the entity role control data. 14. The audio-visual composite interactive system of claim 13 (4), wherein the receiving end further comprises: an optical reading unit that receives a printed content input; and a text content identifying unit that identifies the printed content as including a key The text content of one or a combination of a word parameter, an adjustment parameter, a time parameter, a scene parameter, a character parameter, or a special effect parameter. 30 201120834 15. If the patent application scope item 13 is in the receiving end and the data _ ^ ^ interactive system, receive the text content, according to the wheel 2 3 - video editor, library. The basic audio and video data is sent to the data. 16·If the application is 鄕 _ 15 , the audio and video editor further includes a synthetic interactive system and a signal to generate the interactive role to receive the interactive control. 17 . + The text content is at least a voice-synthesis interactive system, number, time parameter, a key key, and effect parameters. Role parameter or a special 18. As in the patent application, the 15th of which is stored in the database, the interactive interaction system, the keyword parameter, the angle == sound effect, a role sound effect or one of the parameters of the situation 19. If the patent application scope is the μth item, Lu Qi, the database, is stored in the database. At this time, 4 == the information data contains at least the phase parameter - time video, for the news Or a special effect video. The character does not regard the video and audio interactive system as described in the article 13 of the Chinese version of the syllabus, and the singer-synthesis interactive system includes: = unit 'sto temporarily store the sound from the database loading to the evening Poor material and at least one entity role 31 201120834 control data; unit 'electrically connected to the memory unit, synthesizing the unit' electrically connected with the memory unit 'synthesizing: the sound effect generating unit' is electrically connected with the sound effect synthesizing unit , the cover sound synthesis of the sound data;

a video generating unit, the video data synthesizing unit Newson synthesized the video data; and an output unit, and the sound effect generating unit and the video generating unit, the sexual connection 'combines the sounds and the video media audio and video After the module is output to the output, and output to the entity role control data to the output. Xi item, the dynamic system, the Zhuang, Zhongzhong, "the system includes a device for playing the multimedia audio-visual module and a sound effect device."

= Application Specialization (4) 19 items of f-sound synthesis. The output terminal further includes the storage of the multi-video mode, and the device. - 32