TWM621764U - A system for customized speech - Google Patents


Info

Publication number
TWM621764U
Authority
TW
Taiwan
Prior art keywords
voice
text
server
speech
customized
Prior art date
Application number
TW110208867U
Other languages
Chinese (zh)
Inventor
陳冠宇
陳韻茹
張朝智
陳皓文
蔣思霈
許郁欣
Original Assignee
遊戲橘子數位科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 遊戲橘子數位科技股份有限公司
Priority to TW110208867U
Publication of TWM621764U

Abstract

The invention discloses a system for generating customized speech, which lets users enjoy personalized speech for content such as news or stories. The content may be downloaded from a second server. It is imported into a first server via the app, where it is converted into speech by a text-to-speech service. By adjusting the speech to match parameters such as tone or timbre, the user finally obtains customized speech for the content.

Description

Customized voice service system

This utility model relates to a voice service system, in particular a customized voice service system. An application obtains a text and sends it to a server, where text recognition generates first voice data of a simulated character; the user may choose playback as second voice data of the simulated character with customized voice parameters, so that the system provides a customized voice service for broadcasting and dubbing the text.

Technology for converting text into speech already exists, for example the read-aloud function provided by Google Translate.

Most widely used languages in Google Translate have a "read aloud" function. For pluricentric languages, the accent used depends on the region. English: the Americas, Asia-Pacific (except Hong Kong, Malaysia and Singapore) and West Asia mostly use American English (female voice), while most other regions use British English (female voice); however, Australia, New Zealand and Norfolk Island use a distinctive Oceanian accent (female voice), and India uses an Indian accent (female voice). French: Canada uses a Quebec accent (female voice); most other regions use a standard European accent (female voice). Spanish: the Americas (except the United States) use Latin American Spanish (female voice); most other regions use Castilian Spanish (male voice). Standard Chinese: Traditional Chinese uses Guoyu (female voice) and Simplified Chinese uses Putonghua (female voice). Portuguese: Portugal uses its national accent (female voice); most other regions use a São Paulo accent (female voice).

The technology that converts text into speech is called Text-To-Speech, abbreviated TTS. The goal of TTS is to synthesize, from a given text, speech that sounds natural to humans.

Traditionally there are two approaches to TTS: concatenative speech synthesis and parametric speech synthesis.

Concatenative speech synthesis: pre-recorded utterances and their text are organized into a database. When a text to be spoken arrives, suitable examples are selected from the database and spliced into new speech.

However, choosing which units to splice is a major problem. With long units the synthesized voice may sound more natural, but many examples must be stored in the database, which raises memory-space problems and the manpower problem of collecting so many examples, so flexibility is low. With short units flexibility is higher, but text labeling and unit selection become harder, and the synthesized voice sounds less natural.
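Unit selection as just described can be sketched in a few lines. The unit database, the unit names and the toy numeric "waveforms" below are all illustrative assumptions, not real audio:

```python
# Toy sketch of concatenative (unit-selection) synthesis: pre-recorded
# fragments are looked up in a database and spliced in order.
UNIT_DATABASE = {
    "hel": [0.1, 0.3, 0.2],   # lists of floats stand in for audio samples
    "lo": [0.2, 0.1],
    "world": [0.4, 0.2, 0.1],
}

def synthesize_concatenative(units):
    """Splice the pre-recorded fragment of each requested unit into one waveform."""
    waveform = []
    for unit in units:
        if unit not in UNIT_DATABASE:
            raise KeyError(f"no recording for unit {unit!r}")
        waveform.extend(UNIT_DATABASE[unit])
    return waveform

print(synthesize_concatenative(["hel", "lo"]))  # [0.1, 0.3, 0.2, 0.2, 0.1]
```

The memory/flexibility trade-off in the paragraph above corresponds directly to how long the stored fragments are: longer keys mean fewer splice points but a larger database.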

Parametric speech synthesis: a trained model is used to synthesize the sound waveform. Its advantage is that, unlike concatenative synthesis, it does not need a huge database, so it is more flexible and saves manpower. However, past experimental results show that speech synthesized by parametric methods tends to be flat, of lower quality and unnatural.

Parametric synthesis extracts vocoder parameters o = {o1, ..., oN} from the speech signal x = {x1, ..., xT} and linguistic features l from the text W.

During training a generative model (an HMM, a neural network, etc.) learns to produce appropriate vocoder parameters o from the linguistic features l; finally, the vocoder generates the sound waveform from these parameters o.

In the training phase, the model parameters λ are updated:

λ̂ = argmax_λ p(o | l, λ)

In the synthesis phase, the vocoder parameters o are obtained from the linguistic features l and the trained model λ̂:

ô = argmax_o p(o | l, λ̂)

Finally, the vocoder uses this o to synthesize the sound waveform.
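The two argmax stages above can be illustrated with a deliberately tiny model. The linear scoring function and the candidate grids below are illustrative assumptions standing in for a real HMM or neural generative model and its vocoder parameters:

```python
# Training picks the model parameter lambda that best explains (l, o) pairs;
# synthesis picks the vocoder parameters o that the trained model scores highest.

def score(l, o, lam):
    """Higher is better: how well o matches the model's prediction lam * l."""
    return -sum((oi - lam * li) ** 2 for li, oi in zip(l, o))

def train(pairs, candidates):
    """Training phase: argmax over lambda of the total score on the data."""
    return max(candidates, key=lambda lam: sum(score(l, o, lam) for l, o in pairs))

def synthesize(l, lam, candidates):
    """Synthesis phase: argmax over candidate vocoder parameter vectors o."""
    return max(candidates, key=lambda o: score(l, o, lam))

pairs = [([1.0, 2.0], [2.0, 4.0]), ([3.0], [6.0])]        # data where o = 2 * l
lam_hat = train(pairs, candidates=[0.5, 1.0, 2.0, 3.0])
print(lam_hat)                                             # 2.0
print(synthesize([2.0], lam_hat, [[1.0], [4.0], [9.0]]))   # [4.0]
```

A real system searches a continuous parameter space with gradient or EM methods rather than a grid, but the train-then-argmax structure is the same.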

With the development of artificial intelligence and deep learning, Google DeepMind proposed WaveNet in 2016, making synthesized speech more natural and more human-like.

Deep learning: in recent years the development of deep learning has had a major impact on the TTS field. In 2016 Google DeepMind pioneered the WaveNet deep-learning architecture, which achieved great success and entered commercial applications including Google Assistant and Google Translate.

However, current voice broadcast technology focuses mostly on the correctness of the generated speech, or on making generated sentences more natural and human-like; little attention has been paid to generating customized voices matching what the user wants. How to improve the voice broadcast experience and serve the public therefore remains a problem to be overcome by practitioners in this field.

In view of the above problems, this utility model provides a customized voice service system in which an application generates customized speech matching ordered parameters, so that users can broadcast specific events with their favorite customized voice, greatly improving the user experience.

One object of this utility model is to provide a customized voice service system in which a text recognition module generates voice data of a simulated character from a text obtained by an application, and then adjusts it into customized speech matching the ordered voice parameters, so that the text is dubbed with a customized voice. Compared with prior-art text-broadcast technology, this custom-voice dubbing service increases the user's freedom of voice choice in the broadcast service.

For the above purpose, this utility model discloses a customized voice service system: when a user selects a text, the system broadcasts the text with a customized voice. The system comprises a first server, a second server and a client device. The first server contains a text recognition module that converts text into speech using text-to-speech (TTS) technology; the client device is communicatively connected to the first server and the second server; the second server contains a database storing texts. The client device uses an application to connect to the first server through a network and execute an event; according to the event, the application connects to the second server to obtain the text and sends it to the first server; the first server receives the text and uses the text recognition module to convert it into first voice data of a simulated character; the client device then selects either the first voice data or second voice data of the simulated character for playback, wherein the first voice data has a preset voice parameter different from that of the second voice data. The system can thus generate, through the server, a customized voice service matching ordered parameters to broadcast the input text, improving the user's experience of voice broadcast services.

In one embodiment, the voice parameter includes a timbre, or a timbre and a tone.

This utility model further discloses a customized voice service system: when a user inputs a user voice, the system plays the voice back as a customized voice. The system comprises a first server and a client device. The first server contains a speech recognition module that recognizes speech using speech-to-text (STT) recognition; the client device is communicatively connected to the first server. The client device uses an application to connect to the first server through a network and execute an event; according to the event, the application sends the user voice to the first server; the first server receives the user voice and uses the speech recognition module to convert it into first voice data of a simulated character; the client device then selects either the first voice data or second voice data of the simulated character for playback, wherein the first voice data has a preset voice parameter different from that of the second voice data. The system can thus recognize the input voice through the server, generate customized speech matching ordered parameters and play it, improving the user's experience of voice broadcast services.

In another embodiment, the voice parameter includes a timbre, or a timbre and a tone.

1, 2: customized voice service system

10: first server

102: text recognition module

104: speech recognition module

12: simulated character

122: first voice data

124: second voice data

14: voice parameter

142: timbre

144: tone

20: second server

22: database

30: client device

32: text

34: event

36: user voice

40: network

APP: application

S10-S26: steps

Figure 1: a block diagram of an embodiment of this utility model; Figure 2: a flowchart of steps S10-S18 of an embodiment; Figure 3: a block diagram of another embodiment; and Figure 4: a flowchart of steps S20-S26 of another embodiment.

To give the examiners a fuller understanding of the features and effects of this utility model, preferred embodiments together with detailed descriptions are provided as follows:

Conventional text-broadcast methods link speech with text to generate voice data for reading the text aloud. The customized voice service system provided by this utility model additionally offers a user-definable customized voice service, so that the text can be broadcast with a customized voice, providing a voice broadcast service with a better user experience.

Hereinafter, various embodiments of this utility model are described in detail with reference to the drawings. The inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein.

First, refer to Figure 1, a block diagram of an embodiment of this utility model. As shown, the customized voice service system 1 comprises a first server 10, a second server 20 and a client device 30. An example of the service it provides: when a user selects a text 32, system 1 broadcasts the text 32 with a customized voice.

Here, the first server 10 contains a text recognition module 102 that converts text into speech using text-to-speech (TTS) technology. TTS is a known technology that synthesizes speech material with acoustic features into speech matching the given text or sentences; its implementation was introduced in the prior-art section and is not repeated here.

The client device 30 is communicatively connected to the first server 10 and the second server 20; the communication connection may use any wired or wireless transmission technology for transmitting messages, such as the currently known wireless technologies CDMA, 3G, 4G, LTE, Wi-Fi, WiMAX, WWAN, WLAN, WPAN, Bluetooth and so on.

An application APP is installed on the client device 30. The application APP may be a mobile application (mobile app) or a web application (web app). The client device 30 may run a mobile operating system (Mobile OS), such as Android or iOS, and use it to execute the application APP and interact with the corresponding server.

The second server 20 contains a database 22 storing a text 32. The second server 20 can store all kinds of files (e.g., documents, pictures, or combinations thereof). It may be a cloud storage platform, a social network platform, a browser-accessible service or an instant messaging platform: cloud storage such as Google Drive, Dropbox, SugarSync or SkyDrive; social networks such as Facebook, Google+, Twitter or Weibo; browsers such as Chrome or Safari; instant messaging such as QQ, Skype, WeChat, WhatsApp or LINE. Text 32 refers to text, the written form of language, a document type, or any written material; from a literary point of view, it is usually a sentence or a combination of sentences carrying a complete, systematic message. A text can be a sentence, a paragraph or a discourse.

Please refer to Figures 1 and 2 together. Figure 2 is a flowchart of the method by which the customized voice service system performs the customized voice service.

Step S10: the client device uses the application to connect to the first server via a network and execute an event.

Step S12: according to the event, the application connects to the second server to obtain the text.

Step S14: the application sends the text to the first server.

Step S16: the first server receives the text and uses the text recognition module to convert it into first voice data of a simulated character.

Step S18: the client device selects the first voice data or second voice data of the simulated character for playback, where the first voice data has a preset voice parameter different from that of the second voice data. The system can thus generate, through the server, a customized voice service matching ordered parameters to broadcast the input text, improving the user's experience of voice broadcast services.

As in step S10, the client device 30 uses the application APP to connect to the first server 10 via the network 40 and execute an event 34, i.e., the customized voice service of broadcasting a text with a customized voice.

As in step S12, according to the event 34, the application APP connects through the network 40 to the database 22 of the second server 20 to obtain the text 32. Here the text 32 stored in the database 22 of the second server 20 is an example; the text 32 may also be stored on the client device 30, in which case the application APP does not need to connect to the second server 20 to obtain it, though the system is not limited to either arrangement.

As in step S14, the application APP sends the text 32 over the network 40 to the first server 10.

As in step S16, the first server 10 receives the text 32, and the text recognition module 102 applies TTS to convert the text 32 into first voice data 122 of the simulated character 12, generated from the combination of phonetic elements or pinyin corresponding to the text 32. The first voice data 122 is voice data with speech attributes corresponding to the text 32, i.e., it can be played, and it has preset voice parameters 14. A voice parameter 14 is a property of the sound wave constituting the speech, for example loudness, i.e., the amplitude of the wave; tone, i.e., its frequency; and timbre, i.e., its waveform. Generated voice data necessarily has such preset attributes, i.e., voice parameters 14.
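Since a voice parameter is characterized here by amplitude (loudness), frequency (tone) and waveform (timbre), a minimal sketch of producing a tone from those three quantities may help. The sample rate, the harmonic weights standing in for timbre, and the function name are assumptions for illustration, not the patent's implementation:

```python
# Generate a short tone from the three voice parameters the text describes:
# amplitude (loudness), frequency (tone), and a crude "timbre" as harmonic weights.
import math

def generate(amplitude, frequency, seconds=0.01, rate=8000, harmonics=(1.0,)):
    """Return a list of samples; `harmonics` weights shape the waveform (timbre)."""
    samples = []
    for n in range(int(seconds * rate)):
        t = n / rate
        s = sum(w * math.sin(2 * math.pi * frequency * (i + 1) * t)
                for i, w in enumerate(harmonics))
        samples.append(amplitude * s)
    return samples

default_voice = generate(amplitude=0.5, frequency=220)
custom_voice = generate(amplitude=0.5, frequency=220, harmonics=(1.0, 0.3))
print(len(default_voice), default_voice[0])  # 80 0.0
```

Changing only `harmonics` leaves loudness and pitch alone but alters the waveform, which is exactly the timbre adjustment the following step relies on.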

As in step S18, the client device 30 may select the first voice data 122, or play second voice data 124 of the simulated character 12, whose voice parameters 14 differ from the preset voice parameters 14 of the first voice data 122. The user can thus obtain a customized voice matching the ordered requirements by adjusting the voice parameters 14, i.e., adjusting a timbre 142 of the first voice data 122, or both the timbre 142 and a tone 144.
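The interaction of steps S10-S18 between the application, the two servers and the playback choice can be sketched as plain objects. All class and method names, and the dictionary standing in for voice data, are assumptions for illustration rather than the patent's actual implementation:

```python
# Stand-in objects for the S10-S18 flow; no real network calls are made.

class SecondServer:
    """Holds the text database (elements 20/22)."""
    def __init__(self, texts):
        self.database = texts
    def get_text(self, key):
        return self.database[key]          # S12: fetch the text

class FirstServer:
    """Text recognition module / TTS (elements 10/102)."""
    def text_to_speech(self, text):
        # stand-in for TTS: "voice data" with preset default parameters
        return {"text": text, "timbre": "default", "tone": "default"}

class ClientApp:
    """The application APP on the client device (element 30)."""
    def __init__(self, first, second):
        self.first, self.second = first, second
    def run_event(self, key, custom=None):
        text = self.second.get_text(key)            # S12
        first_voice = self.first.text_to_speech(text)  # S14-S16
        if custom:                                   # S18: customized playback
            return dict(first_voice, **custom)       # second voice data
        return first_voice                           # first voice data

app = ClientApp(FirstServer(), SecondServer({"news": "Today's headline"}))
print(app.run_event("news", custom={"timbre": "anchor", "tone": "low"}))
```

The `custom` dictionary plays the role of the ordered voice parameters 14: when present, playback uses the second voice data with different timbre/tone.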

This utility model further provides a customized voice service system; refer to Figure 3, a block diagram of an embodiment. As shown, the customized voice service system 2 comprises the first server 10 and the client device 30. An example of the service it provides: when a user inputs a user voice 36, system 2 plays the user voice 36 back as a customized voice. Inputting the user voice 36 here means the user enters spoken words into the client device 30, either by uploading a digital file or by recording with the client device 30. If the user voice 36 is cached on the client device 30 as a file, the client device 30 should have a database 22 for storing files.

Here, the first server 10 contains a speech recognition module 104 that recognizes speech as text using speech-to-text (STT) recognition. STT is the counterpart of TTS: it uses algorithms to convert speech content into the corresponding text. In one implementation, after human speech is input into the computer system, a digital signal processor first decomposes the sound into different frequency bands; these bands are analyzed by an expert-system program in the computer to identify each sound segment, called a phoneme; further programs then combine the phonemes into words; a grammatical knowledge base judges whether the grammatical relations of each word match human usage; and finally the whole result is output on the screen. In short, the sound features are compared against a database and the speech content is converted into likely text.
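The pipeline just described (frequency bands → phonemes → words) can be sketched with toy lookup tables. The band-to-phoneme map and the one-entry lexicon are illustrative assumptions, not a real recognizer:

```python
# Toy walk-through of the STT stages: signal features -> phonemes -> word.
BAND_TO_PHONEME = {"low": "h", "mid": "e", "high": "lo"}   # fake "expert system"
LEXICON = {("h", "e", "lo"): "hello"}                      # fake knowledge base

def recognize(bands):
    """Map analyzed frequency bands to phonemes, then phonemes to a word."""
    phonemes = tuple(BAND_TO_PHONEME[b] for b in bands)    # DSP + analysis stage
    word = LEXICON.get(phonemes)                           # phoneme combination
    if word is None:
        raise ValueError("no word matches the phoneme sequence")
    return word

print(recognize(["low", "mid", "high"]))  # hello
```

A real STT system replaces both tables with statistical acoustic and language models, but the stage ordering matches the paragraph above.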

The speech recognition module 104 is used here to recognize the user voice 36 as text rather than to adjust the user voice 36 directly, for two reasons. First, STT can decompose the sound into several segments and make acoustic adjustments to each segment separately, yielding more accurate results. Second, the result of recognizing the user voice 36 as text can be further exploited in combination with voice-control technology.

The client device 30 is communicatively connected to the first server 10; the communication connection may use any wired or wireless transmission technology for transmitting messages, such as CDMA, 3G, 4G, LTE, Wi-Fi, WiMAX, WWAN, WLAN, WPAN, Bluetooth and so on.

An application APP is installed on the client device 30. It may be a mobile application or a web application. The client device 30 may run a mobile operating system (Mobile OS), such as Android or iOS, and use it to execute the application APP and interact with the corresponding server.

Please refer to Figures 3 and 4 together. Figure 4 is a flowchart of the method by which the customized voice service system performs the customized voice service.

Step S20: the client device uses the application to connect to the first server via the network and execute an event.

Step S22: the application sends the user voice to the first server.

Step S24: the first server receives the user voice and uses the speech recognition module to convert it into first voice data of a simulated character.

Step S26: the client device selects the first voice data or second voice data of the simulated character for playback, where the first voice data has a preset voice parameter different from that of the second voice data. The system can thus recognize the input voice through the server, generate customized speech matching ordered parameters and play it, improving the user's experience of voice broadcast services.
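Steps S20-S26 can be sketched in the same spirit. The dictionaries standing in for the user voice and the voice data, and the function names, are assumptions for illustration:

```python
# Stand-in flow for S20-S26: the client sends a recorded user voice to the
# first server, which converts it into first voice data of a simulated
# character; the client may instead play second voice data with
# different parameters.

def server_convert(user_voice):
    """S24: STT + re-synthesis; parameters default to the user's own voice."""
    return {"content": user_voice["content"],
            "timbre": user_voice["timbre"],
            "tone": user_voice["tone"]}

def client_play(first_voice, custom_params=None):
    """S26: play the first voice data, or second voice data with custom parameters."""
    if custom_params:
        return dict(first_voice, **custom_params)
    return first_voice

user_voice = {"content": "good morning", "timbre": "user", "tone": "mid"}
first = server_convert(user_voice)                                   # S22-S24
second = client_play(first, {"timbre": "cartoon", "tone": "high"})   # S26
print(second)
```

Here the first voice data keeps the user's own timbre and tone, matching the description below that the preset parameters correspond to the user's original voice.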

As in step S20, the client device 30 uses the application APP to connect to the first server 10 via the network 40 and execute an event 34, i.e., the customized voice service of changing the voice of the input speech.

As in step S22, the application APP sends the user voice 36 over the network 40 to the first server 10.

As in step S24, the first server 10 receives the user voice 36, and the speech recognition module 104 applies STT to convert the user voice 36 into first voice data 122 of the simulated character 12, reproduced by decomposing the user voice 36 into sound segments. The first voice data 122 is voice data with speech attributes corresponding to the user voice 36, i.e., it can be played, and it has preset voice parameters 14. For example, the acoustic features of the voice parameters 14, such as tone and timbre, may correspond to the user's own acoustic features, i.e., the user's original voice, but are not limited to this.

As in step S26, the client device 30 may select the first voice data 122, or play second voice data 124 of the simulated character 12, whose voice parameters 14 differ from the preset voice parameters 14 of the first voice data 122. The user can thus obtain a customized voice matching the ordered requirements by adjusting the voice parameters 14, i.e., adjusting the timbre 142 of the first voice data 122, or both the timbre 142 and the tone 144.

Some specific embodiments of this utility model are provided here as references for implementation, though the model is not limited to them:

In this embodiment, the client device 30 may be a mobile phone. The application APP is launched on the client device 30 and connects to the first server 10 over the network 40 to execute an event 34, the event 34 being a customized voice broadcast service. In this example the user manually launches the APP and activates the read-aloud function, although the user could equally activate the read-aloud function by voice command; the invention is not limited in this respect. Likewise, this example performs the customized voice broadcast service on a news page, but the service could also be applied to an article, or to a game walkthrough that the user asks the APP to read aloud in-game; the invention is not limited in this respect either. First, the APP connects over the network 40 to the second server 20 to obtain the text 32; in this example the text 32 is the textual content (text, e.g., a .txt file) of a news webpage (e.g., an .html file), though it is not limited to this. The client device 30 transmits the text 32 over the network 40 to the first server 10, whose text recognition module 102 uses text-to-speech (TTS) technology to convert the text 32 into first voice data 122 rendered in the voice of a corresponding simulated character 12; multiple simulated characters 12 may be offered for the user to choose from. The first voice data 122 is played back with its preset voice parameters 14, the voice parameters 14 comprising a timbre 142, or a timbre 142 and a tone 144. The client device 30 may instead select second voice data 124 of the simulated character 12 for playback, where the voice parameters 14 of the second voice data 124 differ from the voice parameters 14 of the first voice data 122.
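The flow of this embodiment can be sketched end to end. The following is a minimal simulation, not the patented implementation: the helper names (`fetch_text`, `synthesize`, `customize`), the default parameter values, and the treatment of synthesis as a data record are all assumptions made for illustration, since the patent does not specify an API.

```python
from dataclasses import dataclass, replace
from html.parser import HTMLParser

@dataclass(frozen=True)
class VoiceParams:
    timbre: str  # corresponds to timbre 142
    tone: str    # corresponds to tone 144

@dataclass(frozen=True)
class SpeechData:
    character: str  # the simulated character 12
    text: str
    params: VoiceParams

class _TextExtractor(HTMLParser):
    """Collects visible text, standing in for the APP pulling text 32
    from a news webpage on the second server 20."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def fetch_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Hypothetical default parameters; the patent only says a preset exists.
DEFAULT_PARAMS = VoiceParams(timbre="neutral", tone="medium")

def synthesize(text: str, character: str,
               params: VoiceParams = DEFAULT_PARAMS) -> SpeechData:
    """Stands in for the text recognition module 102 running TTS on the
    first server 10; a real system would return audio, not a record."""
    return SpeechData(character=character, text=text, params=params)

def customize(first: SpeechData, **overrides) -> SpeechData:
    """Derives the second voice data 124 from the first voice data 122 by
    changing only the voice parameters 14, leaving text and character intact."""
    return replace(first, params=replace(first.params, **overrides))
```

Usage under these assumptions: `fetch_text` plays the role of obtaining the text 32, `synthesize` produces the first voice data 122, and `customize(first, timbre="warm")` yields second voice data 124 that differs from the first only in its parameters.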

The present invention further provides a specific embodiment. The client device 30 here may be a mobile phone or a personal computer, and another application may launch the APP on the client device 30 and connect to the first server 10 that provides the service. In this example, a game program launches the APP to provide dubbing for game characters and executes an event 34, the event 34 being a customized voice broadcast service; the example has the program run the service to supply different dialogue content for different scenes (e.g., different game characters), though it is not limited to this. The remaining steps are as described above and are not repeated. This embodiment is characterized in that different dialogue content in different scenes uses different voice parameter 14 settings; in this example the voice parameters 14 are supplied by the other application, and the text 32, being the character's dubbing lines, is also supplied by the other application, though neither is limited to this. Through the customized voice broadcast system, another application can thus provide different dialogue content for different scenes, optimizing the utilization of file storage capacity.
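The dubbing arrangement described above, where the calling game supplies both the lines and the per-scene voice parameters, can be sketched as a simple lookup. This is a hypothetical sketch: the scene names, parameter values, and the `dub_line` helper are invented for illustration and do not appear in the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceParams:
    timbre: str  # timbre 142
    tone: str    # tone 144

# Hypothetical per-scene settings supplied by the calling game program,
# not by the voice service system itself.
SCENE_VOICES = {
    "village_elder": VoiceParams(timbre="gravelly", tone="low"),
    "shopkeeper":    VoiceParams(timbre="bright",   tone="high"),
}

def dub_line(scene: str, line: str) -> dict:
    """Pairs a dubbing line (the text 32) with that scene's voice
    parameters 14, so the game ships only text and a small parameter
    record per scene rather than prerecorded audio."""
    params = SCENE_VOICES[scene]
    return {"scene": scene, "text": line,
            "timbre": params.timbre, "tone": params.tone}
```

Because each scene stores only a short text plus two parameter fields, the same TTS back end can voice every character, which is one plausible reading of the embodiment's claim about optimizing file-capacity utilization.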

The present invention provides another specific embodiment; refer to FIG. 4. The client device 30 here may be a mobile phone or a personal computer, but is not limited thereto. The APP is launched on the client device 30 and connects over the network 40 to the first server 10 of the APP's service provider to execute an event 34, the event 34 being a customized voice service; this example illustrates the customized voice service with the user reading aloud the content to be sent and having the voice transformed, though the service is not limited to this. The APP takes as the user speech 36 a file of the content read aloud by the user (e.g., an .mp3 or .aac file; the APP can also provide a real-time recording function), though the user speech 36 is not limited to this. The client device 30 transmits the user speech 36 to the first server 10, whose speech recognition module 104 uses speech-to-text (STT) recognition to generate first voice data 122 corresponding to the user speech 36 and associated with a simulated character 12; in this example the user corresponds to the simulated character 12, though this is not required. The first voice data 122 is played back with its preset voice parameters 14, the voice parameters 14 comprising a timbre 142, or a timbre 142 and a tone 144; the client device 30 may instead select second voice data 124 of the simulated character 12 for playback, where the voice parameters 14 of the second voice data 124 differ from those of the first voice data 122. In this example the first voice data 122 and the second voice data 124 are the voice-changed results of the content read aloud by the user, though they are not limited to this.
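The voice-changing pipeline of this embodiment, recognition of the user's speech followed by re-rendering with chosen parameters, can be sketched as two stages. This is a simulation only: `speech_to_text` is a stub standing in for the speech recognition module 104 (a real deployment would call an ASR engine), and for the sketch the "audio" bytes simply carry their own transcript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RenderedSpeech:
    text: str
    timbre: str  # timbre 142
    tone: str    # tone 144

def speech_to_text(audio: bytes) -> str:
    """Stub for the speech recognition module 104 (STT). Here the fake
    'audio' bytes are just UTF-8 text; a real system would decode audio."""
    return audio.decode("utf-8")

def change_voice(audio: bytes, timbre: str = "neutral",
                 tone: str = "medium") -> RenderedSpeech:
    """Recognizes the user speech 36, then re-renders the transcript with
    the requested voice parameters 14, mirroring how the first and second
    voice data differ only in their parameters."""
    return RenderedSpeech(text=speech_to_text(audio), timbre=timbre, tone=tone)
```

Calling `change_voice` twice on the same recording with different parameters yields the analogue of the first voice data 122 (preset parameters) and the second voice data 124 (user-selected parameters): identical text, different voice.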

In summary, the customized voice service system of the present invention allows users to define voice parameters and adjust the voice accordingly into a customized voice, providing customized dubbing for specific events and offering users the possibility of hearing voice broadcasts in a voice they like, thereby improving the user experience.

Accordingly, the present invention is novel, inventive, and industrially applicable, and undoubtedly satisfies the requirements for a patent application under the Patent Act; a utility model patent application is therefore filed in accordance with the law, with the respectful request that the Bureau grant the patent at an early date.

The foregoing, however, describes only preferred embodiments of the present invention and is not intended to limit its scope of implementation; all equivalent changes and modifications made in accordance with the shape, structure, features, and spirit set forth in the claims of the present invention shall fall within the scope of the claims.

1: Customized voice service system

10: First server

102: Text recognition module

12: Simulated character

122: First voice data

124: Second voice data

14: Voice parameters

142: Timbre

144: Tone

20: Second server

22: Database

30: Client device

32: Text

34: Event

40: Network

APP: Application program

Claims (4)

1. A customized voice service system which, when a user selects a text, is used to broadcast the text in a customized voice, the system comprising: a first server comprising a text recognition module, the text recognition module converting text into speech using text-to-speech (TTS) technology; a second server comprising a database in which the text is stored; and a client device communicatively connected to the first server and the second server, configured to use an application to connect to the first server via a network and execute an event, whereupon, according to the event, the application connects to the second server to obtain the text and sends the text to the first server; the first server receives the text and, with the text recognition module, converts the text into first voice data of a simulated character, and the client device selects either the first voice data or second voice data of the simulated character for playback, wherein the first voice data has a preset voice parameter that differs from the voice parameter of the second voice data.

2. The customized voice service system of claim 1, wherein the voice parameter comprises a timbre, or comprises the timbre and a tone.
3. A customized voice service system which, when a user inputs a user speech, is used to play the speech as a customized voice, the system comprising: a first server comprising a speech recognition module, the speech recognition module recognizing speech using speech-to-text (STT) recognition; and a client device communicatively connected to the first server, configured to use an application to connect to the first server via a network and execute an event, whereupon, according to the event, the application sends the user speech to the first server; the first server receives the user speech and, with the speech recognition module, converts the user speech into first voice data of a simulated character, and the client device selects either the first voice data or second voice data of the simulated character for playback, wherein the first voice data has a preset voice parameter that differs from the voice parameter of the second voice data.

4. The customized voice service system of claim 3, wherein the voice parameter comprises a timbre, or comprises the timbre and a tone.
TW110208867U 2021-07-28 2021-07-28 A system for customized speech TWM621764U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW110208867U TWM621764U (en) 2021-07-28 2021-07-28 A system for customized speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW110208867U TWM621764U (en) 2021-07-28 2021-07-28 A system for customized speech

Publications (1)

Publication Number Publication Date
TWM621764U true TWM621764U (en) 2022-01-01

Family

ID=80784853

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110208867U TWM621764U (en) 2021-07-28 2021-07-28 A system for customized speech

Country Status (1)

Country Link
TW (1) TWM621764U (en)

Similar Documents

Publication Publication Date Title
KR102581346B1 (en) Multilingual speech synthesis and cross-language speech replication
US9905220B2 (en) Multilingual prosody generation
CN106898340B (en) Song synthesis method and terminal
US20210366462A1 (en) Emotion classification information-based text-to-speech (tts) method and apparatus
CN110197655B (en) Method and apparatus for synthesizing speech
US20100312565A1 (en) Interactive tts optimization tool
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
JP7228998B2 (en) speech synthesizer and program
US20230206897A1 (en) Electronic apparatus and method for controlling thereof
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
JP6111802B2 (en) Spoken dialogue apparatus and dialogue control method
CN116018638A (en) Synthetic data enhancement using voice conversion and speech recognition models
JP2024508033A (en) Instant learning of text-speech during dialogue
KR20150105075A (en) Apparatus and method for automatic interpretation
CN115668358A (en) Method and system for user interface adaptation for text-to-speech synthesis
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
Panda et al. An efficient model for text-to-speech synthesis in Indian languages
JP6013104B2 (en) Speech synthesis method, apparatus, and program
CN113450760A (en) Method and device for converting text into voice and electronic equipment
CN113948062B (en) Data conversion method and computer storage medium
TWM621764U (en) A system for customized speech
CN114446304A (en) Voice interaction method, data processing method and device and electronic equipment
Ghimire et al. Enhancing the quality of nepali text-to-speech systems
JP2021085943A (en) Voice synthesis device and program
TW202305644A (en) A method for generating customized speech