JP6474518B1

JP6474518B1 - Simple operation voice quality conversion system

Info

Publication number: JP6474518B1
Application number: JP2018126181A
Authority: JP
Inventors: 坪井純子
Original assignee: 坪井純子
Priority date: 2018-07-02
Filing date: 2018-07-02
Publication date: 2019-02-27
Anticipated expiration: 2038-07-02
Also published as: JP2020003762A

Abstract

[Problem]
The object of the present invention is to provide a technique that lets anyone use an information terminal device they already own to make karaoke more varied through simple operations, offers a new way of enjoying a karaoke device that has not existed before, and helps deepen communication with family, friends, and others.
[Solution]
A system capable of freely converting voice quality with a simple operation, comprising a small, lightweight, portable information terminal device and a karaoke device. The information terminal device has a voice quality conversion function that performs a desired voice quality conversion in response to selection of an icon shown by its display function. The karaoke device connects to a wireless LAN compatible with the mobile data communication function of the information terminal device, receives the voice-quality-converted sound information, and outputs it from a speaker via an acoustic mechanism.
[Selected drawing] FIG. 1

Description

The present invention relates to a technique for enjoying music through karaoke. More specifically, it relates to a system that uses a familiar information terminal device, such as a mobile phone, to convert a singer's voice into another person's voice by a simple operation, or to output interjections (ai-no-te, short calls inserted between sung phrases) in animal voices or mechanical sounds from an audio device.

Human voices differ from person to person, and each of us can speak only in our own voice. Many people long to have the pleasant voice of a voice actor or actor, the voice of an older person that gives a dignified impression, or the voice of a cute child, or to sing in a voice other than their own to suit the occasion.

In Japan, many people of every generation enjoy singing with a karaoke device as an everyday pastime. At karaoke venues it is entirely ordinary for people to take turns singing to the accompaniment and to enjoy sharing the same space. However, even when voices of the same pitch sing at the same volume, voice quality differs from person to person: one person's voice sounds flat, a "voice with no spread", while another's is a "voice whose resonance fills the space". Anyone interested in singing notices such individual differences in singing voice. For this reason karaoke devices have long been provided with an echo function (which adds a "delay sound" that lags behind the original sound), and this function is in widespread use. Echo alone, however, is not fully satisfying. Some karaoke devices with a voice changer function have also appeared, but the function can be used only on devices equipped with it and cannot be said to be widely accessible.

Another way of enjoying karaoke is impersonation (monomane), singing as if one were someone else. People who can change their tone of voice and impersonate others tend to be popular and entertain everyone. Impersonation is difficult, however, and not everyone can enjoy karaoke that way. There is therefore a need for a technique that lets even people without a talent for singing or impersonation entertain everyone through simple operations on a familiar device.

Various techniques have been proposed to address these problems. For example, Patent Document 1 discloses a technique titled "Method for real-time spoken natural language translation and apparatus therefor", which is publicly known (see Patent Document 1). Its object is "to provide a method and apparatus for performing real-time automatic translation of spoken language", and its solution is as follows: "A spoken language input is received on one side in a form containing at least one source language and, using natural language processing technology and language translation processing technology, is delivered on the other side in a form containing at least one target language. The spoken language input includes speech, words, sentences, accents, phrases, and pronunciation conveyed from one user to another. For translation into the target output language, the conveyed spoken language input passes, on one side, through a decomposition process performed by natural language decomposition technology and is encoded into a digitized global source decomposition language format. The encoded, decomposed, digitized global source language format is received by the other side and decoded into an integrated digitized global target language format, which, on the other side, passes through a further integration decoding process performed by natural language integration technology and is translated into a spoken language output containing at least one target language." This technique is excellent in that it translates language in real time, and it could be applied not to translation but to converting standard Japanese into regional dialects, baby talk, or the abbreviated slang used among present-day high school girls. If this were to be done in time with karaoke, however, the karaoke device would need dedicated equipment that resolves issues such as the time lag arising during conversion and the adjustment of the number of characters after conversion to match the karaoke tempo. Existing karaoke equipment without such built-in equipment cannot support it.

Patent Document 2 discloses a technique titled "Voice changer device for karaoke", which is publicly known (see Patent Document 2). Its object is to provide "a voice changer operated by the singer's foot switch", and its solution is to "provide a foot switch body partway along the connection between the microphone and the karaoke device, and provide a voice changer circuit inside the foot switch body". That invention, however, requires a special device, the foot switch, and does not achieve the present invention's object of using a device that anyone readily has at hand.

Patent Document 3 discloses a technique titled "Portable audio playback and loudspeaker device", which is publicly known (see Patent Document 3). Its object is "to realize, in a portable karaoke device, a key adjustment function that can be operated simply and is suited to karaoke use", and it describes "a portable audio playback device having a case body housing an audio playback unit and a speaker, and an audio input unit integrally connected to the case body, and comprising audio playback means for extracting an audio playback signal from an audio recording medium of the audio playback unit, and signal mixing and amplification means that mixes in and amplifies a signal based on the input audio signal of external sound received by the audio input unit". Although that invention simplifies operation, it requires a dedicated device, so it cannot be said that anyone can enjoy a new kind of fun using the mobile phone they already carry together with an existing karaoke device. It therefore does not solve the problem addressed by the present invention.

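Patent Document 1: Japanese Translation of PCT Application (Tokuhyo) No. 2010-528389
Patent Document 2: Japanese Unexamined Patent Application Publication (JP-A) No. H10-222185
Patent Document 3: Japanese Unexamined Utility Model Publication (Jikkai) No. H4-126300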

The object of the present invention is to provide a simple-operation voice quality conversion system with which anyone can use their own information terminal device, such as a mobile phone, in place of a microphone and, even with an existing karaoke device, add a wide variety of arrangements or change their voice quality to sing in the persona of another person. Even people who are not comfortable with machines can do all of this easily by operating icons on the display screen of the information terminal device they are used to.

The present invention is a system capable of freely converting voice quality with a simple operation, comprising a small, lightweight, portable information terminal device and a karaoke device. The information terminal device has a voice quality conversion function that performs a desired voice quality conversion in response to selection of an icon shown by its display function. The karaoke device connects to a wireless LAN compatible with the mobile data communication function of the information terminal device, receives the voice-quality-converted sound information, and outputs it from a speaker via an acoustic mechanism.

In the present invention, the voice quality conversion performed by the voice quality conversion function may change the input singer's vocal cord sound source to that of a different person.

In the present invention, the voice quality conversion performed by the voice quality conversion function may also change the sound, based on the input singer's vocal cord sound source, into an animal cry, a robot-like tone, a musical instrument sound, a natural sound, or a mechanical sound.

In the present invention, the voice quality conversion function may also use an angular velocity sensor provided in the information terminal device to detect rotation of the device or a change in its orientation and, according to that movement, produce the sound of a percussion instrument, a rhythm instrument, or the mechanical sound.
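The motion-to-sound mapping described above can be sketched as a simple threshold trigger. This is only an illustration under stated assumptions: the sensor is assumed to deliver angular-speed magnitudes in rad/s, and the function name `detect_shakes`, the threshold, and the refractory count are hypothetical values that do not come from the patent.

```python
def detect_shakes(angular_speed, threshold=3.0, refractory=5):
    """Return sample indices at which a percussion sound should be triggered.

    angular_speed: angular-speed magnitudes (rad/s), one per sensor sample.
    threshold:     hypothetical speed at which a motion counts as a shake.
    refractory:    samples to ignore after a hit so one shake gives one sound.
    """
    hits = []
    last_hit = -refractory
    for i, w in enumerate(angular_speed):
        # Trigger only on the upward crossing of the threshold.
        crossed_up = w >= threshold and (i == 0 or angular_speed[i - 1] < threshold)
        if crossed_up and i - last_hit >= refractory:
            hits.append(i)
            last_hit = i
    return hits


samples = [0.1, 0.2, 3.5, 4.0, 1.0, 0.2, 0.1, 0.3, 5.0, 0.4]
hits = detect_shakes(samples)  # two separate shakes in this made-up trace
```

Each returned index would then start playback of the selected percussion or rhythm-instrument sample.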

The present invention may also add the sound converted by the voice quality conversion function as an interjection.

With the simple-operation voice quality conversion system according to the present invention, changes of voice quality based on singing, as described above, can be made easily by operating the icons shown on the display screen of a familiar mobile phone or similar device, so that even people who are not good at operating electronic devices can enjoy it with ease.

The system also allows users to change their own voice quality into another person's, to add interjections even when singing karaoke alone, and to sing with the sound of an instrument or the voice of an animal, so that anyone can enjoy karaoke.

In addition, the angular velocity sensor (gyro sensor) of the information terminal device captures motions such as changing the orientation of the device or shaking it, producing the effect of playing an idiophone.

FIG. 1 is an explanatory diagram showing the basic configuration of the simple-operation voice quality conversion system according to the present invention.
FIG. 2 is an explanatory diagram showing the basic display screen that appears on the information terminal of the system.
FIG. 3 is an explanatory diagram showing the detailed selection operation screen that appears on the information terminal of the system.
FIG. 4 is an explanatory diagram of the basic configuration of the voice quality conversion according to the present invention.
FIG. 5 is an explanatory diagram of a configuration in which the voice quality conversion changes the voice into another person's voice or an animal's cry.
FIG. 6 is an explanatory diagram of a configuration in which the voice quality conversion function uses an angular velocity sensor.
FIG. 7 is an explanatory diagram of a configuration in which sound converted by the voice quality conversion function is added as an interjection.

FIG. 1 is an explanatory diagram showing the basic configuration of the simple-operation voice quality conversion system 1 according to the present invention. As shown in FIG. 1, the system is broadly divided, in terms of physical articles, into an information terminal device 10 and a karaoke device 20.

The information terminal device 10 is, typically, a device such as a mobile phone that has a communication function and gives access to information. The term broadly includes tablets (information appliances used at home), small personal computers, game consoles, PDAs (Personal Digital Assistants), and the like.

The display function 11 displays the icons 12 used for operation. It is preferably a display device such as a liquid crystal screen, if possible of the touch panel type, so that a request can be issued by touching an icon 12.

An icon 12 displays a symbolic picture of the target into which the voice is to be converted. For example, as shown in FIG. 2(b), pictures such as a man and a woman, or a child and an elderly person, are provided. When a man selects the female icon 12, a command to convert the input male vocal cord sound source 32 into a female vocal cord sound source 32 is output to the voice quality conversion function 13. The icons 12 shown in FIG. 2(b) and FIG. 3(b) are merely examples; the icons are not limited to these and may be other character images, text labels, and the like.

The voice quality conversion function 13 receives the command to convert to the voice quality selected by operating the icon 12, and obtains the converted sound source using techniques such as speech synthesis, corpus-based speech synthesis, voice quality conversion and morphing, prosody generation, and prosody corpora. Concretely, software that currently functions as a voice changer may suitably be used. A simple implementation uses a filter or the like that shifts the pitch (the height of the voice) or the formant frequencies up or down. Devices that change pitch partially or wholly, such as an octaver, pitch shifter, frequency shifter, or ring modulator, and dedicated voice-conversion effectors such as an equalizer, which boost or attenuate arbitrary frequency bands contained in the original sound, may also be used.
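Two of the simple effects named above can be sketched in a few lines. The code below is an illustration under assumptions, not the patent's implementation: `ring_modulate` and `resample_pitch` are hypothetical names, and the resampling approach is only the crudest kind of pitch change, since it also shortens the signal and moves the formants.

```python
import math


def ring_modulate(signal, carrier_hz, sample_rate):
    """Ring modulation: multiply the input by a sine carrier."""
    return [s * math.sin(2.0 * math.pi * carrier_hz * n / sample_rate)
            for n, s in enumerate(signal)]


def resample_pitch(signal, ratio):
    """Naive pitch change by linear-interpolation resampling.

    ratio > 1 raises the pitch (and shortens the signal);
    ratio < 1 lowers it (and lengthens the signal).
    """
    out_len = int(len(signal) / ratio)
    out = []
    for n in range(out_len):
        pos = n * ratio
        i = int(pos)
        frac = pos - i
        nxt = signal[i + 1] if i + 1 < len(signal) else signal[i]
        out.append((1.0 - frac) * signal[i] + frac * nxt)
    return out


sample_rate = 8000
voice = [math.sin(2.0 * math.pi * 200 * n / sample_rate) for n in range(800)]
modulated = ring_modulate(voice, 50, sample_rate)  # 50 Hz carrier is an arbitrary choice
higher = resample_pitch(voice, 2.0)                # one octave up, half as long
```

A practical voice changer would use a pitch shifter that preserves duration and handles the formants separately, as the paragraph above suggests.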

The mobile data communication function 14 connects the information terminal device 10 to a communication network over which digital data can be sent and received. Possible choices include Wi-Fi (registered trademark; Wireless Fidelity), a type of adapter for the wireless LAN 40, and Bluetooth (registered trademark), a short-range wireless communication standard for digital devices. In many recent karaoke devices 20 the microphone M and the operation unit communicate wirelessly, and using those networks is also an effective option. When the information terminal device 10 comes within communication range of such a network, it is desirable to display an icon 12 on the display unit to prompt the user to launch the application.

The text box 16 is an input field for selecting voice feature data by entering, for example, the name of a specific other person; the selection is made by text entry rather than by operating an icon 12. The feature data of the specific person may be stored in advance in the storage of the information terminal device 10, or may suitably be obtained by accessing a server on the Internet.

The karaoke device 20 may be any general karaoke device 20 comprising a main unit, an operation unit, a display unit D, a communication unit, an audio unit, a speaker 22, and so on. It is sufficient that it can receive the vocal cord sound source 32 from a wireless microphone M, amplify it with an amplifier, and output it through the speaker 22, but it preferably also has a function for communicating with a network such as the Internet. Some recent models have a detachable liquid crystal touch panel as the operation unit, which communicates with the main unit, and it is desirable for the mobile data communication function 14 to use those communication functions. More preferably, the content shown on the display unit of the information terminal device 10 is also shown on the display unit D.

The acoustic mechanism 21 is a device that receives, via the mobile data communication function 14, electric signals converted from sound sources such as a person's vocal cord sound source 32 input from the microphone M of the information terminal device 10 or an instrument sound generated by means of the angular velocity sensor 15. It amplifies them as sound sources for reproduction by the speaker 22 and outputs them after mixing them or adding effects.
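As a rough illustration of the mixing and effect stages described above, the sketch below sums a voice stream with an accompaniment stream and adds a feed-forward echo. The function names, gain values, and the use of float samples in the range -1 to 1 are assumptions made for this example only.

```python
def mix(voice, accompaniment, voice_gain=1.0, music_gain=0.6):
    """Sum two sample streams with per-stream gain and hard-limit to [-1, 1]."""
    n = max(len(voice), len(accompaniment))
    out = []
    for i in range(n):
        v = voice[i] if i < len(voice) else 0.0
        m = accompaniment[i] if i < len(accompaniment) else 0.0
        s = voice_gain * v + music_gain * m
        out.append(max(-1.0, min(1.0, s)))
    return out


def add_echo(signal, delay_samples, decay=0.5):
    """Echo effect: add one delayed, attenuated copy of the signal."""
    out = list(signal) + [0.0] * delay_samples
    for i, s in enumerate(signal):
        out[i + delay_samples] += decay * s
    return out


mixed = mix([0.5, 0.9], [0.5, 0.5, 0.5])   # second sample is limited to 1.0
echoed = add_echo([1.0, 0.0, 0.0], 2)      # delayed copy appears two samples later
```

Hard limiting is the simplest way to keep the sum in range; a real mixer would more likely use a smoother limiter or automatic gain.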

The speaker 22 vibrates the air according to the electric signal into which the sound source was converted, amplifying the sound and reproducing the original sound. It is not particularly limited in the present invention; the speaker 22 provided in a general karaoke device 20 may be used.

In the voice quality conversion 30, a sound source converted by the voice quality conversion 30 selected through operation of the icon 12 is obtained using techniques such as speech synthesis, corpus-based speech synthesis, voice quality conversion and morphing, prosody generation, and prosody corpora. The types of voice quality conversion means are described below.

(Synthesis methods)
In human speech production, the source waveform emitted from the glottis resonates in the vocal tract, whose shape changes in complex ways, and thereby produces a variety of sounds. This process can be modeled as an electric circuit consisting of an input signal and a resonant circuit. The model of speech production that treats the source at the glottis and the resonance in the vocal tract separately is called the linear separable equivalent circuit model (or source-filter model), and it underlies present-day speech information processing. One approach to synthesizing speech by computer is based on this production model: a digital filter expressing the resonance characteristics of the vocal tract is constructed, and the production process is simulated by feeding it a source waveform approximated by an impulse train or white noise. This is called the parametric synthesis method (or vocoder method). A different approach, called waveform concatenation synthesis, stores many time-domain waveforms of recorded speech, selects appropriate waveforms (or parts of them), and joins them in the time domain for playback.
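The source-filter idea can be made concrete with a minimal sketch: an impulse train stands in for the glottal source and a single two-pole digital resonator stands in for one vocal tract resonance. The function names and the 100 Hz, 700 Hz, and bandwidth values are illustrative assumptions, not values from the patent.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Voiced source approximated by one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, centre_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator modelling a single vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * centre_hz / sample_rate)
    a2 = -r * r
    y = []
    y1 = y2 = 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


sample_rate = 8000
source = impulse_train(100, sample_rate, 800)        # 100 Hz voiced source
vowel = resonator(source, 700, 100, sample_rate)     # hypothetical resonance near 700 Hz
```

Replacing the impulse train with white noise (for example from `random.gauss`) would give the unvoiced counterpart, and several resonators are needed for a recognizable vowel.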

(Analysis-synthesis and synthesis by rule)
Speech analysis in general refers to converting the time waveform of speech into a time series of feature parameters. Usually an analysis based on the linear separable equivalent circuit model is carried out for each frame (a short time interval) to obtain a time series of feature parameters expressing the vocal tract characteristics. By driving a digital filter built from those feature parameters with a source waveform, the parameter time series can be converted back into a speech waveform. Restoring the original speech waveform in this way, from the parameter time series obtained by analyzing speech that was actually uttered, is called analysis-synthesis.
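Analysis-synthesis for a single frame can be sketched with linear prediction (LPC), one common way of estimating a vocal tract filter. The patent does not name a specific analysis method, so the choice of LPC, the function names, and the synthetic test frame below are assumptions for illustration.

```python
import random


def autocorrelation(frame, max_lag):
    """r[k] = sum over n of frame[n] * frame[n - k], for k = 0..max_lag."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a so that x[n] is approximated by sum a[k] * x[n-1-k]."""
    a = []
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k
    return a


def analyse(frame, a):
    """Inverse-filter the frame to obtain the residual (source) signal."""
    return [frame[n] - sum(a[k] * frame[n - 1 - k]
                           for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(frame))]


def synthesise(residual, a):
    """Drive the all-pole filter with the residual to rebuild the waveform."""
    out = []
    for n, e in enumerate(residual):
        out.append(e + sum(a[k] * out[n - 1 - k]
                           for k in range(len(a)) if n - 1 - k >= 0))
    return out


# Synthetic frame: noise passed through a known resonant filter.
random.seed(0)
frame = []
for n in range(400):
    x1 = frame[-1] if n >= 1 else 0.0
    x2 = frame[-2] if n >= 2 else 0.0
    frame.append(random.gauss(0.0, 1.0) + 1.6 * x1 - 0.9 * x2)

coeffs = levinson_durbin(autocorrelation(frame, 2), 2)  # feature parameters
residual = analyse(frame, coeffs)                        # estimated source
rebuilt = synthesise(residual, coeffs)                   # analysis-synthesis
```

Here the filter coefficients play the role of the feature parameters; changing them before resynthesis is what lets a system alter voice quality.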

To synthesize a sentence (or word, etc.) that the singer has not uttered, segments expressing the characteristics of the speech are prepared in advance for each short unit, such as a phoneme, called a synthesis unit. The sentence to be synthesized is divided into synthesis units, the segment sequence corresponding to the synthesis unit sequence is determined, and the speech is synthesized. A method that synthesizes arbitrary sentences and words from relatively short synthesis units in this way is generally called synthesis by rule. Besides phonemes, syllables, vowel-consonant sequences, and diphones (two-phoneme sequences) have been tried as synthesis units.

(Parametric synthesis method (vocoder method))
In parametric speech synthesis, speech waveforms must be converted into feature parameters so that the characteristics of phonemes can be stored, and speech synthesis using a variety of feature parameters has been carried out.

(Vocal tract analog synthesis and formant synthesis)
It is also desirable to construct the synthesis filter by combining two kinds of filter, each connected in series or in parallel: filters based on vocal tract analog synthesis (articulatory synthesis), which approximates the shape of the vocal tract and simulates the propagation of sound waves within it, and filters based on formant synthesis, which simulates the articulatory characteristics of the vocal tract with an electric circuit of multiple connected resonant circuits. As for speech synthesis using other parameters, hidden Markov model (HMM) speech synthesis, which uses the cepstrum as its feature parameter, has in recent years become a representative parametric synthesis method.

(Waveform concatenation synthesis)
Whereas parametric synthesis works by simulating the speech production process, waveform concatenation synthesis (concatenative synthesis) divides a large number of observed speech waveforms into synthesis units and stores each time waveform in a database as a segment. Speech is synthesized by retrieving from the database the segment sequence best suited to the sentence to be synthesized and joining the segments in the time domain for playback.
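The time-domain joining step can be sketched as a lookup followed by a short linear crossfade at each boundary. The unit labels, the toy waveforms, and the crossfade itself are assumptions for illustration; the patent does not specify how segments are joined.

```python
def crossfade_concat(segments, overlap):
    """Join waveform segments in the time domain with a linear crossfade.

    Every segment must be at least `overlap` samples long.
    """
    out = list(segments[0])
    for seg in segments[1:]:
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)          # weight of the incoming segment
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        out.extend(seg[overlap:])
    return out


# Hypothetical segment database: synthesis unit label -> stored waveform.
database = {
    "ka": [1.0, 1.0, 1.0, 1.0],
    "ra": [0.0, 0.0, 0.0, 0.0],
}
units = ["ka", "ra"]                              # unit sequence for the target text
joined = crossfade_concat([database[u] for u in units], overlap=2)
```

The crossfade reduces the audible discontinuity where two recorded segments meet, which is the main weakness of this family of methods.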

(Segment selection)
When the sentence to be synthesized consists of several synthesis units, the optimal segment sequence is determined as the sequence that minimizes a cost defined over the segment sequence. The cost is defined in terms of linguistic features, such as phoneme and accent type, and physical quantities, such as formant frequencies. In addition, the amount of physical (acoustic) distortion at the join between two adjacent segments is evaluated.
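A common way to carry out this minimization is dynamic programming (a Viterbi search) over a per-unit cost, often called the target cost, and a join cost between adjacent segments, often called the concatenation cost. Those two names, the cost functions, and the toy candidates below are assumptions for illustration; the patent gives no formulas.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search for the candidate sequence with minimum total cost."""
    # best[i][j] = (cost of the best path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev = [best[i - 1][k][0] + join_cost(p, c)
                    for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + target_cost(targets[i], c), k_best))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):     # follow back-pointers
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data: each candidate is (label, F0 at start in Hz, F0 at end in Hz).
targets = [100.0, 120.0]                          # desired mean F0 per unit
candidates = [
    [("a1", 100.0, 100.0), ("a2", 95.0, 115.0)],
    [("b1", 120.0, 120.0), ("b2", 140.0, 100.0)],
]
mean_f0_error = lambda t, c: abs(t - (c[1] + c[2]) / 2.0)
f0_jump = lambda p, c: abs(p[2] - c[1])
path, total = select_units(targets, candidates, mean_f0_error, f0_jump)
labels = [c[0] for c in path]
```

In this example `a1` matches its target perfectly, yet the search picks `a2` because it joins far more smoothly to `b1`, which shows why the join distortion has to be evaluated together with the per-unit cost.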

(PSOLA)
Waveform concatenation synthesis suffers degraded sound quality when the database contains no suitable segment. In particular, because the prosody of speech varies widely, the fundamental frequency and duration of a segment may not match the target synthesis unit well. PSOLA (Pitch Synchronous Overlap and Add) addresses this problem by modifying the fundamental frequency and duration of a segment through time-domain processing. Duration can also be controlled easily by repeating or omitting the extracted single-pitch-period waveforms as needed.
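The overlap-add idea can be sketched in a heavily simplified form that assumes the pitch period is known and constant, so pitch marks fall at fixed intervals. Real PSOLA needs pitch-mark detection and gain normalization, both omitted here; the function name and the test signal are hypothetical.

```python
import math


def psola_constant_pitch(signal, period, new_period):
    """Change the pitch period of a signal whose period is known and constant.

    Two-period Hann-windowed grains are cut at the analysis pitch marks and
    overlap-added at synthesis marks spaced `new_period` apart, so the
    duration is kept while the period changes from `period` to `new_period`.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (2 * period))
              for i in range(2 * period)]
    analysis_marks = list(range(period, len(signal) - period, period))
    out = [0.0] * len(signal)
    t = period
    while t < len(signal) - period:
        # Reuse the grain whose analysis mark is closest to this synthesis mark.
        mark = min(analysis_marks, key=lambda m: abs(m - t))
        for i in range(2 * period):
            out[t - period + i] += window[i] * signal[mark - period + i]
        t += new_period
    return out


# Pulse-like test signal with a period of 80 samples (100 Hz at 8 kHz).
signal = [math.exp(-(n % 80) / 10.0) for n in range(1600)]
shifted = psola_constant_pitch(signal, period=80, new_period=60)
```

Away from the edges the output repeats every 60 samples instead of every 80, so the pitch rises by a factor of 4/3 while the length is unchanged; the overall level changes because the overlapping windows no longer sum to one.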

(Comparison of parametric synthesis and waveform concatenation synthesis)
Because parametric synthesis must drive a filter expressing the articulatory characteristics with a source waveform, its sound quality is somewhat mechanical. On the other hand, joins at segment boundaries are smooth, and fundamental frequency and duration are easy to control. In recent years, control of voice quality and emotion has also become possible.

(Corpus-based speech synthesis)
Corpus-based speech synthesis is well suited to synthesizing the sounds of dogs, cats, and the like. Corpus-based speech synthesis is a method that prepares a speech corpus (database), extracts speech segments from it, and joins them together to synthesize speech; it is also called "unit selection speech synthesis" or "waveform (segment) connection speech synthesis". In the voice quality conversion of the present invention, when the variety in phoneme arrangement, accent, intonation, and so on is small, as with the cries of dogs and cats, drums, maracas, or motorcycle exhaust sounds, sound quality of a practical level with high naturalness can be obtained. However, storage capacity must be secured to hold the sound information as a database.
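The joining of segments taken from a corpus can be sketched as a concatenation with a short linear crossfade. This is only an illustration under stated assumptions: the crossfade itself, the function name `crossfade_concat`, and the `overlap` parameter are not taken from the patent.

```python
from typing import List, Sequence


def crossfade_concat(segments: Sequence[Sequence[float]],
                     overlap: int) -> List[float]:
    """Join sample sequences, blending up to `overlap` samples at each
    boundary so that the connection is not abrupt."""
    out: List[float] = []
    for seg in segments:
        seg = list(seg)
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out
```

A corpus in this sketch would simply be a mapping from a label such as "dog" or "cymbal" to stored sample sequences, from which the segments to be joined are looked up.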

(Voice quality conversion and morphing)
The voice quality conversion 30 converts the voice quality of one singer into that of a different person; it is a device that controls non-linguistic information such as voice quality and individuality without changing the linguistic (phonological) information contained in the voice. Existing text-to-speech technology already performs well enough at conveying linguistic information, in terms of the phonological quality and intelligibility of synthesized speech, but it still falls short in many respects when it comes to adding and controlling non-linguistic and paralinguistic information such as arbitrary individuality, singing style, and emotion. Voice quality conversion, by contrast, is regarded as a technology that makes non-linguistic information easy to control, and various applications are expected, including voice morphing, which generates the voice of an arbitrary intermediate virtual singer between different singers, as well as dialogue systems, medical and welfare equipment, toys, games, and entertainment. The individuality contained in a voice depends on features that appear in its spectral envelope and prosody. Spectral features are determined by the characteristics of the singer's articulatory organs, that is, physical features such as the vocal cords and vocal tract shape, and appear mainly as differences in voice quality. Prosodic features, on the other hand, appear as differences in intonation, accent, pitch, speaking rate, phoneme duration, and so on. Therefore, to realize voice quality conversion, both spectral features and prosodic features must be converted, as described below.

(Spectral conversion)
Consider voice quality conversion from one singer (the original singer) to a different singer (the target singer). Given the spectral feature vector of the original singer's voice at each time (for example, a mel-cepstral coefficient vector or a line spectral frequency vector) and the corresponding feature vector of the target singer, voice quality conversion that focuses on spectral features can be formulated as finding a conversion function that outputs features resembling the target singer from the spectral features of the original singer. In the method based on vector quantization (VQ), training speech data in which the original singer and the target singer utter the same content (phoneme sequence), known as parallel data, is used to create a vector quantization codebook of features for each singer, and the frequency distribution of the correspondence between the vector-quantized features (code vectors) of the original singer and the target singer is obtained. When voice quality conversion is performed, the index of the code vector obtained by vector-quantizing the features of the original singer's input voice is found, and the converted features are obtained from it.
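The VQ-based mapping can be sketched as follows. Several points are assumptions made for the example rather than statements of the patent: the parallel frames are taken to be already time-aligned, the codebooks are given, and the converted feature is computed as the histogram-weighted average of target code vectors. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_index(v: Vector, codebook: Sequence[Vector]) -> int:
    """Vector quantization: index of the closest code vector."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))


def build_histogram(src_frames: Sequence[Vector], tgt_frames: Sequence[Vector],
                    src_cb: Sequence[Vector],
                    tgt_cb: Sequence[Vector]) -> List[List[int]]:
    """Count how often each source code co-occurs with each target code
    in the (assumed pre-aligned) parallel data."""
    hist = [[0] * len(tgt_cb) for _ in src_cb]
    for s, t in zip(src_frames, tgt_frames):
        hist[nearest_index(s, src_cb)][nearest_index(t, tgt_cb)] += 1
    return hist


def convert_frame(v: Vector, src_cb: Sequence[Vector],
                  tgt_cb: Sequence[Vector],
                  hist: Sequence[Sequence[int]]) -> List[float]:
    """Map one source frame to the target singer's feature space."""
    row = hist[nearest_index(v, src_cb)]
    total = sum(row)
    if total == 0:
        return list(v)  # code never seen in training: leave frame unchanged
    dim = len(tgt_cb[0])
    return [sum(row[j] * tgt_cb[j][d] for j in range(len(tgt_cb))) / total
            for d in range(dim)]
```

With one-dimensional toy features, a source frame quantized to a code that co-occurred equally often with two target codes is mapped to the midpoint of those two target code vectors.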

(Prosody conversion)
Along with voice quality, prosodic features are an important factor in a singer's individuality. Compared with spectral conversion, however, the methods available for converting prosodic features are limited. They include a method based on vector quantization similar to that used for spectral conversion, a method that simply matches the mean fundamental frequency and speaking rate to the mean of the target singer's training data, a linear transformation of the fundamental frequency that takes variance into account, and a method based on prosody generation using HMM speech synthesis.
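One way to read the variance-aware linear transformation of the fundamental frequency is to shift and scale each value so that the mean and standard deviation match those of the target singer. The sketch below follows that reading. Working directly on frequencies in Hz, treating zero as "unvoiced", and the function names are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def f0_stats(voiced_f0: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of the voiced F0 values."""
    return mean(voiced_f0), pstdev(voiced_f0)


def convert_f0(f0: float, src_mean: float, src_std: float,
               tgt_mean: float, tgt_std: float) -> float:
    """Linear transform: match the target mean and variance."""
    if f0 <= 0.0:
        return 0.0  # unvoiced frame: nothing to convert
    scale = tgt_std / src_std if src_std > 0.0 else 1.0
    return tgt_mean + scale * (f0 - src_mean)
```

Setting the scale to 1.0 reduces this to the simpler method that only matches the mean.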

(Spectrum and prosody conversion)
When linguistic information is given in advance for voice quality conversion, both spectral and prosodic features can be converted on the basis of HMM speech synthesis and singer adaptation. HMM speech synthesis uses speech-unit HMMs that model the spectrum, fundamental frequency, and phoneme duration simultaneously. By adapting these speech-unit HMMs with a technique similar to the singer adaptation used in speech recognition, speech with arbitrary singer characteristics, including spectral and prosodic features, can be generated. By using an average voice model trained on speech data from multiple singers as the model of the original singer, speech with singer characteristics close to those of any target singer can be synthesized from a small amount of adaptation data uttered by that singer. The approach is also easy to apply to the control of paralinguistic and non-linguistic information, including speaking style and emotional expression.

(Prosody generation)
Prosody, which expresses the pitch, speed, and strength of the voice, is carried mainly by the following acoustic features, respectively: the fundamental frequency of the vocal cords, the duration of the segment corresponding to each phoneme (hereinafter, phoneme length), and the speech amplitude. To give synthesized speech appropriate prosody, the mechanisms of prosody control have been analyzed and mathematically modeled. As shown below, each feature value is calculated from linguistic information obtained from the input text or the like, using a statistical model based on actual speech data.

(Control of the fundamental frequency of the vocal cords)
Voice pitch is important for expressing accent and intonation, and it is realized by controlling the fundamental frequency of the vocal cords. In particular, Japanese accent is a so-called pitch accent, realized as local rises and falls in the fundamental frequency, and the fundamental frequency is also controlled to express intonation.

(Prosodic corpus)
A speech corpus used for speech recognition or synthesis is designed mainly with attention to the segmental features of speech (features of consonants, vowels, and their combinations). A prosodic corpus, by contrast, is a corpus designed with attention to the prosodic features of speech. Prosodic features include accent, stress, tone, prominence, and intonation. Linguistically, duration information is also a prosodic feature, but it is information that ordinary speech corpora provide as well.

(Hierarchical structure of prosodic features)
A hierarchical structure is usually assumed when describing the prosodic structure of speech. For Japanese, the lowest level is usually a segmental unit of sound with a certain duration, from which the structure passes through syllables, words, accent phrases, intermediate phrases, and so on, up to the utterance at the highest level. In other words, the interaction between words must be considered. The accent phrase, a unit that in many cases corresponds in size to a phrase (bunsetsu), and the intermediate phrase, which lies between the accent phrase and the utterance, are identified.

The sound information 31 is a sound source obtained by any one of the methods of the voice quality conversion 30 or by a combination of them. It is the information of the sound that is amplified by the acoustic device and perceived by human hearing via the speaker 22.

The vocal cord sound source 32 refers to a sound source in which the resonance generated in the resonance cavity called the human pharyngeal cavity is emitted; it is the sound source that characterizes an individual's voice, as determined by the distribution of formant frequencies. Such a sound source is generated by signal coding such as waveform coding, vocoder coding, or hybrid coding.

The wireless LAN 40 means a wireless communication network such as a communication network (Wi-Fi (registered trademark)) based on a common standard typified by the currently standardized IEEE 802.11. Wired connection is not excluded, but it is desirable to support high speeds through communication under such a standard. Bluetooth (registered trademark), a communication network under IEEE 802.15.1, which is likewise a wireless standard, may also be used. Bluetooth (registered trademark) can communicate over distances of about 10 meters, and since recent smartphones usually have a mobile data communication function 14 that supports these networks, connecting to the karaoke apparatus 20 is easy.

FIG. 2 is a configuration explanatory diagram illustrating a configuration in which the voice quality conversion 30 according to the present invention changes the voice to another person's vocal cord sound source 32. FIG. 2(a) shows the portable information terminal 10 displaying its menu screen, FIG. 2(b) shows the menu screen displayed when the simple operation voice quality conversion system 1 according to the present invention has been started, and FIGS. 2(c), 2(d), and 2(e) show screens for making more detailed settings from each icon 12 selected on the menu screen. What is shown in the drawings is only an example, and the display is not limited to it; the same applies to the hierarchy. FIG. 3 is a detailed selection operation screen explanatory diagram showing the detailed selection operation screens that appear on the information terminal of the simple operation voice quality conversion system 1 according to the present invention. FIG. 3(a), like FIG. 2(b), shows the menu screen displayed when the simple operation voice quality conversion system 1 has been started; FIG. 3(b) shows a screen on which the voice quality conversion function 13 changes the voice to another person's vocal cord sound source 32; FIG. 3(c) shows a screen for changing to a vocal cord sound source 32 based on an animal cry 33; FIG. 3(d) shows a screen for changing to a sound source derived from an instrument sound 35, a mechanical sound 37, or the like; and FIG. 3(e) shows one example in which the screens unfold successively, in a hierarchy, to the next more detailed selection screen.

FIG. 4 is a basic configuration explanatory diagram illustrating the basic configuration of voice quality conversion according to the present invention. First, the application is started by selecting the icon 12 on the menu screen of the portable information terminal 10 (FIG. 2(a)), which starts the simple operation voice quality conversion system 1 according to the present invention. The desired voice quality is then selected from the system's menu screen (FIG. 2(b)); at the same time, a connection allowing mutual communication is made through the mobile data communication function 14 of the karaoke apparatus 20. The singing voice is input as the vocal cord sound source 32 through the microphone M of the portable information terminal 10, the voice quality conversion function 13 executes the voice quality conversion 30, and the result is output from the speaker 22 using the acoustic device 21.

FIG. 5 is a configuration explanatory diagram illustrating configurations in which the voice quality conversion 30 according to the present invention changes the voice to another person's voice or to an animal cry. FIG. 5(a) illustrates the configuration in which the voice quality conversion 30 changes the voice to another person's vocal cord sound source 32, and FIG. 5(b) illustrates the configuration in which the voice quality conversion changes the voice to an animal cry. Using the speech synthesis and related techniques described above, the voice is changed to a vocal cord sound source 32 having the characteristics of the other person's voice, to an animal cry, or the like; the other voice obtained by techniques such as synthesizing the wide variety of speech data or selecting phonemes is output from the acoustic device 21.

The animal cry 33 is produced by voice quality conversion into the cries of various animals such as cats, dogs, lions, and wolves. For selecting the conversion, icons 12 with each animal as a motif are provided so that they can be recognized at a glance. For example, as shown in FIG. 3(c), operating the icon 12 with a dog and cat motif shown on the display function 11 of the information terminal device 10 switches to the animal mode screen, and on the next screen various animals such as a dog, cat, crow, or elephant can be selected. If a dog is selected, the animal cry 33 is the vocal cord sound source 32 after it has been changed, on the basis of the input vocal cord sound source 32, into a sound such as "woof, woof, woof" by the voice quality conversion function 13, such as an effector.

The robot tone 34 is produced by voice quality conversion 30 into the voice uttered by a talking robot. For selecting the conversion, an icon 12 with a robot motif, as shown in FIG. 2(b) and FIG. 3(d), is provided so that it can be recognized at a glance. For example, when the robot-motif icon 12 is operated, the robot tone 34 is the sound information 31 output after using the voice quality conversion function 13, which, on the basis of the input vocal cord sound source 32, combines two sound sources of a robot voice pronounced one sound at a time, such as "wa-ta-shi-wa...", and sets their pitches to different values using the frequencies of a tone generator, producing a robot-like voice.

The instrument sound 35 is produced by voice quality conversion 30 into the sound of any instrument, such as a piano, stringed instrument, or wind instrument. For selecting the conversion, icons 12 with each instrument as a motif are provided so that they can be recognized at a glance. For example, as shown in FIG. 3(c), when the icon 12 with a guitar motif is operated, the instrument sound 35 is the sound information 31 output after using the voice quality conversion function 13, which changes the voice into a beautiful guitar sound such as "poron, poron" on the basis of the scale or pitch of the input vocal cord sound source 32.

The natural sound 36 is produced by voice quality conversion 30 into sounds that exist in nature, such as the sound of wind, the sound of waves, or the murmur of a river. For selecting the conversion, although not shown in the drawings, icons 12 that represent each sound in text, similar to those shown in FIG. 3(e), may be provided so that they can be recognized at a glance. For example, when the icon 12 bearing the text for waves is operated, the natural sound 36 is the sound information 31 output after using the voice quality conversion function 13, which changes the voice into a wave sound such as "zazaza" on the basis of the scale or pitch of the input vocal cord sound source 32.

The mechanical sound 37 is produced by voice quality conversion 30 into sounds emitted by mechanical things such as a motorcycle or a steam whistle. For selecting the conversion, icons 12 with a motorcycle or ship motif, as shown for example in FIG. 2(b) and FIG. 3(d), may be provided so that they can be recognized at a glance. For example, when the icon 12 bearing the text for a motorcycle is operated, the mechanical sound 37 is the sound information 31 output after using the voice quality conversion function 13, which changes the voice into a motorcycle sound such as "vroom, vroom" on the basis of the scale or pitch of the input vocal cord sound source 32.

FIG. 6 is a configuration explanatory diagram of a configuration in which the angular velocity sensor 15 is used with the voice quality conversion function 13 according to the present invention. The configuration uses two routes: a route that performs the voice quality conversion 30 on the basis of the vocal cord sound source 32, and a route that performs either parameter synthesis, in which a pre-registered filter expressing articulation characteristics is driven by the vocal cord sound source 32, or corpus-based speech synthesis, in which a speech corpus (database) is prepared and speech segments of percussion instruments 38 such as drums and cymbals stored in it are extracted and joined to synthesize sound. The timing at which these routes are combined is matched to the movement detected by the angular velocity sensor 15, and the sound information 31 is output.

The angular velocity sensor 15 detects changes in rotation and orientation and is used for the gyro function of mobile phones, small game machines, and the like. For example, when a mobile phone is shaken, the sensor detects the motion in which the phone's direction of movement changes in a continuous rhythm, making it possible to play a predetermined sound source such as a rhythm instrument or percussion instrument in time with those changes of direction. Specifically, it uses the Coriolis force that acts orthogonally to the vibration when an angular velocity is applied to a mass vibrating at a predetermined speed, and it is generally also called a gyro sensor. Many recent mobile phones and game machines have one built in, and making use of it is effective.
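The idea of triggering a sound whenever the shaking direction changes can be sketched as detecting sign reversals in a stream of angular velocity samples. This is a minimal sketch under assumptions: the samples are taken to be single-axis values already read from the sensor, and the function name `shake_triggers` and the `threshold` dead zone are hypothetical.

```python
from typing import List, Sequence


def shake_triggers(angular_velocity: Sequence[float],
                   threshold: float = 1.0) -> List[int]:
    """Return the sample indices at which the rotation direction
    reverses; each index is a moment to play a percussion sound."""
    triggers: List[int] = []
    last_sign = 0
    for i, w in enumerate(angular_velocity):
        if abs(w) < threshold:
            continue  # ignore small jitter around zero
        sign = 1 if w > 0 else -1
        if last_sign != 0 and sign != last_sign:
            triggers.append(i)
        last_sign = sign
    return triggers
```

The dead zone keeps sensor noise near zero from being mistaken for a deliberate change of direction.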

The percussion instrument 38 means an instrument, such as a drum or cymbal, that produces sound from its own body when struck; in the present invention in particular, the percussion instrument 38 refers generally to instruments without a definite scale. This sound source is not based on the vocal cord sound source 32. It is the sound information 31 output in response to the operation of the angular velocity sensor 15, using either parameter synthesis, in which a pre-registered filter expressing articulation characteristics is driven by the vocal cord sound source 32, or corpus-based speech synthesis, in which a speech corpus (database) is prepared and speech segments of percussion instruments 38 such as drums and cymbals stored in it are extracted and joined to synthesize sound.

The idiophone (body-sounding instrument) 39 means an instrument, such as maracas, that produces sound through motions such as shaking or rubbing; in the present invention in particular, like the percussion instrument 38, the idiophone 39 refers to instruments without a definite scale. This sound source is not based on the vocal cord sound source 32. It is the sound information 31 output in response to the operation of the angular velocity sensor 15, using either parameter synthesis, in which a pre-registered filter expressing articulation characteristics is driven by the vocal cord sound source 32, or corpus-based speech synthesis, in which a speech corpus (database) is prepared and speech segments of idiophones 39 such as maracas stored in it are extracted and joined to synthesize sound.

FIG. 7 is a configuration explanatory diagram showing a configuration in which the sound converted by the voice quality conversion function 13 according to the present invention is added as an interjection (ainote).

The interjection (ainote) 23 works as follows: the sounds obtained by the voice quality conversion function 13 for the animal cry 33, instrument sound 35, natural sound 36, or mechanical sound 37 selected with the icon 12 are stored as speech segments in a speech corpus (database), and a speech segment matched to the pitch at a break, such as every measure, is added to the vocal cord sound source 32 and output. Specifically, for example, the cat icon 12 shown in FIG. 2(b) and FIG. 3(c) is selected, then the icon 12 for the interjection 23 is selected in FIG. 3(e), and then, although not shown in the drawings, the timing of the interjection is set to every four measures. As a result, the interjection 23 "nyan, nyan" in a cat's voice is output from the acoustic device 21 every four measures. As other uses, it is also possible to enjoy a one-person round using the singer's own vocal cord sound source 32 as is, a one-person round in another person's voice quality, or a chorus in the voice qualities of many other people.
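The "every four measures" timing can be sketched as a small schedule calculation. The patent does not say how the timing is derived, so the inputs here (tempo in beats per minute, beats per measure, song length) and the choice to place each interjection at the end of every N-th measure are assumptions, and `interjection_times` is a hypothetical name.

```python
from typing import List


def interjection_times(tempo_bpm: float, beats_per_measure: int,
                       every_n_measures: int,
                       song_length_s: float) -> List[float]:
    """Times in seconds, from the start of the song, at which an
    interjection would be played."""
    if tempo_bpm <= 0 or beats_per_measure <= 0 or every_n_measures <= 0:
        raise ValueError("tempo, beats and measure count must be positive")
    interval = every_n_measures * beats_per_measure * 60.0 / tempo_bpm
    times: List[float] = []
    k = 1
    while k * interval <= song_length_s:
        times.append(k * interval)
        k += 1
    return times
```

At 120 beats per minute in four-four time, four measures last 8 seconds, so a 30-second excerpt would get interjections at 8, 16, and 24 seconds.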

The simple operation voice quality conversion system 1 according to the present invention makes it possible to enjoy karaoke, an entertainment closely tied to the lives of Japanese people, in a wider variety of ways, and even people who are not good at operating machines can enjoy richly varied arrangements through simple operations. Its industrial applicability is therefore extremely high in a wide range of fields, including karaoke venues, karaoke apparatus manufacturers, and portable information terminal manufacturers.

DESCRIPTION OF SYMBOLS
1 Simple operation voice quality conversion system
10 Information terminal device
11 Display function
12 Icon
13 Voice quality conversion function
14 Mobile data communication function
15 Angular velocity sensor
16 Text box
20 Karaoke apparatus
21 Acoustic mechanism
22 Speaker
23 Interjection (ainote)
30 Voice quality conversion
31 Sound information
32 Vocal cord sound source
33 Animal cry
34 Robot tone
35 Instrument sound
36 Natural sound
37 Mechanical sound
38 Percussion instrument
39 Idiophone
40 Wireless LAN
M Microphone
D Display

Claims

A simple operation voice quality conversion system (1) capable of freely converting voice quality by simple operation, comprising:
a small, lightweight, portable information terminal device (10) that the user has close at hand; and
a karaoke apparatus (20), wherein
the information terminal device (10) has a voice quality conversion function (13) that performs a desired voice quality conversion (30) through a selection operation on an icon (12) shown by a display function (11),
the voice quality conversion (30) converts the features of the input voice of the original singer into the features of the voice of a target singer by converting either or both of spectral features and prosodic features,
the karaoke apparatus (20) is automatically connected to the information terminal device (10), in conjunction with the icon operation on the information terminal device (10), over a wireless LAN (40) corresponding to the mobile data communication function (14) of the information terminal device (10), and
when the voice of the original singer is input from a microphone (M), sound information (31) based on the vocal cord sound source (32) converted by the voice quality conversion is output from a speaker (22) via an acoustic mechanism (21).

The simple operation voice quality conversion system (1) according to claim 1, wherein the voice quality conversion (30) performed by the voice quality conversion function (13) changes the voice, on the basis of the input vocal cord sound source (32) of the original singer, into an animal cry (33), a robot tone (34), an instrument sound (35), a natural sound (36), or a mechanical sound (37).

The simple operation voice quality conversion system (1) according to claim 1, wherein the voice quality conversion (30) performed by the voice quality conversion function (13) uses the angular velocity sensor (15) of the information terminal device (10) to detect rotation or a change in orientation of the information terminal device (10) and, in accordance with that movement, converts into the sound of a percussion instrument (38), an idiophone (39), or a mechanical sound (37).

The simple operation voice quality conversion system (1) according to any one of claims 1 to 3, wherein the sound converted by the voice quality conversion function (13) is added as an interjection (ainote) (23).