JP6790732B2 - Signal processing method and signal processing device

Info

Publication number: JP6790732B2
Application number: JP2016214891A
Authority: JP
Inventors: 嘉山　啓
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2016-11-02
Filing date: 2016-11-02
Publication date: 2020-11-25
Anticipated expiration: 2036-11-02
Also published as: JP2018072699A

Description

The present invention relates to signal processing technology for singing voices.

In recent years, it has become common for people who are not professional singers to record videos of themselves singing and upload them to video posting sites. Such videos are called "utattemita" videos (literally, "I tried singing" videos; hereinafter "singing videos") and are one of the popular genres on video posting sites.

Patent Document 1: Japanese Unexamined Patent Publication No. 2000-003200

Posters of singing videos often post their videos in much the same spirit as singing a karaoke song. Unlike karaoke singing, however, a video posted to a video posting site can be viewed by an unspecified large number of users. Consequently, if the poster's singing technique is inadequate and the singing is unpleasant to listen to, viewers may be put off and flood the video with blunt, harsh comments, a situation known as "flaming". Because such a situation makes it difficult to post further videos, some posters of singing videos wish to correct their singing voice, before posting, into one that gives listeners the impression of skilled singing. Conventionally, however, no technology has met this need.

One example of a technique for changing the impression given by a singing voice is the technique disclosed in Patent Document 1. Patent Document 1 discloses a technique that applies pitch conversion to a male voice and then adds breath noise matched to the formants of the converted voice, thereby converting it into a natural-sounding female voice. However, the technique disclosed in Patent Document 1 cannot change the impression of how skilled the singing is.

The present invention has been made in view of the problem described above, and its object is to provide a technique that makes it possible to change the impression of how skilled a singing voice is while preserving the individuality of the singer.

To solve the above problem, the present invention provides a signal processing method having the following identifying step and correction step. The identifying step identifies, from singing voice data representing a singing voice, a voiced section of the singing voice as a correction target section. The correction step applies to the singing voice data, for the correction target section identified in the identifying step, a correction that raises or lowers the amplitude of the frequency components around the third formant within a range that does not change the shape of the spectral envelope around the third formant.

One cause of a listener's impression that singing is poor is a shortage of frequency components around the third formant in voiced sections (that is, the amplitudes of those frequency components are small). When the frequency components around the third formant are sufficient, the singing voice has a sense of volume, as if an opera singer were singing (also described as a firm singing voice, a sonorous singing voice, or a rich and deep singing voice), and is perceived as skilled singing. When they are insufficient, the singing sounds thin, lacking firmness and depth, and is perceived as poor singing.

According to the present invention, raising the amplitude of the frequency components around the third formant in voiced sections makes it possible to give the listener the impression that the singing is more skilled than before the correction. Conversely, lowering the amplitude of those frequency components makes it possible to give the listener the impression that the singing voice is less skilled than before the correction (in other words, more amateurish). In addition, according to the present invention, the correction is applied only to voiced sections; unvoiced sections are left uncorrected even when they contain sound, so the individuality of the singer remains. Furthermore, even in voiced sections, the amplitude of the frequency components around the third formant is raised (or lowered) only within a range that does not change the shape of the spectral envelope before correction, so the individuality of the singer that derives from the shape of the spectral envelope does not disappear entirely. In this way, the present invention makes it possible to change the impression of how skilled a singing voice is while preserving the individuality of the singer.

To solve the above problem, the present invention also provides a signal processing device having the following identifying means and correction means. The identifying means identifies, from singing voice data representing a singing voice, a voiced section of the singing voice as a correction target section. The correction means applies to the singing voice data, for the correction target section identified by the identifying means, a correction that raises or lowers the amplitude of the frequency components around the third formant within a range that does not change the shape of the spectral envelope around the third formant. Such a signal processing device also makes it possible to change the impression of how skilled a singing voice is while preserving the individuality of the singer.

As another aspect of the present invention, a program may be provided that causes a general-purpose computer such as a CPU (Central Processing Unit) to execute the above signal processing method (in other words, a program that causes the computer to function as the identifying means and the correction means). This aspect allows a general-purpose computer to function as the signal processing device of the present invention, and it also makes it possible to change the impression of how skilled a singing voice is while preserving the individuality of the singer. Specific ways of providing (distributing) the program include writing it to a computer-readable recording medium such as a CD-ROM (Compact Disk-Read Only Memory) or a flash ROM and distributing that medium, and distributing it by download over a telecommunication line such as the Internet.

FIG. 1 is a diagram showing a configuration example of a signal processing device 10A according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the flow of the singing voice correction process that the control unit 100 of the signal processing device 10A executes in accordance with the singing voice correction program 1340A.
FIG. 3 is a diagram showing an example of the correction amount data.
FIG. 4 is a diagram for explaining the effect of the present embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(A: Embodiment)
FIG. 1 is a diagram showing a configuration example of a signal processing device 10A according to an embodiment of the present invention. The signal processing device 10A is, for example, a personal computer and, as shown in FIG. 1, has a control unit 100, an external device interface (hereinafter abbreviated as "I/F") unit 110, a communication I/F unit 120, a storage unit 130, and a bus 140 that mediates the exchange of data among these components.

The signal processing device 10A is used by a poster when posting a singing video to a video posting site. A singing video is a video obtained by recording the poster or another person singing. Posting a video to a video posting site means uploading video data to the server device of the video posting site. The video data of a singing video includes singing voice data representing the singing voice for the entire song that was sung (for example, the singing voice for one song). A specific example of such singing voice data is a sample sequence obtained by sampling the waveform of the singing voice at a predetermined sampling period.

The signal processing device 10A executes a singing voice correction process on the singing voice data; this signal processing is the distinguishing feature of the present embodiment. The singing voice correction process corrects the singing voice data so that the listener receives the impression of skilled singing while the individuality of the singer of the singing voice represented by that data is preserved. By applying the singing voice correction process to the singing voice data contained in the video data before uploading it, the poster of a singing video can post a singing voice that has been corrected to give listeners the impression of skilled singing. The role of each part of the signal processing device 10A is described below.

The control unit 100 is, for example, a CPU. The control unit 100 functions as the control center of the signal processing device 10A by operating in accordance with programs stored in advance in the storage unit 130 (more precisely, in the non-volatile storage unit 134). The details of the processing that the control unit 100 executes in accordance with the various programs stored in advance in the non-volatile storage unit 134 are explained later.

The external device I/F unit 110 is a collection of interfaces for connecting other electronic devices, such as a USB (Universal Serial Bus) interface, a serial interface, and a parallel interface. The external device I/F unit 110 passes data received from a connected electronic device to the control unit 100, and outputs data given by the control unit 100 to that electronic device. In the present embodiment, a recording medium storing the singing voice data that represents the singing voice in a singing video is connected to the external device I/F unit 110, and the control unit 100 reads the singing voice data stored on the recording medium as the processing target and executes the singing voice correction process.

The communication I/F unit 120 is, for example, a NIC (Network Interface Card). The communication I/F unit 120 is connected to a telecommunication line such as the Internet via a communication line such as a LAN (Local Area Network) cable and a relay device such as a router. The communication I/F unit 120 receives data transmitted from the telecommunication line to which it is connected and passes it to the control unit 100, and sends data passed from the control unit 100 out to that telecommunication line. For example, in response to a user's instruction, the control unit 100 transmits video data containing singing voice data that has undergone the singing voice correction process to the server device of a video posting site via the communication I/F unit 120. The singing video is thereby posted.

As shown in FIG. 1, the storage unit 130 has a volatile storage unit 132 and a non-volatile storage unit 134. The volatile storage unit 132 is, for example, a RAM (Random Access Memory) and is used by the control unit 100 as a work area when executing programs. The non-volatile storage unit 134 is, for example, a flash ROM (Read Only Memory) or a hard disk. The non-volatile storage unit 134 stores in advance a singing voice correction program 1340A that causes the control unit 100 to execute the singing voice correction process. Although not shown in detail in FIG. 1, the non-volatile storage unit 134 also stores in advance a kernel program and a communication control program. The kernel program causes the control unit 100 to implement an OS (Operating System). The communication control program causes the control unit 100 to execute a process of uploading video data to the server device of a video posting site in accordance with a predetermined communication protocol such as FTP (File Transfer Protocol).

When the power supply of the signal processing device 10A (not shown in FIG. 1) is turned on, the control unit 100 first reads the kernel program from the non-volatile storage unit 134 into the volatile storage unit 132 and starts executing it. While operating in accordance with the kernel program and implementing the OS, the control unit 100 reads from the non-volatile storage unit 134 into the volatile storage unit 132 any program whose execution is instructed through an operation on an operation input device (for example, a mouse or keyboard; not shown in FIG. 1) connected to the external device I/F unit 110, and starts executing that program.

When execution of the singing voice correction program 1340A is instructed through an operation on the operation input device, the control unit 100 reads the singing voice correction program 1340A from the non-volatile storage unit 134 into the volatile storage unit 132 and starts executing it. The control unit 100 operating in accordance with the singing voice correction program 1340A executes the singing voice correction process. FIG. 2 is a flowchart showing the flow of the singing voice correction process. As shown in FIG. 2, the singing voice correction process includes two steps: an identifying step SA100 and a correction step SA110.

The identifying step SA100 identifies, from the singing voice represented by the singing voice data being processed, the correction target sections, that is, the sections to be corrected so that the listener receives the impression of skilled singing. In the present embodiment, the control unit 100 identifies the voiced sections of the singing voice as the correction target sections. A voiced section is a section in which a voiced sound is being produced, and in the present embodiment a voiced sound means a vowel. Although only vowels are treated as voiced sounds in the present embodiment, certain consonants may be included in addition to vowels (the plosives [b], [d], and [g]; the fricatives [v] and [z]; the nasals [m] and [n]; and the liquids [l] and [r]).

To identify the voiced sections of the singing voice, the control unit 100 divides the singing voice data being processed into frames of a predetermined length, applies a time-frequency transform to convert each frame into frequency-domain data, and attempts to extract a pitch (fundamental frequency) for each frame. This works because a voiced sound has a pitch, whereas an unvoiced sound or silence does not. The control unit 100 then treats each voiced section identified in this way as a correction target section and, for each correction target section, writes to the volatile storage unit 132 data indicating its start time and end time, measured from the beginning of the singing voice data being processed.
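The frame-by-frame voicing decision described above can be sketched roughly as follows. This is a minimal illustration, not the embodiment's implementation: the embodiment extracts pitch from frequency-domain data, whereas this sketch substitutes a simpler time-domain normalized autocorrelation test. The function names, the frame length, the pitch search range (80 to 1000 Hz), and the 0.5 threshold are all assumptions chosen for illustration.

```python
def frame_is_voiced(frame, sample_rate, fmin=80.0, fmax=1000.0,
                    threshold=0.5, min_energy=1e-9):
    """Guess whether a frame has a pitch, using normalized autocorrelation.

    All numeric defaults are illustrative assumptions, not values from the source.
    """
    energy = sum(x * x for x in frame)
    if min_energy >= energy:
        return False  # silence: no pitch can be extracted
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        acc = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, acc / energy)
    # A strong peak at some lag means the frame is periodic, i.e. pitched.
    return best >= threshold


def find_voiced_sections(samples, sample_rate, frame_len=512):
    """Return (start_sec, end_sec) pairs for runs of consecutive voiced frames.

    Times are measured from the beginning of the sample sequence.
    """
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        t = k * frame_len / sample_rate
        if frame_is_voiced(frame, sample_rate):
            if start is None:
                start = t          # a voiced run begins
        elif start is not None:
            sections.append((start, t))  # a voiced run ends
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_len / sample_rate))
    return sections
```

Each returned pair plays the role of the start-time and end-time data that the control unit 100 records for a correction target section. A production pitch tracker would need to handle noisy or breathy frames more carefully than this threshold test does.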

The correction step SA110 raises, for each correction target section identified in the identifying step SA100, the amplitude of the frequency components around the third formant within a range that does not change the shape of the spectral envelope in that section. Formants are the multiple peaks, moving over time, that appear in the spectrum of the voice of a person who is speaking, and the third formant is the peak with the third-lowest frequency. In general, when the amplitudes of the frequency components at the third formant and in its vicinity (together referred to as "around the third formant") are sufficient, the singing voice has a sense of volume, as if an opera singer were singing (also described as a firm singing voice, a sonorous singing voice, or a rich and deep singing voice), and is perceived as skilled singing. When the frequency components around the third formant are insufficient, the singing sounds thin, lacking firmness and depth, and is perceived as poor singing. For this reason, the present embodiment raises the amplitude of the frequency components around the third formant in the correction target sections. The amount by which the amplitude of each frequency component around the third formant is corrected is limited to a range that does not change the shape of the spectral envelope so that the individuality of the singer, which derives from the shape of the spectral envelope, is not lost.
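The definition of the third formant as the third-lowest peak can be illustrated with a short sketch. It assumes that a smoothed spectral envelope has already been estimated by some means (the source does not say how), sampled at ascending frequencies; the function names are hypothetical, and real envelopes would likely need additional smoothing or peak-prominence checks.

```python
def find_formants(freqs, envelope):
    """Return the frequencies of the local maxima of a spectral envelope.

    freqs must be in ascending order, so the result is lowest peak first.
    """
    peaks = []
    for i in range(1, len(envelope) - 1):
        if envelope[i] > envelope[i - 1] and envelope[i] >= envelope[i + 1]:
            peaks.append(freqs[i])
    return peaks


def third_formant(freqs, envelope):
    """Return the frequency of the third-lowest peak, or None if there are fewer than three."""
    peaks = find_formants(freqs, envelope)
    if len(peaks) >= 3:
        return peaks[2]
    return None
```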

Correction amount data (see FIG. 3) is embedded in advance in the singing voice correction program of the present embodiment. This data defines the correction amount (a ratio relative to the original amplitude) used when raising the amplitude of each frequency component within a range that does not change the shape of the spectral envelope around the third formant. Suitable values for the frequency range Q and the correction amount G of each frequency component in FIG. 3 may be determined through experiments or the like. For each correction target section, the control unit 100 converts the waveform data of that section into frequency-domain data, aligns the center frequency F of the frequency range Q with the third formant of that section, and applies EQ processing according to the correction amount data (processing that corrects the amplitudes of both the harmonic and non-harmonic components of the voice) to the frequency components around the third formant, thereby raising the amplitude of each frequency component.
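The correction amount data and its application can be sketched as a frequency-dependent gain. The source leaves the actual values of Q and G to experiment and does not specify the shape of the curve in FIG. 3, so the raised-cosine taper used here, together with the function names and parameters, is purely an assumption for illustration. The sketch operates on a list of per-bin amplitudes and omits the time-frequency transform and its inverse.

```python
import math


def correction_gain(freq, center, width, peak_gain):
    """Gain (ratio to the original amplitude) for one frequency component.

    center    : third formant frequency, aligned with the center frequency F
    width     : width of the frequency range Q (assumed symmetric about F)
    peak_gain : gain at the center; above 1 raises, below 1 lowers
    The raised-cosine taper is an assumed shape, not taken from the source.
    """
    half = width / 2.0
    offset = abs(freq - center)
    if offset >= half:
        return 1.0  # outside the range Q: leave the component untouched
    taper = 0.5 * (1.0 + math.cos(math.pi * offset / half))
    return 1.0 + (peak_gain - 1.0) * taper


def apply_correction(freqs, amplitudes, f3, width, peak_gain):
    """Scale each bin amplitude by the gain for its frequency."""
    return [a * correction_gain(f, f3, width, peak_gain)
            for f, a in zip(freqs, amplitudes)]
```

Because the gain falls smoothly to 1 at the edges of the range, the corrected envelope joins the uncorrected envelope without a step. A peak gain below 1 gives the lowering correction that the source mentions as an alternative.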

FIG. 4 is a diagram for explaining the effect of the present embodiment. In FIG. 4, the spectral envelope of a certain correction target section before correction is drawn with a dotted line, and the spectral envelope after correction is drawn with a solid line. The notes that make up the score of the singing voice being corrected are drawn as rectangles. The frequency interval from f3s to f3e in FIG. 4 is the frequency interval around the third formant, and the center frequency of that interval is the third formant. With the signal processing device 10A of the present embodiment, the amplitudes of the frequency components belonging to that frequency interval are corrected according to the correction amount data, yielding the spectral envelope shown by the solid line in FIG. 4. As a result, the singing voice is corrected into a skilled-sounding voice with a sense of volume, as if an opera singer were singing. In FIG. 4, the amplitudes of the frequency components around the third formant are not corrected outside the correction target section, so there the solid-line and dotted-line spectral envelopes coincide.

With the signal processing device 10A of the present embodiment, the singing voice data of a singing video to be posted to a video posting site can be corrected so that it gives listeners a more skilled impression before the video is posted. In addition, in the present embodiment the correction is applied only to voiced sections, so the individuality of the singer remains in the uncorrected sections. Even in the corrected sections, the individuality of the singer does not disappear entirely, because the shape of the spectral envelope around the third formant is maintained before and after the correction. In this way, the present embodiment makes it possible to change the impression of how skilled a singing voice is while preserving the individuality of the singer. Although the present embodiment corrects the amplitudes of both the harmonic and non-harmonic components of the voice, the harmonic and non-harmonic components may instead be separated and only the amplitude of the former corrected, to achieve a stronger effect (that is, to give the listener an even more skilled impression).

(B: Other embodiments)
An embodiment of the present invention has been described above, but the following modifications may of course be applied to it.
(1) The above embodiment described a mode in which the singing voice correction process is executed after the singing, that is, as non-real-time processing with respect to the singing. However, the singing and the singing voice correction process may proceed in parallel, that is, the singing voice correction process may be executed as real-time processing with respect to the singing. Specifically, a microphone may be connected to the external device I/F unit 110 of the signal processing device 10A, and the singing voice data to be processed may be input to the signal processing device 10A through that microphone. In this case, headphones or a speaker for feeding back to the singer either the singing voice represented by that singing voice data (that is, the uncorrected singing voice) or the corrected singing voice may also be connected to the external device I/F unit 110.

(2) The above embodiment described the case in which the correction of the correction step is always applied to every correction target section identified in the identifying step. However, the user may be allowed to select, through an operation on the operation input means or the like, which of the identified correction target sections are to have the amplitude of the frequency components around the third formant corrected (or which are not to be corrected), and the user may also be allowed to specify the degree of correction for each correction target section.

(3) The above embodiment described the case in which the singing voice data is corrected so as to give the listener the impression of skilled singing while preserving the individuality of the singer. However, the singing voice data may instead be corrected so as to give the impression of poor singing. For example, the amplitude of the frequency components around the third formant in the correction target section may be lowered within a range that does not change the shape of the spectral envelope. Deliberately correcting the singing voice to sound less skilled broadens the range of possible effects, for example by emphasizing an amateurish quality.

(4) In the above embodiment, the personal computer used by the poster of a singing video functions as the signal processing device of the present invention. However, the singing voice correction program may instead be installed on the server device of a video posting site so that the server device functions as the signal processing device of the present invention. Also, in the above embodiment the singing voice correction program, which causes the control unit 100 to execute the singing voice correction process that is the distinguishing feature of the present invention, is installed in advance in the non-volatile storage unit 134, but the singing voice correction program may be provided on its own. Furthermore, the identifying means that executes the processing of the identifying step and the correction means that executes the processing of the correction step may each be implemented in hardware such as an electronic circuit, and the signal processing device of the present invention may be configured by combining such hardware.

(5) Each of the above embodiments described an example of applying the present invention to the correction of singing voice data contained in the video data of a singing video to be posted to a video posting site. However, the correction target of the present invention is not limited to singing voice data contained in video data. For example, the present invention may be applied to the correction of singing voice data to be posted to a site that accepts singing voice only. In short, the signal processing device of the present invention need only have identifying means that identifies, from singing voice data representing a singing voice, a voiced section of the singing voice as a correction target section, and correction means that applies to the singing voice data, for the correction target section identified by the identifying means, a correction that raises or lowers the amplitude of the frequency components around the third formant within a range that does not change the shape of the spectral envelope around that third formant.
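Modification (2) above, in which the user chooses which correction target sections to correct and how strongly, could be represented along the following lines. This is a hypothetical sketch: the record layout, the field names, the 0-to-1 degree scale, and the linear interpolation toward a preset peak gain are all assumptions and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class CorrectionSection:
    """One correction target section with per-section user settings (hypothetical)."""
    start_sec: float
    end_sec: float
    enabled: bool = True   # False: the user excluded this section from correction
    degree: float = 1.0    # 0.0 = no change, 1.0 = full preset correction


def effective_peak_gain(section, preset_peak_gain):
    """Scale the preset peak gain by the user's degree for this section."""
    if not section.enabled:
        return 1.0  # a gain of 1 leaves the amplitudes unchanged
    degree = min(max(section.degree, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 + degree * (preset_peak_gain - 1.0)
```

With a preset peak gain below 1, the same interpolation yields the lowering correction of modification (3).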

10A: signal processing device, 100: control unit, 110: external device I/F unit, 120: communication I/F unit, 130: storage unit, 132: volatile storage unit, 134: non-volatile storage unit, 1340A: singing voice correction program, 140: bus.

Claims

A signal processing method comprising:
an identifying step of identifying, from singing voice data representing a singing voice, a voiced section of the singing voice as a correction target section; and
a correction step of applying to the singing voice data, for the correction target section identified in the identifying step, a correction that raises or lowers the amplitude of the frequency components around the third formant within a range that does not change the convex shape of the spectral envelope around the third formant.

A signal processing device comprising:
identifying means for identifying, from singing voice data representing a singing voice, a voiced section of the singing voice as a correction target section; and
correction means for applying to the singing voice data, for the correction target section identified by the identifying means, a correction that raises or lowers the amplitude of the frequency components around the third formant within a range that does not change the convex shape of the spectral envelope around the third formant.