JP6691737B2

JP6691737B2 - Lyrics sound output device, lyrics sound output method, and program

Info

Publication number: JP6691737B2
Application number: JP2015036702A
Authority: JP
Inventors: 啓太郎菅原
Original assignee: Pioneer Corp
Current assignee: Pioneer Corp
Priority date: 2015-02-26
Filing date: 2015-02-26
Publication date: 2020-05-13
Anticipated expiration: 2035-02-26
Also published as: JP2016157086A

Description

The present invention relates to a technique for outputting lyrics information in conjunction with the reproduction of music.

Karaoke devices are known that synthesize lyrics data into voice and output it ahead of the karaoke song being performed (see, for example, Patent Documents 1 and 2).

Patent Document 1: JP-A-4-67467
Patent Document 2: JP-A-10-63274

In a karaoke device, the reproduced music does not contain sung lyrics, so the lyrics voice output by the prior art does not become hard to hear. However, when the user is listening to ordinary music rather than karaoke, lyrics voice output by the prior-art technique may overlap the lyrics sung in the original music and become hard to hear. In addition, when the user is listening to music while driving a vehicle, for example, the lyrics voice output by the prior-art technique may overlap voice messages such as route guidance from an in-vehicle navigation device and become hard to hear.

The above is one example of the problems to be solved by the present invention. An object of the present invention is to provide, in an easily audible form, lyrics voice that lets a user sing along with a song while music containing lyrics is being reproduced.

The invention according to claim 1 is a lyrics voice output device comprising: lyrics data acquisition means for acquiring lyrics data of a music piece being reproduced; lyrics voice data generation means for generating lyrics voice data based on the lyrics data; lyrics voice data shortening means for shortening the time length of the lyrics voice data; and output means for outputting the lyrics voice data, whose time length has been shortened by the lyrics voice data shortening means, ahead of a lyrics portion in the music piece being reproduced so that the output ends at a beat position of the music piece.

The invention according to claim 11 is a lyrics voice output method executed by a terminal device including a computer, the method comprising: a lyrics data acquisition step of acquiring lyrics data of a music piece being reproduced; a lyrics voice data generation step of generating lyrics voice data based on the lyrics data; a lyrics voice data shortening step of shortening the time length of the lyrics voice data; and an output step of outputting the lyrics voice data, whose time length has been shortened by the lyrics voice data shortening means, ahead of a lyrics portion in the music piece being reproduced so that the output ends at a beat position of the music piece.

The invention according to claim 12 is a program executed by a terminal device including a computer, the program causing the computer to function as: lyrics data acquisition means for acquiring lyrics data of a music piece being reproduced; lyrics voice data generation means for generating lyrics voice data based on the lyrics data; lyrics voice data shortening means for shortening the time length of the lyrics voice data; and output means for outputting the lyrics voice data, whose time length has been shortened by the lyrics voice data shortening means, ahead of a lyrics portion in the music piece being reproduced so that the output ends at a beat position of the music piece.

A diagram showing the concept of assist vocal.
A flowchart of the assist vocal processing.
A flowchart of the speech information generation processing.
An overview of the speech information generation processing.
An example of lyrics blocking.
An example of speech insertion methods.
An example of speech emphasis processing.
A configuration according to another example of speech emphasis processing.
A configuration according to another example of speech emphasis processing.
A block diagram showing the overall configuration of the music reproduction system.
A block diagram showing an example of the internal configuration of the terminal device.
A flowchart of the assist vocal processing by the music reproduction system of the first embodiment.
A flowchart of the assist vocal processing by the music reproduction system of the second embodiment.
A flowchart of the assist vocal processing that reproduces only the speech.
A diagram explaining a method of identifying a music piece being reproduced by an external source.

In a preferred embodiment of the present invention, a lyrics voice output device includes: acquisition means for acquiring music specifying information that specifies a music piece being reproduced by an external device; reproduction position determining means for determining a reproduction position of the music piece being reproduced; lyrics data acquisition means for acquiring lyrics data of the music piece being reproduced; lyrics voice data generation means for generating lyrics voice data based on the lyrics data; and output means for outputting the lyrics voice data, based on the reproduction position, ahead of a lyrics portion in the music piece being reproduced.

The above lyrics voice output device identifies the music piece being reproduced by an external device and determines its reproduction position. It also acquires the lyrics data of that music piece and generates lyrics voice data from the lyrics data. Then, based on the reproduction position, it outputs the lyrics voice data ahead of the corresponding lyrics portion in the music piece. As a result, the user can hear the lyrics voice of the music piece being reproduced by the external device and sing along with it.

In one aspect of the above lyrics voice output device, the acquisition means includes: sound collection means for collecting audio data of the music piece being reproduced; transmission means for transmitting the collected audio data to an external server; and reception means for receiving music specifying information of the music piece being reproduced, as identified by the external server from the collected audio data. In this aspect, the music piece is identified by transmitting its audio data to the server and receiving the music specifying information in return.

In another aspect of the above lyrics voice output device, the reception means receives from the external server music reproduction position information indicating the elapsed time, measured from the beginning of the music piece being reproduced, of the audio data that the transmission means transmitted to the external server, and the reproduction position determining means determines the reproduction position based on the music reproduction position information and the time elapsed since the transmission means transmitted the audio data to the external server. In this aspect, the music reproduction position information, which indicates the elapsed time from the beginning of the music piece, is received from the server, and the current reproduction position is determined from it.
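The reproduction position determination described in this aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `current_reproduction_position` and its parameters are hypothetical, and it assumes the server reports the position of the transmitted audio in seconds and that the terminal recorded a monotonic timestamp when it sent the audio.

```python
import time
from typing import Optional


def current_reproduction_position(
    position_at_send_s: float,
    sent_at_s: float,
    now_s: Optional[float] = None,
) -> float:
    """Estimate the current reproduction position in seconds.

    position_at_send_s: elapsed time from the beginning of the music piece
        for the transmitted audio, as reported by the server (assumed unit:
        seconds).
    sent_at_s: local monotonic timestamp taken when the audio was sent.
    now_s: local monotonic timestamp for "now"; defaults to time.monotonic().
    """
    if now_s is None:
        now_s = time.monotonic()
    # Position of the sent audio plus the time that has passed since sending.
    return position_at_send_s + (now_s - sent_at_s)
```

For instance, if the server reports that the transmitted audio was 42.0 seconds into the music piece and 3.5 seconds have passed since transmission, the estimate is 45.5 seconds. Network and recognition latency are absorbed because the elapsed time is counted from the moment of transmission rather than from the moment the response arrives.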

Another aspect of the above lyrics voice output device includes interruption determination means for determining whether reproduction of the music piece has been interrupted, and the output means ends the output of the lyrics voice data when reproduction of the music piece is interrupted. In this aspect, when it is determined that reproduction of the music piece by the external device has been interrupted, output of the lyrics voice data ends automatically.

In another aspect of the above lyrics voice output device, when the acquisition means acquires music specifying information of a music piece different from the one reproduced until then, the output means ends the output of the lyrics voice data. In this aspect, when the music piece reproduced by the external device changes, output of the lyrics voice data ends automatically.

In another aspect of the above lyrics voice output device, when the acquisition means acquires music specifying information of a music piece different from the one reproduced until then, the output means continues output using the lyrics voice data corresponding to that different music piece. In this aspect, when the music piece reproduced by the external device changes, lyrics voice data corresponding to the new music piece is output.

In another preferred embodiment of the present invention, a lyrics voice output method executed by a terminal device including a computer includes: an acquisition step of acquiring music specifying information that specifies a music piece being reproduced by an external device; a reproduction position determining step of determining a reproduction position of the music piece being reproduced; a lyrics data acquisition step of acquiring lyrics data of the music piece being reproduced; a lyrics voice data generation step of generating lyrics voice data based on the lyrics data; and an output step of outputting the lyrics voice data, based on the reproduction position, ahead of a lyrics portion in the music piece being reproduced. With this method as well, the user can hear the lyrics voice of the music piece being reproduced by the external device and sing along with it.

In another preferred embodiment of the present invention, a program executed by a terminal device including a computer causes the computer to function as: acquisition means for acquiring music specifying information that specifies a music piece being reproduced by an external device; reproduction position determining means for determining a reproduction position of the music piece being reproduced; lyrics data acquisition means for acquiring lyrics data of the music piece being reproduced; lyrics voice data generation means for generating lyrics voice data based on the lyrics data; and output means for outputting the lyrics voice data, based on the reproduction position, ahead of a lyrics portion in the music piece being reproduced. The above terminal device can be realized by executing this program on a computer. The program can be stored on a storage medium and handled in that form.

Preferred embodiments of the present invention are described below with reference to the drawings.

[1] Assist Vocal
[1.1] Concept of Assist Vocal
A user who is driving a vehicle and listening to music in the car may want to sing along with the song. However, the user cannot look at lyrics information while driving, and therefore cannot sing the song unless he or she has memorized its lyrics.

In this embodiment, while a music piece containing lyrics is being reproduced, the lyrics contained in the music piece are output as an audio signal to inform the user. Specifically, while a music piece stored in, for example, the memory of the terminal device is being reproduced, the lyrics it contains are output as voice before they are reproduced in the music piece. This allows the user to sing the song being reproduced even while driving. Users other than the driver can also sing the song without looking at a lyrics booklet or the like.

The function of outputting the content of the lyrics as voice to the user, ahead of the timing at which those lyrics are reproduced in the music piece, is called "assist vocal". In this embodiment, the music piece being reproduced is assumed to be an ordinary song containing lyrics, not a karaoke track.

FIG. 1 illustrates the concept of assist vocal. FIG. 1 schematically shows one music piece, with the horizontal axis representing time. The lyrics of one music piece are divided into a plurality of blocks. A portion of the reproduced music piece that contains lyrics is called a "vocal", and a portion other than a vocal is called an "interlude". A music piece therefore normally consists of a plurality of interludes and a plurality of vocals.

In the example of FIG. 1, the music piece consists of three vocals 1 to 3 and a plurality of interludes. The content (lyrics) of vocal 1 is "aiueo", that of vocal 2 is "kakikukeko", and that of vocal 3 is "sashisuseso".

While such a music piece is being reproduced, this embodiment outputs the lyrics "aiueo" corresponding to vocal 1 as voice ahead of the timing at which vocal 1 is reproduced. In this specification, the lyrics voice output by assist vocal is called "speech" to distinguish it from the "vocal" contained in the music piece.

In the example of FIG. 1, speech 1 corresponding to vocal 1 is output ahead of vocal 1. Similarly, speech 2 is output ahead of vocal 2, and speech 3 is output ahead of vocal 3.

A speech outputs only the lyrics of the vocal in the song as an audio signal and basically contains no elements such as pitch or rhythm. As described later, a speech is basically inserted into the interlude preceding the corresponding vocal, so its length is adjusted as necessary and is usually shorter than the time the same lyrics take when reproduced as the vocal. In a typical example, the speech is the lyrics of the corresponding vocal spoken rapidly.

[1.2] Assist Vocal Processing
Next, the assist vocal processing for outputting speeches is described. FIG. 2 is a flowchart of the assist vocal processing. This processing is executed by a terminal device mounted in the vehicle, typically a mobile terminal such as a smartphone; details are given later. The following description assumes that the terminal device executes the processing.

First, the terminal device determines whether assist vocal is turned on (step S1). Assist vocal may be turned on and off either manually by the user or automatically. In the manual case, when the user wants speeches to be reproduced by assist vocal, the user operates a predetermined button or the like to turn assist vocal on, and the terminal device detects this. In the automatic case, the terminal device analyzes the user's voice using, for example, a microphone, and automatically turns assist vocal on when the user is singing or doing something comparable to singing. The automatic setting method is described further below.

If assist vocal is not set to on (step S1: No), the processing ends. If assist vocal is set to on (step S1: Yes), the terminal device identifies the music piece being reproduced (step S2). The music pieces reproduced in the vehicle include those stored inside the terminal device, for example after being downloaded from a server; those stored on a storage medium such as a CD or the memory of an in-vehicle unit; and those reproduced from a radio or the like. When a music piece stored inside the terminal device is being reproduced, the terminal device can easily identify it. On the other hand, when a music piece stored on a storage medium such as a CD is being reproduced, or when a music piece is being reproduced from the radio, the terminal device uses a microphone to collect the music reproduced from the speakers in the vehicle and transmits the resulting audio data to an external music search server. The music search server stores the data of a large number of music pieces in a database, identifies the music piece that matches the audio data received from the terminal device, and transmits information indicating that music piece (for example, the song title and artist name; hereinafter called "music specifying information") to the terminal device. In this way, the terminal device acquires the music specifying information of the music piece currently being reproduced.

Once the music piece being reproduced has been identified in this way, the terminal device executes the speech information generation processing (step S3). FIG. 3 is a flowchart of the speech information generation processing, and FIG. 4 shows an overview of it.

In FIG. 3, the terminal device acquires the lyrics data of the music piece identified in step S2 from an external server or the like (step S31). Here, "lyrics data" is information that defines which lyrics are reproduced at which timing in the music piece. Specifically, it associates lyrics text data, which represents the lyrics contained in the music piece, with reproduction time data, which indicates the time at which those lyrics are reproduced (the elapsed time from the start of the song).
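One possible in-memory shape for such lyrics data is sketched below. The type `LyricEntry`, the function `next_entry`, and the example times are hypothetical and chosen only for illustration; the source does not specify a concrete data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LyricEntry:
    """Lyrics text paired with its reproduction time (seconds from song start)."""
    time_s: float
    text: str


# Illustrative data only; the times are made up for this sketch.
LYRICS_DATA: List[LyricEntry] = [
    LyricEntry(12.0, "aiueo"),
    LyricEntry(20.5, "kakikukeko"),
    LyricEntry(31.0, "sashisuseso"),
]


def next_entry(lyrics: List[LyricEntry], position_s: float) -> Optional[LyricEntry]:
    """Return the first entry that starts after the given position, if any."""
    upcoming = [entry for entry in lyrics if entry.time_s > position_s]
    return min(upcoming, key=lambda entry: entry.time_s) if upcoming else None
```

Looking up the next entry from the current reproduction position is the kind of query a terminal would need in order to prepare voice output ahead of the corresponding lyrics.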

Next, the terminal device acquires music analysis data (step S32). The music analysis data is information indicating musical features of the music piece, such as beat positions and bar positions, and is generated from the audio data of the reproduced music piece. Specifically, the terminal device has a built-in music analysis application; it collects the music reproduced from the vehicle's speakers with a microphone to obtain audio data, and analyzes that audio data to obtain music analysis data such as beat positions. Instead of building the music analysis application into the terminal device, the music analysis data may be obtained using an external music analysis device, a server, or the like.

Next, the terminal device performs lyrics blocking (step S33). Lyrics blocking is the process of dividing the lyrics text data contained in the lyrics data acquired in step S31 into blocks, where one block corresponds to one speech. In other words, lyrics blocking divides the lyrics text data into speech units.

In the example of FIG. 4, the terminal device has acquired "aiueokakikukekosashisuseso" as the lyrics text data and divides it into the three blocks "aiueo", "kakikukeko", and "sashisuseso" to generate block lyrics data.

FIG. 5 shows examples of lyrics blocking. FIG. 5(A) shows the first method. In this method, the portion between one interlude and the next in the music piece forms one block. An "interlude" is a portion of the music piece other than a "vocal". Specifically, when the length It of a section other than a vocal (a non-vocal section) is longer than a predetermined length t1, the terminal device determines that section to be an interlude.

As an exception, however, a plurality of blocks may be combined into one block depending on the length of the interlude. As in the example shown in FIG. 5(B), when the length It2 of interlude 2 immediately preceding vocal 3 is very short relative to the length Vt3 of vocal 3 (It2 < α1 · Vt3, where α1 is an arbitrary coefficient), it is difficult to output the speech for vocal 3 during interlude 2. In such a case, if the length It1 of interlude 1, the interlude one before, is longer than a predetermined length, the terminal device treats vocal 2 and vocal 3 as one block. The speeches corresponding to vocal 2 and vocal 3 are then both output during interlude 1.
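The first blocking method and its exception can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the patented algorithm: the `Vocal` type, the function `block_vocals`, and the parameter `min_prev_interlude` (standing in for the "predetermined length" compared against It1) are hypothetical, vocal sections are assumed to be given with start and end times in seconds, and the intro before the first vocal is treated as an interlude starting at time zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vocal:
    start_s: float
    end_s: float
    text: str


def block_vocals(
    vocals: List[Vocal], t1: float, alpha1: float, min_prev_interlude: float
) -> List[List[Vocal]]:
    """Group vocals into blocks separated by interludes (non-vocal gaps > t1).

    Exception: if the interlude before a vocal is shorter than
    alpha1 * (vocal length) and the interlude before the current block is
    longer than min_prev_interlude, the vocal joins the current block.
    """
    blocks: List[List[Vocal]] = []
    for vocal in vocals:
        if not blocks:
            blocks.append([vocal])
            continue
        current = blocks[-1]
        gap = vocal.start_s - current[-1].end_s
        if gap <= t1:
            # Gap too short to count as an interlude: same block.
            current.append(vocal)
            continue
        # Length of the interlude that precedes the current block.
        if len(blocks) >= 2:
            prev_interlude = current[0].start_s - blocks[-2][-1].end_s
        else:
            prev_interlude = current[0].start_s  # assumption: intro from 0 s
        too_short = gap < alpha1 * (vocal.end_s - vocal.start_s)
        if too_short and prev_interlude > min_prev_interlude:
            current.append(vocal)  # merge, as in the FIG. 5(B) case
        else:
            blocks.append([vocal])
    return blocks
```

With vocals at 10 to 20 s, 30 to 40 s, and 42 to 60 s, `t1 = 1.0`, `alpha1 = 0.5`, and `min_prev_interlude = 5.0`, the 2-second interlude before the third vocal is shorter than half its 18-second length, and the preceding interlude is 10 seconds, so the second and third vocals end up in one block.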

FIG. 5(C) shows the second method. In this method, the terminal device determines each block based on the delimiters contained in the lyrics data. That is, when the lyrics text data contained in the lyrics data already includes delimiter information, the terminal device can divide the lyrics text data into blocks according to those delimiters.
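A minimal sketch of the second method follows, assuming purely for illustration that the delimiter is a newline character; the source does not say how delimiters are encoded, and the function name `split_by_delimiter` is hypothetical.

```python
from typing import List


def split_by_delimiter(lyrics_text: str, delimiter: str = "\n") -> List[str]:
    """Split lyrics text into blocks at each delimiter, dropping empty blocks."""
    return [part.strip() for part in lyrics_text.split(delimiter) if part.strip()]
```

Calling `split_by_delimiter("aiueo\nkakikukeko\nsashisuseso")` yields the three blocks used in the running example.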

Next, the terminal device performs lyrics-to-speech conversion (step S34). The block lyrics data obtained by lyrics blocking is still only text data representing the lyrics; lyrics-to-speech conversion is the process of converting the block lyrics data into audio data. Specifically, the terminal device has built-in text-to-speech (TTS) software and converts each item of block lyrics data obtained in step S33 into audio data. As shown in FIG. 4, speeches 1 to 3, which are audio data, are thereby generated from the respective block lyrics data. Instead of building the TTS software into the terminal device, TTS conversion by an external server or the like may be used.

Next, the terminal device performs the speech length change (step S35). The speech length change is the process of shortening the time length of each speech obtained by lyrics-to-speech conversion so that it can be reproduced in a shorter time. As already described, each speech is reproduced in the interlude preceding the corresponding vocal, but the time length of an interlude is limited, so the speech must be shortened for reproduction. This is why the speech length change is performed.

Basically, the reproduction time of each speech is shortened (its reproduction speed is increased) within the range that a human listener can still follow. For example, let "St" be the time length of each speech obtained in step S34 (called the "original speech length") and "α2" the speech length conversion coefficient. The length "Stv" after the speech length change is then given by
Stv = St · α2   (α2 < 1.0)   (1)
For example, with α2 = 0.7, the speech length change causes each speech to be reproduced in 70% of its original time (the original text describes this as about 30% faster; the exact speed ratio is 1/0.7 ≈ 1.43).
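Equation (1) can be checked numerically with a short sketch. The function names `shortened_length` and `speed_ratio` are hypothetical; only St, α2, and Stv come from the text.

```python
def shortened_length(st: float, alpha2: float = 0.7) -> float:
    """Return Stv = St * alpha2, the speech length after the change (equation (1))."""
    if not 0.0 < alpha2 < 1.0:
        raise ValueError("alpha2 must satisfy 0 < alpha2 < 1.0")
    return st * alpha2


def speed_ratio(alpha2: float) -> float:
    """Return the reproduction speed relative to the original speech."""
    return 1.0 / alpha2
```

For a 10-second original speech and α2 = 0.7, `shortened_length` gives about 7 seconds, and `speed_ratio(0.7)` gives about 1.43.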

In addition to the uniform change described above, the reproduction time of each speech may be shortened further according to the length of its corresponding interlude. In that case, speeches with the same number of characters, or even the same lyric words, will have different reproduction times depending on their position in the song (the length of the preceding interlude).
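One way to combine the uniform change with a per-speech adjustment is sketched below. This is an assumption-laden illustration: the function `fit_to_interlude` and the parameter `min_factor` (a floor meant to keep the speech intelligible) are hypothetical and do not appear in the source, which only states that further shortening may depend on the interlude length.

```python
def fit_to_interlude(
    original_len_s: float,
    interlude_len_s: float,
    alpha2: float = 0.7,
    min_factor: float = 0.5,
) -> float:
    """Return a speech length that tries to fit within the interlude.

    The uniform coefficient alpha2 is applied first. If the result still
    exceeds the interlude, the speech is shortened to the interlude length,
    but never below original_len_s * min_factor (assumed intelligibility floor).
    """
    base = original_len_s * alpha2
    if base <= interlude_len_s:
        return base
    floor = original_len_s * min_factor
    return max(interlude_len_s, floor)
```

If the returned length is still longer than the interlude, the caller would need another strategy, such as merging blocks as in the blocking exception described earlier.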

Next, the terminal device calculates the speech insertion timing (step S36). The terminal device inserts the speech corresponding to a given vocal ahead of the reproduction timing of that vocal. In the example shown in FIG. 4, speech 1 corresponding to vocal 1 is inserted before the reproduction timing of vocal 1. Similarly, speech 2 corresponding to vocal 2 is inserted before the reproduction timing of vocal 2, and speech 3 corresponding to vocal 3 is inserted before the reproduction timing of vocal 3.

FIG. 6 shows specific examples of how a speech is inserted, using the timing at which speech 2 corresponding to vocal 2 is inserted.

In method 1, the speech ends a fixed time before the start timing of the corresponding vocal. Specifically, as shown in FIG. 6, speech 2 is inserted so that it ends a fixed time T2 before the reproduction start timing of vocal 2. The reproduction start timing of speech 2 is then determined by the length of speech 2. In method 1, a fixed time is secured between the end of the speech and the start of the corresponding vocal, so the user can sing the vocal part without being rushed.

In method 2, the end timing of the speech is aligned with a beat position of the music piece. Specifically, in the example of FIG. 6, speech 2 is inserted so that it ends N beats before the reproduction start timing of vocal 2 (N is an arbitrary integer; N = 1 in this example). The reproduction start timing of speech 2 is then determined by the length of speech 2. The beat positions of the music piece are obtained from the music analysis data described above.

方法３では、スピーチの再生開始タイミングと再生終了タイミングの両方を楽曲の拍位置と一致させる。具体的に、図６の例では、スピーチ２の再生開始タイミング及び再生終了タイミングをともに４拍子の３拍目に一致させている。 In method 3, both the reproduction start timing and the reproduction end timing of the speech are matched with beat positions of the music. Specifically, in the example of FIG. 6, the reproduction start timing and the reproduction end timing of speech 2 are both aligned with the third beat of a measure in quadruple time.

方法２、３のように、スピーチの終了タイミング、又は、開始／終了タイミングの両方を楽曲の拍位置と一致させると、スピーチが楽曲と連動するのでユーザが楽曲を歌いやすくなる。 When the end timing of the speech or both the start timing and the end timing of the speech are made to coincide with the beat position of the music as in the methods 2 and 3, the speech is linked to the music, so that the user can easily sing the music.
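The three insertion methods can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the names `SpeechTiming`, `method1`, `method2`, and `method3` are hypothetical, times are seconds from the beginning of the music, and `beats` is assumed to be a sorted list of beat positions taken from the music analysis data. Two further assumptions are made: the vocal start is snapped to the latest beat at or before it, and in method 3 the speech is time-stretched so that it exactly fills the span between the chosen beats.

```python
import bisect
from dataclasses import dataclass


@dataclass
class SpeechTiming:
    start: float  # seconds from the beginning of the music
    end: float


def method1(vocal_start, speech_len, t2):
    """Method 1: the speech ends a fixed time t2 before the vocal starts."""
    end = vocal_start - t2
    return SpeechTiming(end - speech_len, end)


def _beat_before(beats, vocal_start, n):
    """Beat position n beats before the vocal start (assumption: snap to a beat)."""
    i = bisect.bisect_right(beats, vocal_start) - 1
    return beats[max(i - n, 0)]


def method2(vocal_start, speech_len, beats, n=1):
    """Method 2: the speech ends on the beat n beats before the vocal starts."""
    end = _beat_before(beats, vocal_start, n)
    return SpeechTiming(end - speech_len, end)


def method3(vocal_start, speech_len, beats, n=1):
    """Method 3: both start and end fall on beats.

    Returns the timing and the stretch ratio needed for the speech to
    fill the span exactly (the stretching itself is an assumption).
    """
    end = _beat_before(beats, vocal_start, n)
    j = bisect.bisect_right(beats, end - speech_len) - 1
    start = beats[max(j, 0)]
    return SpeechTiming(start, end), (end - start) / speech_len
```

In all three cases the start timing follows from the end timing and the speech length, which matches the statement that the reproduction start timing is determined according to the length of the speech.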

以上のようにして、端末装置は、スピーチの挿入タイミングを決定する。具体的には、各スピーチについて、その再生開始タイミングと再生終了タイミングとを、楽曲の先頭からの経過時間により規定する。各スピーチの再生開始タイミングと再生終了タイミングは、スピーチ情報の一部として記憶される。即ち、スピーチ情報は、各スピーチに対応する音声信号（以下、「スピーチ信号」とも呼ぶ。）と、各スピーチの再生開始タイミング／再生終了タイミングとを含む。 As described above, the terminal device determines the speech insertion timing. Specifically, for each speech, the reproduction start timing and the reproduction end timing are defined by the elapsed time from the beginning of the music. The reproduction start timing and reproduction end timing of each speech are stored as a part of the speech information. That is, the speech information includes an audio signal corresponding to each speech (hereinafter, also referred to as “speech signal”) and a reproduction start timing / reproduction end timing of each speech.

次に、処理は図２に示すメインルーチンに戻り、端末装置は、再生中の楽曲の現在の再生位置を取得する（ステップＳ４）。具体的には、端末装置は、再生中の楽曲の再生開始時刻からの経過時間をカウントすることにより、現在の再生位置を取得する。 Next, the process returns to the main routine shown in FIG. 2, and the terminal device acquires the current reproduction position of the music being reproduced (step S4). Specifically, the terminal device acquires the current reproduction position by counting the elapsed time from the reproduction start time of the music being reproduced.

次に、端末装置は、スピーチ強調処理を行う（ステップＳ５）。スピーチ強調処理は、楽曲に含まれるボーカルと、スピーチとを区別して聞き取り易くする処理であるが、その詳細は後述する。 Next, the terminal device performs a speech enhancement process (step S5). The speech emphasis process is a process for distinguishing vocals included in music and speech to make them easier to hear, and details thereof will be described later.

次に、端末装置は、スピーチ情報に含まれる各スピーチの再生開始タイミング／再生終了タイミングと、現在の再生位置とに基づいて、スピーチを再生する（ステップＳ６）。具体的には、スピーチの再生開始タイミングでスピーチの再生を開始し、スピーチの再生終了タイミングでスピーチの再生を終了する。これにより、楽曲中のボーカルに先行して、対応するスピーチが再生されることになる。 Next, the terminal device reproduces the speech based on the reproduction start timing / reproduction end timing of each speech included in the speech information and the current reproduction position (step S6). Specifically, the speech reproduction is started at the speech reproduction start timing, and the speech reproduction is ended at the speech reproduction end timing. As a result, the corresponding speech is reproduced prior to the vocal in the music.

次に、端末装置は、スピーチの再生を終了すべきか否かを判定する（ステップＳ７）。スピーチの再生を終了すべき場合とは、スピーチ情報が無くなった場合、楽曲の再生自体が終了した場合、ユーザの操作によりアシストボーカルがオフされた場合、などが挙げられる。スピーチの再生を終了すべきでない場合（ステップＳ７：Ｎｏ）、処理はステップＳ４へ戻り、スピーチの再生を継続する。一方、スピーチの再生を終了すべきである場合（ステップＳ７：Ｙｅｓ）、アシストボーカル処理は終了する。 Next, the terminal device determines whether or not to end the reproduction of the speech (step S7). The case where the reproduction of the speech should be ended includes the case where the speech information is lost, the reproduction of the music itself is ended, the case where the assist vocal is turned off by the operation of the user, and the like. If the reproduction of the speech should not be ended (step S7: No), the process returns to step S4 to continue the reproduction of the speech. On the other hand, when the reproduction of the speech should be ended (step S7: Yes), the assist vocal processing ends.
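The loop formed by steps S4 to S7 can be sketched as below. The function names and the event tuples are hypothetical, and the loop is simulated over a list of playback positions rather than a real audio clock; running out of positions stands in for the end of the music.

```python
def speech_to_play(speeches, position):
    """Return the index of the speech whose [start, end) contains position."""
    for i, (start, end) in enumerate(speeches):
        if start <= position < end:
            return i
    return None


def assist_vocal_loop(speeches, positions, assist_on=lambda: True):
    """Simulate steps S4 to S7 over a sequence of playback positions (seconds)."""
    events = []
    active = None
    for pos in positions:                        # S4: current playback position
        if not assist_on() or not speeches:      # S7: conditions for ending
            break
        current = speech_to_play(speeches, pos)  # S6: start/stop by timing
        if current != active:
            if active is not None:
                events.append(("stop", active, pos))
            if current is not None:
                events.append(("start", current, pos))
            active = current
    return events
```

The speech enhancement of step S5 is left out here because it is described separately in section [1.4].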

［１．３］アシストボーカルの自動オン設定方法 [1.3] Method of Automatically Turning On the Assist Vocal
次に、図２に示すアシストボーカル処理のステップＳ１においてアシストボーカルを自動的にオンに設定する方法について説明する。 Next, a method for automatically turning on the assist vocal in step S1 of the assist vocal processing shown in FIG. 2 will be described.

基本的な方法としては、端末装置は、ユーザが発している音声をマイクで集音し、ユーザが楽曲に合わせて歌唱している（歌を歌っている）又は歌唱に準ずる行為を行っていると判定される場合にアシストボーカルを自動的にオンにする。例えば、マイクにより集音した音声データを解析した結果、鼻歌を歌っている、断片的に曲を歌っている、ハミングしているなどと判定される場合には、アシストボーカルをオンにする。一方、音声データが歌唱しているのではなく、同乗者との会話である場合にはアシストボーカルをオンにしない。音声データが鼻歌を歌っている部分を含んでいるような場合でも、大部分が会話であるような場合にもアシストボーカルをオンにはしない。 As a basic method, the terminal device collects the voice uttered by the user with a microphone, and automatically turns on the assist vocal when it determines that the user is singing along with the music or performing an action equivalent to singing. For example, if analysis of the voice data collected by the microphone shows that the user is singing under their breath, singing fragments of the song, or humming, the assist vocal is turned on. On the other hand, when the voice data represents a conversation with a passenger rather than singing, the assist vocal is not turned on. Even when the voice data includes a part in which the user is singing under their breath, the assist vocal is not turned on if most of the data is conversation.

なお、音声データに含まれるユーザの音声が歌唱であるか否かは、音声データに含まれるリズムや音程の有無に基づいて判断することができる。例えばリズムが規則的である場合や音程の変化が大きい場合には歌唱であると判断し、リズムが不規則である場合は音程の変化が小さい場合に歌唱ではない（会話である）と判断することができる。また、前述の楽曲解析アプリケーションを利用し、音声データから拍や小節が抽出できた場合に歌唱であると判断し、抽出できない場合に歌唱ではないと判断してもよい。また、前述の音楽検索サーバ又は音楽検索機能を利用し、音声データから楽曲が特定できた場合に歌唱であると判断し、楽曲が特定できない場合に歌唱ではないと判断してもよい。 Whether or not the user's voice included in the voice data is singing can be determined based on the presence or absence of rhythm and pitch in the voice data. For example, the voice can be judged to be singing when the rhythm is regular or the change in pitch is large, and judged not to be singing (that is, to be conversation) when the rhythm is irregular or the change in pitch is small. Alternatively, using the above-mentioned music analysis application, the voice may be judged to be singing when beats or bars can be extracted from the voice data, and judged not to be singing when they cannot. Further, using the above-described music search server or music search function, the voice may be judged to be singing when a song can be identified from the voice data, and judged not to be singing when no song can be identified.
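The rhythm and pitch criterion can be illustrated with a rough heuristic. Everything concrete here is an assumption made for the sketch: the function name, the use of note onset times and per-note pitch estimates as inputs, the coefficient of variation as a measure of rhythm regularity, and both threshold values. The text only states that regular rhythm or large pitch change suggests singing.

```python
import math
from statistics import mean, pstdev


def looks_like_singing(onset_times, pitches_hz,
                       max_rhythm_cv=0.2, min_pitch_range_semitones=3.0):
    """Judge singing from rhythm regularity or pitch movement.

    onset_times: times (s) at which voiced sounds start (assumed input).
    pitches_hz:  estimated pitch of each sound in Hz (assumed input).
    """
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    if len(intervals) < 2 or len(pitches_hz) < 2:
        return False  # too little data to judge
    # Regular rhythm: inter-onset intervals vary little relative to their mean.
    rhythm_cv = pstdev(intervals) / mean(intervals)
    # Pitch movement: range of the pitches, measured in semitones.
    semitones = [12 * math.log2(p / pitches_hz[0]) for p in pitches_hz]
    pitch_range = max(semitones) - min(semitones)
    return rhythm_cv <= max_rhythm_cv or pitch_range >= min_pitch_range_semitones
```

A practical system would need onset and pitch estimation in front of this function, which is outside the scope of the sketch.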

また、端末装置は、集音した音声データと、再生中の楽曲との相関を算出し、一定値以上の相関がある場合に、ユーザが歌唱していると判断してアシストボーカルをオンにしてもよい。また、端末装置が再生中の曲の歌詞データを既に取得している場合には、マイクにより集音した音声データと歌詞データとの相関が一定値以上である場合に、ユーザが歌っていると判断してもよい。また、歌詞データに基づいて、歌詞が存在しないはずの楽曲の間奏位置においてもユーザの音声が出力されている場合には、それは会話であると判断してもよい。 The terminal device may also calculate the correlation between the collected voice data and the music being played and, when the correlation is equal to or greater than a certain value, determine that the user is singing and turn on the assist vocal. When the terminal device has already acquired the lyrics data of the song being played, it may determine that the user is singing when the correlation between the voice data collected by the microphone and the lyrics data is equal to or greater than a certain value. Further, based on the lyrics data, if the user's voice is present even during an interlude of the music, where no lyrics should exist, the voice may be determined to be conversation.
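One way to picture the correlation test is a normalized (Pearson) correlation between two feature sequences compared against a threshold. The choice of Pearson correlation, the use of energy envelopes as features, the function names, and the threshold of 0.6 are all assumptions of this sketch; the text only requires "a correlation equal to or greater than a certain value".

```python
import math


def normalized_correlation(a, b):
    """Pearson correlation of two sequences, truncated to equal length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def user_is_singing(mic_envelope, music_envelope, threshold=0.6):
    """True when the microphone signal tracks the music closely enough."""
    return normalized_correlation(mic_envelope, music_envelope) >= threshold
```

The same comparison could be applied to a feature sequence derived from the lyrics data, which the text mentions as a second option.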

また、マイクで集音したリズムの情報を利用してもよい。例えば、ユーザが楽曲のリズムに合わせて手や指でステアリングなどを叩いているとか、足で床を踏んでリズムを取っていると判断される場合には、ユーザが歌唱に準ずる行為を行っていると判定し、アシストボーカルをオンにしてもよい。この場合、マイクで集音したリズムと再生中の楽曲のリズムとの相関を算出し、相関が一定値以上である場合にアシストボーカルをオンにしてもよい。また、再生中の楽曲のリズムとの相関を算出しなくても、マイクで集音されたリズムが、一定のリズムの繰り返しになっているような場合には、アシストボーカルをオンにしてもよい。 Alternatively, information on the rhythm collected by the microphone may be used. For example, when it is determined that the user is tapping the steering wheel or the like with a hand or finger in time with the rhythm of the music, or keeping rhythm by tapping a foot on the floor, the terminal device may determine that the user is performing an action equivalent to singing and turn on the assist vocal. In this case, the correlation between the rhythm collected by the microphone and the rhythm of the music being played may be calculated, and the assist vocal may be turned on when the correlation is equal to or greater than a certain value. Even without calculating the correlation with the rhythm of the music being played, the assist vocal may be turned on when the rhythm collected by the microphone is a repetition of a constant rhythm.

さらには、車内を撮影するカメラでユーザの状態を撮影し、ユーザが楽曲に合わせて首を振っているような場合に、アシストボーカルをオンにしてもよい。また、車内を撮影するカメラにより、助手席や後部座席に同乗者がいるか否かを検出し、同乗者の有無により、ユーザが歌っているのか会話しているのかの判定基準を変化させてもよい。 Furthermore, the user's state may be captured by a camera that photographs the vehicle interior, and the assist vocal may be turned on when the user is moving their head in time with the music. The camera may also be used to detect whether a passenger is present in the passenger seat or the rear seats, and the criterion for judging whether the user is singing or talking may be changed according to the presence or absence of a passenger.

また、上記の例では、ユーザが歌唱していると判断した場合に、アシストボーカルをオンにする例を説明したが、ユーザが歌唱していても、ユーザが歌詞を知っていてアシストボーカルを再生する必要がないと判断した場合には、アシストボーカルをオンにしなくてもよい。具体的には、例えば集音した音声データと、再生中の楽曲との相関が一定値以上であり、かつ歌詞データとの相関が一定値以上である場合には、ユーザが歌詞を知っていると判断し、歌唱していてもアシストボーカルをオンにしない。 The above example describes turning on the assist vocal when it is determined that the user is singing. However, even when the user is singing, the assist vocal need not be turned on if it is determined that the user knows the lyrics and the assist vocal does not need to be reproduced. Specifically, for example, when the correlation between the collected voice data and the music being played is equal to or greater than a certain value and the correlation with the lyrics data is also equal to or greater than a certain value, it is determined that the user knows the lyrics, and the assist vocal is not turned on even though the user is singing.

ただしこの場合、ユーザが途中から歌詞が分からなくなる可能性があるため、スピーチ情報を生成し、出力する準備をしておいてもよい。そのあとに、集音した音声データと、再生中の楽曲との相関が一定値未満であり、または歌詞データとの相関が一定値未満である場合には、ユーザは歌詞を知らないと判断し、アシストボーカルを出力する。 In this case, however, the user may lose track of the lyrics partway through the song, so the speech information may be generated in advance and kept ready for output. If the correlation between the collected voice data and the music being played, or the correlation with the lyrics data, subsequently falls below the certain value, it is determined that the user does not know the lyrics, and the assist vocal is output.
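The decision described in the two preceding paragraphs amounts to a small state selection. The sketch below uses hypothetical names (`AssistState`, `decide_assist_state`) and an assumed common threshold of 0.6 for both correlations; the source only refers to "a certain value". The `STANDBY` state stands for the case where speech information is prepared but not output.

```python
from enum import Enum


class AssistState(Enum):
    OFF = "off"          # user is not singing
    STANDBY = "standby"  # user sings and seems to know the lyrics
    ON = "on"            # user sings but needs the lyrics


def decide_assist_state(is_singing, music_corr, lyrics_corr, threshold=0.6):
    """Choose the assist vocal state from the two correlation values."""
    if not is_singing:
        return AssistState.OFF
    if music_corr >= threshold and lyrics_corr >= threshold:
        # The user appears to know the lyrics: keep speech ready, stay silent.
        return AssistState.STANDBY
    # Either correlation fell below the threshold: output the assist vocal.
    return AssistState.ON
```

Calling this function repeatedly during playback lets the state move from `STANDBY` to `ON` as soon as either correlation drops.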

また、上記の例では、アシストボーカルの自動オン設定の方法について説明したが、アシストボーカルの自動オフ設定も行うことができる。アシストボーカルをオンしている間に、ユーザが楽曲に合わせて歌唱していない（歌を歌っていない）又は歌唱に準ずる行為（鼻歌を歌っている、断片的に曲を歌っている、ハミングをしている等）を行っていないと判定された場合に、アシストボーカルを自動的にオフにしてもよい。同様に、会話が検出されたら、アシストボーカルを自動的にオフにしてもよいし、リズムをとっていないと判断されたり、ユーザが楽曲に合わせて頭を振っていないと判断された場合、アシストボーカルを自動的にオフにしてもよい。 The above example describes how to turn the assist vocal on automatically, but the assist vocal can also be turned off automatically. If, while the assist vocal is on, it is determined that the user is neither singing along with the music nor performing an action equivalent to singing (singing under their breath, singing fragments of the song, humming, and so on), the assist vocal may be turned off automatically. Similarly, the assist vocal may be turned off automatically when conversation is detected, or when it is determined that the user is not keeping rhythm or is not moving their head in time with the music.

また、上記の例では、ユーザが歌唱しているもしくは歌唱に準ずる行為をしているか否かに基づき、アシストボーカルの自動オン設定もしくは自動オフ設定を行うことを説明したが、再生されている楽曲の構成に基づき自動オン設定もしくは自動オフ設定してもよい。例えば、楽曲のサビの部分だけ歌唱したいというユーザに対しては、楽曲のサビの部分を再生する際に、アシストボーカルを自動的にオン設定し、楽曲のサビ以外の部分を再生する際に、アシストボーカルを自動的にオフ設定してもよい。逆に、サビの部分は知っていてサビ以外の部分を練習したいというユーザに対しては、楽曲のサビ以外の部分を再生する際に、アシストボーカルを自動的にオン設定し、楽曲のサビの部分を再生する際に、アシストボーカルを自動的にオフ設定してもよい。 The above example describes turning the assist vocal on or off automatically based on whether the user is singing or performing an action equivalent to singing, but the automatic on or off setting may instead be based on the structure of the music being played. For example, for a user who wants to sing only the chorus, the assist vocal may be turned on automatically while the chorus is played and turned off automatically while the other parts are played. Conversely, for a user who knows the chorus and wants to practice the other parts, the assist vocal may be turned on automatically while the parts other than the chorus are played and turned off automatically while the chorus is played.
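Switching by song structure can be sketched as a lookup of the current section. The section list format, the label `"chorus"`, the mode names, and the function name are hypothetical; the section boundaries are assumed to come from the music analysis data.

```python
def assist_on_for_position(sections, position, mode):
    """Decide whether the assist vocal is on at a playback position.

    sections: list of (start_sec, end_sec, label) tuples (assumed format).
    mode: "chorus_only" or "non_chorus_only" (hypothetical user settings).
    """
    label = None
    for start, end, name in sections:
        if start <= position < end:
            label = name
            break
    if label is None:
        return False  # outside any known section
    in_chorus = label == "chorus"
    if mode == "chorus_only":
        return in_chorus
    if mode == "non_chorus_only":
        return not in_chorus
    raise ValueError(f"unknown mode: {mode}")
```

For example, with sections `[(0, 30, "verse"), (30, 60, "chorus")]`, a user in `"chorus_only"` mode hears the assist vocal only from 30 s to 60 s.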

［１．４］スピーチ強調処理 [1.4] Speech Enhancement Processing
次に、図２に示すアシストボーカル処理のステップＳ５において実行されるスピーチ強調処理について説明する。スピーチ強調処理は、ユーザがスピーチとボーカルとを区別して聞き取り易くする方法であり、以下のいくつかの方法を示す。 Next, the speech enhancement processing executed in step S5 of the assist vocal processing shown in FIG. 2 will be described. The speech enhancement processing makes it easier for the user to distinguish the speech from the vocal, and several methods are described below.

［１．４．１］スピーチとボーカルが重なる場合の処理 [1.4.1] Processing When Speech and Vocal Overlap
スピーチは基本的に対応するボーカルの直前の間奏中に再生され、ボーカルとは時間的に重ならないことが好ましい。このために前述のスピーチ長変更処理（ステップＳ３５）を行うのであるが、スピーチの長さと間奏の長さによっては、スピーチ長を短縮してもスピーチを間奏中に再生しきれないこともある。即ち、間奏の長さよりも、スピーチの長さの方が長い場合、スピーチとボーカルとが部分的に重なって再生される。このようにスピーチとボーカルとを重ねて再生することに代えて、以下のいずれかの処理を行ってもよい。 The speech is basically reproduced during the interlude immediately before the corresponding vocal, and preferably does not overlap the vocal in time. The above-described speech length changing process (step S35) is performed for this purpose. However, depending on the length of the speech and the length of the interlude, the speech may not fit within the interlude even after it has been shortened. That is, when the speech is longer than the interlude, the speech and the vocal are reproduced partially overlapping each other. Instead of reproducing the speech and the vocal overlapping in this way, one of the following processes may be performed.

（１）ボーカルのレベルを調整する。 (1) Adjust the vocal level.

スピーチとボーカルとが重なってしまう場合、ボーカルの音量レベルを下げる方法がある。図７（Ａ）は、スピーチの後方部分と、ボーカルの先頭部分とが重なり、重複部分Ｘが生じる場合を示す。この場合、重複部分Ｘにおいてボーカルの音量を調整する。具体的には、ボーカルの音量をスピーチが聞こえる程度まで低下させる、もしくはゼロにする。これにより、重複部分Ｘでは、スピーチの再生が優先され、スピーチが聞き取り易くなる。 If the speech and the vocal overlap, there is a method to lower the volume level of the vocal. FIG. 7A shows a case where the rear part of the speech and the head part of the vocal overlap and an overlapping part X occurs. In this case, the vocal volume is adjusted in the overlapping portion X. Specifically, the volume of the vocal is lowered to a level at which the speech can be heard, or set to zero. As a result, the reproduction of the speech is prioritized in the overlapping portion X, and the speech can be easily heard.

図７（Ｂ）は、逆にスピーチの先頭部分と、１つ前のボーカルの後方部分とが重なり、重複部分Ｘが生じる場合を示す。この場合にも、重複部分Ｘにおいて、ボーカルの音量を調整する。具体的には、ボーカルの音量をスピーチが聞こえる程度まで低下させる、もしくはゼロにする。また、重複部分Ｘにおいて、急にボーカルの音量レベルを下げるのではなく、ボーカルをフェードアウトさせて徐々に音量レベルを下げるようにしてもよい。これにより、重複部分Ｘでは、スピーチの再生が優先され、スピーチが聞き取り易くなる。 On the contrary, FIG. 7B shows a case where the beginning portion of the speech overlaps with the rear portion of the previous vocal, and the overlapping portion X occurs. Also in this case, the vocal volume is adjusted in the overlapping portion X. Specifically, the volume of the vocal is lowered to a level at which the speech can be heard, or set to zero. Further, in the overlapping portion X, the volume level of the vocal may be faded out and the volume level may be gradually reduced, instead of suddenly lowering the volume level of the vocal. As a result, the reproduction of the speech is prioritized in the overlapping portion X, and the speech can be easily heard.

具体的に上記のレベル調整は、楽曲信号においてボーカルの成分と楽器などの演奏の成分とが分離している場合には、ボーカルの成分の音量レベルを低下させればよい。一方、ボーカルの部分が楽器などの演奏の部分と合成されており、ボーカルのみの音量を調整できない場合には、楽曲信号全体の音量レベルを低下させてもよいし、又は、楽曲信号のうち一般的にボーカル（人間の声）に相当する周波数帯域の成分のみ音量レベルを低下させるようにしてもよい。 Specifically, when the vocal component and the instrumental performance component are separated in the music signal, the above level adjustment can be performed by lowering the volume level of the vocal component. On the other hand, when the vocal part is mixed with the instrumental part and the volume of the vocal alone cannot be adjusted, the volume level of the entire music signal may be lowered, or the volume level may be lowered only for the components of the music signal in the frequency band that generally corresponds to vocals (the human voice).

（２）スピーチのレベルを調整する。 (2) Adjust the speech level.

スピーチとボーカルとが重なってしまう場合、逆にスピーチの音量レベルを下げる方法もある。図７（Ｃ）は、スピーチの後方部分と、ボーカルの先頭部分とが重なり、重複部分Ｘが生じる場合を示す。この場合、重複部分Ｘにおいて、スピーチの音量を調整する。具体的には、スピーチの音量を低下させる、もしくはゼロにする。急にスピーチの音量を下げるのではなく、スピーチをフェードアウトさせて徐々に音量を下げるようにしてもよい。この場合、重複部分Ｘでは、スピーチが聞き取れなくなるが、一般的にユーザがある程度知っている楽曲を聞く場合には、歌詞の全てを覚えてはいないものの、歌詞の先頭部分がわかれば、その後は歌詞を思い出して歌うことができるということも多い。よって、図７（Ｃ）のように、スピーチの先頭部分が聞き取れれば、スピーチの後方部分が聞き取りにくくなっても構わないということも多い。この手法はそのような場合に有効である。 If the speech and the vocal overlap, there is also a method of lowering the volume level of the speech. FIG. 7C shows a case where the rear part of the speech overlaps with the head part of the vocal, and the overlapping part X occurs. In this case, the volume of speech is adjusted in the overlapping portion X. Specifically, the volume of speech is reduced or set to zero. Instead of abruptly lowering the volume of the speech, the volume of the speech may be faded out and gradually reduced. In this case, the speech cannot be heard at the overlapping portion X, but generally when the user listens to a piece of music that he / she knows to some extent, although he / she does not remember all the lyrics, if the beginning portion of the lyrics is known, after that, Often you can remember the lyrics and sing. Therefore, as shown in FIG. 7C, if the head portion of the speech can be heard, it is often acceptable that the back portion of the speech becomes hard to hear. This method is effective in such cases.
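Both level adjustments, (1) lowering the vocal and (2) lowering the speech, reduce to finding the overlapping portion X and applying a gain to one of the two signals inside it. The sketch below uses hypothetical function names; `floor` is the reduced gain inside X (0.0 silences the signal) and `fade` is an assumed linear fade-out duration in seconds at the start of X.

```python
def overlap(speech, vocal):
    """Return the overlapping portion X of two (start, end) intervals, or None."""
    start = max(speech[0], vocal[0])
    end = min(speech[1], vocal[1])
    return (start, end) if start < end else None


def duck_gain(t, region, floor=0.0, fade=0.0):
    """Gain for the signal being lowered at time t.

    Outside the overlapping portion the gain is 1.0. Inside it, the gain
    drops to floor, optionally ramping down linearly over fade seconds.
    """
    if region is None:
        return 1.0
    start, end = region
    if t < start or t >= end:
        return 1.0
    if fade > 0 and t < start + fade:
        return 1.0 - (1.0 - floor) * (t - start) / fade
    return floor
```

Applying `duck_gain` to the vocal (or to the whole music signal) corresponds to method (1); applying it to the speech corresponds to method (2).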

［１．４．２］スピーチとボーカルを異なる方向から聞かせる処理 [1.4.2] Processing for Presenting Speech and Vocal from Different Directions
人間には、同時に異なる方向から到来する音を聞き分ける能力がある（いわゆるカクテルパーティ効果）。これを利用し、ユーザがスピーチとボーカルとを聞き分けることができるようにする手法が考えられる。なお、この手法は、スピーチとボーカルとが時間的に重なるか否かに拘わらず実行される。 Humans have the ability to distinguish sounds that arrive simultaneously from different directions (the so-called cocktail party effect). This can be exploited to let the user tell the speech and the vocal apart. Note that this technique is applied regardless of whether the speech and the vocal overlap in time.

（１）左右のスピーカで位相を調整する方法 (1) Method of adjusting the phase between the left and right speakers
図８（Ａ）は、左右のスピーカから出力されるスピーチの位相を反転させる構成を示す。左（Ｌ）チャンネルの楽曲信号は加算器３２に供給され、右（Ｒ）チャンネルの楽曲信号は加算器３３に供給される。一方、スピーチ信号は、そのまま加算器３３に供給されるとともに、位相反転器３１で位相が反転されて加算器３２に供給される。加算器３２の出力は左スピーカ３０Ｌに供給され、加算器３３の出力は右スピーカ３０Ｒに供給される。 FIG. 8A shows a configuration in which the phase of the speech output from the left and right speakers is inverted. The music signal of the left (L) channel is supplied to the adder 32, and the music signal of the right (R) channel is supplied to the adder 33. The speech signal, on the other hand, is supplied to the adder 33 as it is, and is also supplied to the adder 32 after its phase is inverted by the phase inverter 31. The output of the adder 32 is supplied to the left speaker 30L, and the output of the adder 33 is supplied to the right speaker 30R.

この構成によれば、ボーカルを含む楽曲の音像は左右スピーカの間に定位するのに対し、スピーチの音像はユーザの耳回りに定位することになり、ユーザはスピーチと楽曲中のボーカルとを聞き分けやすくなる。なお、図８（Ａ）の例では、位相反転器３１により左スピーカ３０Ｌに供給されるスピーチ信号の位相のみを反転しているが、逆に右スピーカ３０Ｒに供給されるスピーチ信号の位相のみを反転させてもよい。また、左右のスピーカに供給されるスピーチ信号の間に一定の位相差があればスピーチの音像位置と楽曲の音像位置とを異ならせることができるので、一方のスピーカに供給されるスピーチ信号を必ずしも反転（１８０°変化）させる必要はない。即ち、一方のスピーカに供給されるスピーチ信号と、他方のスピーカに供給されるスピーチ信号との間に一定の位相差を与えてやればよい。 With this configuration, the sound image of the music including the vocal is localized between the left and right speakers, whereas the sound image of the speech is localized around the user's ears, so the user can more easily distinguish the speech from the vocal in the music. In the example of FIG. 8A, only the phase of the speech signal supplied to the left speaker 30L is inverted by the phase inverter 31, but only the phase of the speech signal supplied to the right speaker 30R may be inverted instead. Further, as long as there is a certain phase difference between the speech signals supplied to the left and right speakers, the sound image position of the speech can be made different from that of the music, so the speech signal supplied to one speaker does not necessarily have to be inverted (shifted by 180°). That is, it suffices to give a certain phase difference between the speech signal supplied to one speaker and the speech signal supplied to the other speaker.
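The signal flow of FIG. 8A can be written as a sample-wise mix. The function name is hypothetical and the three inputs are assumed to be equal-length lists of samples; only the 180° inversion case is shown, since an arbitrary phase difference would need an additional phase-shifting filter.

```python
def mix_with_inverted_speech(music_left, music_right, speech):
    """Mix speech into stereo music with opposite polarity per channel."""
    # Adder 32: left music plus the speech inverted by phase inverter 31.
    left = [m - s for m, s in zip(music_left, speech)]
    # Adder 33: right music plus the speech as it is.
    right = [m + s for m, s in zip(music_right, speech)]
    return left, right
```

Summing the two outputs cancels the speech and leaves only the music, which is one way to check that the speech components are in opposite phase.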

（２）音像の定位を制御する方法 (2) Method of controlling sound image localization
図８（Ｂ）は、スピーチの音像を任意の位置に設定可能な構成を示す。左（Ｌ）チャンネルの楽曲信号は加算器３２に供給され、右（Ｒ）チャンネルの楽曲信号は、加算器３３に供給される。一方、スピーチ信号は、音像定位制御演算部３４、クロストークキャンセル部３５を経由して加算器３２、３３に供給される。音像定位制御演算部３４は、目標のスピーカ位置と聴取位置（ユーザの位置）との間の伝達関数をスピーチ信号に畳み込み、クロストークキャンセル部３５は楽曲を出力しているスピーカと聴取位置との間の伝達関数をキャンセルする処理を行う。これにより、楽曲の音像は左右のスピーカ３０Ｌ、３０Ｒの間に定位させるとともに、スピーチの音像を目標のスピーカ位置に定位させることができるので、ユーザはスピーチとボーカルとを聞き分けやすくなる。 FIG. 8B shows a configuration in which the sound image of the speech can be set at an arbitrary position. The music signal of the left (L) channel is supplied to the adder 32, and the music signal of the right (R) channel is supplied to the adder 33. The speech signal, on the other hand, is supplied to the adders 32 and 33 via the sound image localization control calculation unit 34 and the crosstalk cancellation unit 35. The sound image localization control calculation unit 34 convolves the speech signal with the transfer function between the target speaker position and the listening position (the user's position), and the crosstalk cancellation unit 35 performs processing to cancel the transfer function between the speakers outputting the music and the listening position. As a result, the sound image of the music is localized between the left and right speakers 30L and 30R while the sound image of the speech is localized at the target speaker position, which makes it easier for the user to distinguish the speech from the vocal.
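The convolution performed by the sound image localization control calculation unit 34 can be illustrated in the time domain, where a transfer function is represented by its impulse response. The function names and the per-ear impulse responses are assumptions of this sketch, and the crosstalk cancellation of unit 35 is deliberately not shown.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def localize_speech(speech, ir_left, ir_right):
    """Apply the target-position impulse responses for the left and right ears.

    ir_left and ir_right are assumed to be measured or modelled impulse
    responses from the target speaker position to each ear.
    """
    return convolve(speech, ir_left), convolve(speech, ir_right)
```

A practical implementation would normally use FFT-based convolution for long impulse responses; the direct form is used here only to keep the example self-contained.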

（３）ヘッドレストスピーカを利用する方法 (3) Method of using headrest speakers
車両のスピーカに加えて車両のシートにヘッドレストスピーカが搭載されている場合、車両のスピーカからボーカルを含む楽曲を出力し、ヘッドレストスピーカからスピーチを出力することができる。この場合の構成例を図９に示す。 When headrest speakers are mounted on the vehicle seats in addition to the vehicle speakers, the music including the vocal can be output from the vehicle speakers and the speech can be output from the headrest speakers. A configuration example for this case is shown in FIG. 9.

左右チャンネルの楽曲信号はそれぞれ車両のスピーカ３０Ｌ、３０Ｒに供給される。また、スピーチ信号は、そのまま右のヘッドレストスピーカ３５Ｒに供給されるとともに、位相反転器３１で位相が反転されて左のヘッドレストスピーカ３５Ｌに供給される。この場合も、２つのヘッドレストスピーカ３５Ｌ、３５Ｒに供給されるスピーチ信号に位相差が与えられているため、スピーチの音像は楽曲の音像と異なる位置に定位し、ユーザはスピーチと楽曲中のボーカルとを聞き分けやすくなる。なお、この例においても、図８（Ａ）の例と同様に、一方のヘッドレストスピーカに供給されるスピーチ信号と、他方のヘッドレストスピーカに供給されるスピーチ信号との間に一定の位相差を与えてやればよい。 The music signals of the left and right channels are supplied to the vehicle speakers 30L and 30R, respectively. The speech signal is supplied as it is to the right headrest speaker 35R, and is also supplied to the left headrest speaker 35L after its phase is inverted by the phase inverter 31. In this case as well, because a phase difference is given to the speech signals supplied to the two headrest speakers 35L and 35R, the sound image of the speech is localized at a position different from that of the music, and the user can more easily distinguish the speech from the vocal in the music. In this example too, as in the example of FIG. 8A, it suffices to give a certain phase difference between the speech signal supplied to one headrest speaker and the speech signal supplied to the other headrest speaker.

ヘッドレストスピーカを利用する場合には、運転席のヘッドレストスピーカの代わりに、助手席のヘッドレストスピーカを利用してスピーチを再生してもよい。また、車両の複数の座席にヘッドレストスピーカが搭載されている場合には、各座席毎にスピーチの再生の要否を選択して設定できるようにしてもよい。こうすると、スピーチを聞いて楽曲を歌いたい搭乗者の座席のヘッドレストスピーカのみからスピーチが再生されるように設定することができる。 When the headrest speaker is used, the headrest speaker in the passenger seat may be used instead of the headrest speaker in the driver's seat to reproduce the speech. Further, when the headrest speaker is installed in a plurality of seats of the vehicle, the necessity of speech reproduction may be selected and set for each seat. By doing so, it is possible to set that the speech is reproduced only from the headrest speaker of the seat of the passenger who wants to hear the speech and sing a song.
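Per-seat selection can be pictured as a simple settings lookup. The seat identifiers, the settings dictionary, and the function name are all hypothetical and only illustrate the idea of enabling speech for individual seats.

```python
def seats_receiving_speech(seat_settings, seats_with_headrest_speakers):
    """Return the seats whose headrest speakers should reproduce the speech.

    seat_settings maps a seat name to True when its occupant wants the
    speech; seats without an explicit setting default to off.
    """
    return [seat for seat in seats_with_headrest_speakers
            if seat_settings.get(seat, False)]
```

For example, with `{"driver": False, "passenger": True}` only the passenger seat's headrest speakers would reproduce the speech.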

また、位相差を与えることに代えて、図８（Ｂ）で説明した処理と同様に、音像定位制御演算部３４と、クロストークキャンセル部３５とを用いることで、スピーチの音像を任意の位置に定位させてもよい。これにより、ユーザがスピーチとボーカルとを聞き分けやすくすることができる。 Instead of giving a phase difference, the sound image of the speech may be localized at an arbitrary position by using the sound image localization control calculation unit 34 and the crosstalk cancellation unit 35, in the same manner as the processing described with reference to FIG. 8B. This makes it easier for the user to distinguish the speech from the vocal.

［２］システム構成 [2] System Configuration
次に、上述のアシストボーカルを実現する楽曲再生システムの構成例を説明する。 Next, a configuration example of a music reproduction system that realizes the above-described assist vocal will be described.

［２．１］第１実施例 [2.1] First Embodiment
第１実施例では、アシストボーカル処理を主として端末装置側で実行する。第１実施例による楽曲再生システムの全体構成を図１０に示す。第１実施例の楽曲再生システムでは、複数の車両１と、コンテンツプロバイダ２と、ゲートサーバ３とがネットワーク４を介して通信可能とされる。なお、複数の車両１は、無線通信によりネットワーク４を介してコンテンツサーバ２、ゲートサーバ３と通信可能となっている。 In the first embodiment, the assist vocal processing is executed mainly on the terminal device side. FIG. 10 shows the overall configuration of the music reproduction system according to the first embodiment. In the music reproduction system of the first embodiment, a plurality of vehicles 1, a content provider 2, and a gate server 3 can communicate with one another via a network 4. The plurality of vehicles 1 can communicate with the content server 2 and the gate server 3 via the network 4 by wireless communication.

コンテンツプロバイダ２は、音楽配信業者などのサーバであり、楽曲データ、楽曲のメタデータ、歌詞データなどを提供する。ゲートサーバ３は、本実施例によるアシストボーカルを実現するために機能するサーバであり、コンテンツプロバイダ２から必要な楽曲の楽曲データ、メタデータ、歌詞データなどを取得して、図示しないデータベースに記憶している。 The content provider 2 is a server of a music distributor or the like, and provides music data, music metadata, lyrics data, and so on. The gate server 3 is a server that functions to realize the assist vocal according to the present embodiment; it acquires the music data, metadata, lyrics data, and so on of the necessary songs from the content provider 2 and stores them in a database (not shown).

車両１の内部構成の一例を図１１（Ａ）に示す。車両１は、端末装置１０と、音楽再生装置２０と、スピーカ３０とを備える。 An example of the internal configuration of the vehicle 1 is shown in FIG. 11A. The vehicle 1 includes a terminal device 10, a music reproducing device 20, and a speaker 30.

端末装置１０は、典型的にはスマートフォンなどの携帯端末であり、通信部１１と、制御部１２と、記憶部１３と、マイク１４と、操作部１５とを備える。通信部１１は、ネットワーク４を通じてゲートサーバ３と通信する。制御部１２は、ＣＰＵなどからなり、端末装置１０の全体を制御する。 The terminal device 10 is typically a mobile terminal such as a smartphone, and includes a communication unit 11, a control unit 12, a storage unit 13, a microphone 14, and an operation unit 15. The communication unit 11 communicates with the gate server 3 via the network 4. The control unit 12 includes a CPU and controls the entire terminal device 10.

記憶部１３は、ＲＯＭ、ＲＡＭなどのメモリであり、制御部１２が各種の処理を実行するためのプログラムを記憶するとともに、ワークメモリとしても機能する。記憶部１３に記憶されたプログラムを制御部１２が実行することにより、アシストボーカル処理を含む処理が実行される。また、記憶部１３は、ユーザが保存した楽曲の楽曲データを記憶していてもよい。 The storage unit 13 is a memory such as a ROM and a RAM, stores a program for the control unit 12 to execute various processes, and also functions as a work memory. When the control unit 12 executes the program stored in the storage unit 13, the processing including the assist vocal processing is executed. Further, the storage unit 13 may store the music data of the music saved by the user.

マイク１４は、車内で再生されている楽曲、ユーザによる歌唱、会話などの音声を集音して音声データを生成する。操作部１５は、典型的にはタッチパネルなどであり、ユーザによる操作、選択の入力を受け付ける。 The microphone 14 collects voices such as music played in the vehicle, singing by the user, and conversation, and generates voice data. The operation unit 15 is typically a touch panel or the like, and receives an input of operation and selection by the user.

音楽再生装置２０は、例えばカーオーディオなどであり、アンプなどを含む。スピーカ３０は、車両に搭載されたスピーカである。音楽再生装置２０は、端末装置１０から供給される楽曲データに基づいて楽曲をスピーカ３０から再生する。 The music reproducing device 20 is, for example, a car audio device, and includes an amplifier. The speaker 30 is a speaker mounted on the vehicle. The music reproducing device 20 reproduces the music from the speaker 30 based on the music data supplied from the terminal device 10.

車両１の内部構成の他の例を図１１（Ｂ）に示す。この例では、車両１は端末装置１０ｘを備える。端末装置１０ｘは、図１１（Ａ）に示す携帯端末などの端末装置１０とカーオーディオなどの音楽再生装置２０の機能を併せ持つ装置である。端末装置１０ｘは、端末装置１０と同様に通信部１１、制御部１２、記憶部１３、マイク１４、操作部１５を備えるとともに、音楽再生装置２０に相当する音楽再生部１６を備える。端末装置１０ｘはスピーカ３０に接続され、楽曲データに基づいて楽曲をスピーカ３０から再生する。 Another example of the internal configuration of the vehicle 1 is shown in FIG. 11B. In this example, the vehicle 1 includes a terminal device 10x. The terminal device 10x combines the functions of the terminal device 10, such as the mobile terminal shown in FIG. 11A, and the music reproducing device 20, such as a car audio system. Like the terminal device 10, the terminal device 10x includes a communication unit 11, a control unit 12, a storage unit 13, a microphone 14, and an operation unit 15, and it also includes a music reproduction unit 16 corresponding to the music reproducing device 20. The terminal device 10x is connected to the speaker 30 and reproduces music from the speaker 30 based on the music data.

次に、第１実施例の楽曲再生システムによるアシストボーカル処理について説明する。図１２は、第１実施例に係るアシストボーカル処理のフローチャートである。この処理では、アシストボーカル処理を主として端末装置１０又は１０ｘ（以下、代表して単に「端末装置１０」と記す。）により実行する。 Next, the assist vocal processing by the music reproduction system of the first embodiment will be described. FIG. 12 is a flowchart of the assist vocal processing according to the first embodiment. In this processing, the assist vocal processing is mainly executed by the terminal device 10 or 10x (hereinafter, simply referred to as "terminal device 10").

まず、ゲートサーバ３は、ネットワーク４を介してコンテンツプロバイダ２に接続し、複数の楽曲について、楽曲データ及び歌詞データを取得し、内部のデータベースに保存しておく（ステップＳ１０１）。 First, the gate server 3 connects to the content provider 2 via the network 4, acquires song data and lyrics data for a plurality of songs, and stores them in an internal database (step S101).

端末装置１０は、ユーザによる操作部１５の操作により、再生すべき楽曲の指定を受け取り（ステップＳ１０２）、その楽曲を指定する楽曲指定情報をゲートサーバ３へ送信する（ステップＳ１０３）。ゲートサーバ３は、受け取った楽曲指定情報に対応する楽曲の楽曲データ及び歌詞データをデータベースから取得し、端末装置１０へ送信する（ステップＳ１０４）。 The terminal device 10 receives the designation of the music piece to be reproduced by the user operating the operation unit 15 (step S102), and transmits the music piece designation information designating the music piece to the gate server 3 (step S103). The gate server 3 acquires the music data and the lyrics data of the music corresponding to the received music designation information from the database and transmits the music data and the lyrics data to the terminal device 10 (step S104).

次に、端末装置１０は、受信した楽曲データ及び歌詞データを利用して、ステップＳ１０５〜Ｓ１０９の処理を行う。ここで、ステップＳ１０５〜Ｓ１０９の処理は、図２におけるステップＳ３〜Ｓ７と同様であるので、説明を省略する。 Next, the terminal device 10 performs the processes of steps S105 to S109 using the received music data and lyrics data. The processes of steps S105 to S109 are the same as steps S3 to S7 in FIG. 2, so their description is omitted.
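The exchange in steps S101 to S104 can be mocked with two small classes. This is only an in-memory illustration: the class and method names are hypothetical, no real network protocol is implied, and the catalog contents are placeholder values.

```python
class GateServer:
    """Stands in for gate server 3 with its internal database."""

    def __init__(self, catalog):
        # S101: music data and lyrics data obtained from the content provider.
        self._database = dict(catalog)

    def fetch(self, song_id):
        # S104: look up and return (music_data, lyrics_data) for the song.
        return self._database[song_id]


class TerminalDevice:
    """Stands in for terminal device 10."""

    def __init__(self, server):
        self._server = server

    def request_song(self, song_id):
        # S102: the user designates a song; S103: send the designation.
        music_data, lyrics_data = self._server.fetch(song_id)
        # S105 onward (speech generation and playback) would start here.
        return music_data, lyrics_data
```

In the first embodiment everything after the fetch runs on the terminal device, which is why the server class here does nothing beyond storage and lookup.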

こうして、第１実施例の楽曲再生システムにおいては、車両１に搭載された端末装置１０が主としてアシストボーカル処理を実行する。 Thus, in the music reproduction system of the first embodiment, the terminal device 10 mounted on the vehicle 1 mainly executes the assist vocal processing.

上記の例では、ステップＳ１０１でゲートサーバ３はコンテンツプロバイダから楽曲データを取得しているが、楽曲データが端末装置１０に保存されている場合には、ゲートサーバ３は端末装置１０から楽曲データを取得してもよい。また、ゲートサーバ３内のデータベースに楽曲データが保存されている場合には、そこから楽曲データを取得してもよい。 In the above example, the gate server 3 acquires the music data from the content provider in step S101, but when the music data is stored in the terminal device 10, the gate server 3 may acquire the music data from the terminal device 10. When the music data is already stored in the database in the gate server 3, the music data may be acquired from that database.

[2.2] Second Embodiment
In the second embodiment, a part of the assist vocal processing is executed on the gate server 3 side. The overall configuration of the music reproduction system according to the second embodiment is the same as that of the first embodiment shown in FIG. 10, and its description is omitted.

Next, the assist vocal processing by the music reproduction system of the second embodiment will be described. FIG. 13 is a flowchart of the assist vocal processing according to the second embodiment. In this processing, the gate server 3 generates speech information, then generates music data with speech and transmits it to the terminal device 10. The terminal device 10 receives and reproduces the music data with speech. The details are described below.

First, the gate server 3 connects to the content provider 2 via the network 4, acquires music data and lyrics data for a plurality of songs, and stores them in its internal database (step S201). Then, for each song, the gate server 3 generates speech information based on the acquired music data and lyrics data (step S202). This speech information generation processing is the same as step S3 in FIG. 2, and its description is omitted.

Having generated the speech information, the gate server 3 adds speech to the music data to generate music data with speech (step S203). Specifically, based on the generated speech information, the gate server 3 combines the speech signal corresponding to each speech with the music data at the timing calculated in step S36 of FIG. 3, thereby generating the music data with speech, and stores it in the database. In other words, music data with speech is data that, when reproduced as is, plays the speech in addition to the song.
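A minimal Python sketch of step S203 is shown below. The description does not state how the speech signal is combined with the music data, so simple additive mixing with clipping to the range [-1, 1] is an assumption here, as are the function name `mix_speech_into_music` and the representation of audio as lists of float samples. The start times are assumed to be the timings already calculated in step S36.

```python
def mix_speech_into_music(music, speeches, sample_rate):
    """Return a copy of `music` with each speech signal mixed in.

    music:       list of float samples in [-1.0, 1.0]
    speeches:    list of (start_seconds, samples) pairs; start_seconds is the
                 non-negative timing assumed to come from step S36
    sample_rate: samples per second
    """
    mixed = list(music)  # leave the original music data untouched
    for start_seconds, samples in speeches:
        offset = int(round(start_seconds * sample_rate))
        for i, sample in enumerate(samples):
            pos = offset + i
            if pos >= len(mixed):
                break  # speech running past the end of the song is dropped
            # Additive mix, clipped so the result stays in the valid range.
            mixed[pos] = max(-1.0, min(1.0, mixed[pos] + sample))
    return mixed
```

Because the result is an ordinary audio buffer, the terminal side needs no speech-specific logic: playing the buffer plays both the song and the speech.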

The terminal device 10 receives, through the user's operation of the operation unit 15, a designation of the song to be reproduced (step S204), and transmits song designation information designating that song to the gate server 3 (step S205). The gate server 3 transmits the music data with speech of the song corresponding to the received song designation information to the terminal device 10 (step S206).

Next, the terminal device 10 reproduces the received music data with speech (step S207). As a result, the speech is reproduced at appropriate timings during reproduction of the song. Next, the terminal device 10 determines whether or not reproduction of the song should be ended (step S208). When reproduction should be ended, for example because the song has been reproduced to the end or the user has stopped reproduction (step S208: Yes), the terminal device 10 ends reproduction. On the other hand, when reproduction of the song should not be ended (step S208: No), the processing returns to step S207 and reproduction of the music data with speech continues.

Thus, in the music reproduction system of the second embodiment, the music data with speech is generated on the gate server 3 side and provided to the terminal device 10. By reproducing the received music data with speech, the terminal device 10 lets the user hear the song including the speech.

In the above example, the gate server 3 acquires the music data from the content provider in step S201. However, when the music data is stored in the terminal device 10, the gate server 3 may acquire the music data from the terminal device 10. Further, when the music data is already stored in the database in the gate server 3, the music data may be acquired from there.

[3] Assist Vocal That Reproduces Only Speech
In the assist vocal processing described above, speech is added to a song that the terminal device 10 itself is reproducing. However, it would be convenient if speech could also be added to a song reproduced from a source other than the terminal device 10, for example an in-vehicle radio or CD player (hereinafter referred to as an "external source"). In this case, the terminal device 10 basically generates the speech information by the method described above and reproduces only the speech, at timings that correspond to the reproduction position of the song being reproduced from the external source.

FIG. 14 shows a flowchart of the assist vocal processing in this case. First, the terminal device 10 collects the song being reproduced from the external source with the microphone 14 to acquire reproduced music data (step S151), and transmits it to the gate server 3 (step S152).

The gate server 3 receives the reproduced music data from the terminal device 10 and identifies the corresponding song and its reproduction position (step S153). Specifically, the gate server 3 includes a music search unit having the function of the music search server described above, and, based on the reproduced music data, identifies the song and the reproduction position to which that portion of the reproduced music data corresponds. The gate server 3 then transmits the lyrics data and the reproduction position information, together with the song name and artist name of the identified song, to the terminal device 10 (step S154).

The terminal device 10 generates speech information using the received lyrics data (step S155). The speech information is generated by the same method as described with reference to FIG. 3. The terminal device 10 can obtain the music analysis data by analyzing the reproduced music data acquired with the microphone 14 (the process of step S32 in FIG. 3).

Next, the terminal device 10 calculates the current reproduction position in the song based on the reproduction position information acquired from the gate server 3 (step S156). The method is described later. Next, the terminal device 10 performs speech enhancement processing (step S157) and reproduces the speech at appropriate timings in accordance with the song being reproduced by the external source (step S158). As a result, the speech is reproduced in time with the song being reproduced from the external source.

The terminal device 10 then determines whether or not reproduction of the speech should be ended (step S159). If it should not be ended, the processing returns to step S156 and continues. On the other hand, when reproduction of the speech should be ended, for example because reproduction of the song from the external source has finished, the song being reproduced has changed to a different song, or no speech remains to be reproduced (step S159: Yes), the processing ends.
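The loop over steps S156 to S159 can be sketched in Python as below. This is a simplified illustration, not the processing of the embodiment itself: the function name `run_speech_loop` is hypothetical, the current reproduction positions are supplied as a precomputed sequence instead of being measured, the speech enhancement of step S157 is omitted, and only two of the end conditions (song finished, no speech left) are modeled.

```python
def run_speech_loop(speech_starts, positions, song_length):
    """Simulate the S156-S159 loop and return the speeches that were played.

    speech_starts: sorted start times (seconds) taken from the speech information
    positions:     successive current reproduction positions (one per pass, S156)
    song_length:   length of the song in seconds
    """
    pending = list(speech_starts)
    played = []
    for position in positions:
        # S159 (part 1): the external source has finished the song.
        if position >= song_length:
            break
        # S158: play every speech whose start time has been reached.
        while pending and pending[0] <= position:
            played.append(pending.pop(0))
        # S159 (part 2): no speech remains to be reproduced.
        if not pending:
            break
    return played
```

A change of song, the third end condition named in the text, would be detected from the identification result returned by the gate server 3 and is left out here.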

Next, with reference to FIG. 15, the method of determining the current reproduction position of the song in step S156 will be described. The reproduced music data transmitted from the terminal device 10 to the gate server 3 actually consists of data of a plurality of audio frames. That is, the terminal device 10 collects the song being reproduced by the external source with the microphone 14 and sequentially transmits it to the gate server 3 as a plurality of audio frames.

In the example of FIG. 15, the terminal device 10 sequentially transmits the audio frames n, (n+1), (n+2), ... of the song being reproduced by the external source to the gate server 3 as reproduced music data. At this time, the terminal device 10 stores the time at which it first transmitted reproduced music data, which in the example of FIG. 15 is the time at which audio frame n was transmitted (hereinafter referred to as the "reference time t0").

The music search unit of the gate server 3 refers to information on the many songs stored in the database and identifies the song based on the plurality of received audio frames. In the example of FIG. 15, it is assumed that the music search unit of the gate server 3 was able to identify the song based on audio frames n to (n+4). In this case, the gate server 3 transmits to the terminal device 10, as the song determination result, the song name, artist name, and the like, together with the reproduction time (tn), measured from the beginning of the song, of the audio frame n first received from the terminal device 10, as the reproduction position information. That is, the reproduction position information transmitted from the gate server 3 to the terminal device 10 in step S154 of FIG. 14 is the elapsed time, from the beginning of the song, of the audio frame n that the terminal device 10 first transmitted to the gate server 3. Accordingly, in step S156, the terminal device 10 calculates the elapsed time Δt from the previously stored reference time t0 to the present and adds it to the reproduction time tn. In other words, the reproduction time tn transmitted from the gate server 3 is the time from the beginning of the song to audio frame n, and the elapsed time Δt is the time from audio frame n to the present. The current reproduction position (reproduction time) Tc is therefore calculated by the following formula.

Tc = tn + Δt   (2)

As described above, by providing the gate server 3 with a music search function and identifying the song and its reproduction position based on the reproduced music data, speech can be reproduced in time with a song being reproduced from an external source. Instead of providing the music search function in the gate server 3, an external music search server may be used.
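Formula (2) can be illustrated with a short Python sketch. The class name `PlaybackPositionTracker`, its method names, and the injectable `clock` parameter are assumptions made for illustration; `time.monotonic` is used as the default clock so that Δt is not affected by wall-clock adjustments.

```python
import time


class PlaybackPositionTracker:
    """Computes Tc = tn + Δt as in formula (2)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._t0 = None  # reference time t0: when audio frame n was sent
        self._tn = None  # reproduction time tn reported by the server

    def mark_first_frame_sent(self):
        # Called when the first audio frame (frame n) is transmitted.
        self._t0 = self._clock()

    def set_reported_position(self, tn):
        # Called when the reproduction position information arrives (step S154).
        self._tn = tn

    def current_position(self):
        # Step S156: elapsed time since t0, added to tn.
        if self._t0 is None or self._tn is None:
            raise RuntimeError("t0 and tn must both be known")
        delta_t = self._clock() - self._t0
        return self._tn + delta_t
```

For example, if frame n lies 42.0 seconds into the song and 7.5 seconds have passed since it was sent, `current_position()` returns 49.5. Note that the time the server takes to identify the song does not introduce an error, because Δt is measured from t0, not from the arrival of the result.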

In step S159, reproduction may be ended when one song ends, but the processing may instead be continued when another song is reproduced after that song ends. That is, reproduction of speech may be continued for as long as the terminal device 10 continues transmitting reproduced music data to the gate server 3. This makes it possible to keep reproducing speech, following the change, even when the song reproduced from the external source changes.
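One way to follow song changes is sketched below in Python. The names `TrackState` and `SongFollower`, and the idea of comparing a song identifier from each identification result, are assumptions for illustration; the description only states that speech reproduction may continue while reproduced music data keeps being sent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackState:
    song_id: str  # identifier of the song reported by the server
    tn: float     # reproduction time of the first frame of this request (s)
    t0: float     # local time at which that frame was sent (s)


class SongFollower:
    """Keeps the latest identification result and reports song changes."""

    def __init__(self):
        self.state: Optional[TrackState] = None

    def on_result(self, song_id, tn, t0):
        """Store a new result; return True if the song differs from before.

        A True return means the speech information must be regenerated
        from the lyrics data of the new song.
        """
        changed = self.state is None or self.state.song_id != song_id
        self.state = TrackState(song_id, tn, t0)
        return changed
```

Each result also refreshes `tn` and `t0`, so the position calculation of step S156 restarts from the newest reference point.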

In the above configuration, the control unit 12 is an example of the reproduction position determination means, lyrics data acquisition means, lyrics voice data generation means, music data with lyrics voice generation means, and interruption determination means of the present invention; the microphone 14 is an example of the sound collecting means of the present invention; and the communication unit 11 is an example of the transmitting means and receiving means of the present invention.

1 Vehicle
2 Content provider
3 Gate server
4 Network
10, 10x Terminal device
12 Control unit
13 Storage unit
14 Microphone
20 Music reproduction device
30 Speaker

Claims

A lyrics voice output device comprising:
lyrics data acquisition means for acquiring lyrics data of the music being reproduced;
lyrics voice data generation means for generating lyrics voice data based on the lyrics data;
lyrics voice data shortening means for shortening the time length of the lyrics voice data; and
output means for outputting the lyrics voice data whose time length has been shortened by the lyrics voice data shortening means so that the output ends at a beat position of the music prior to the lyrics portion in the music being reproduced.

The lyrics voice output device according to claim 1, wherein the lyrics voice data corresponding to the lyrics portion is output so as to finish before the start timing of the lyrics portion in the music being reproduced.

The lyrics voice output device according to claim 1 or 2, further comprising voice recognition means for recognizing voice,
wherein the output means for outputting the lyrics voice data starts operating when it is determined that the user is singing.

The lyrics voice output device according to claim 1 or 2, further comprising sound collecting means for collecting sound in the vehicle,
wherein the output means for outputting the lyrics voice data starts operating when it is determined that the user is keeping rhythm.

The lyrics voice output device according to claim 1 or 2, further comprising photographing means for photographing the inside of the vehicle,
wherein the output means for outputting the lyrics voice data starts operating when it is determined that the user is moving his or her body in time with the music.

The lyrics voice output device according to claim 1, wherein the lyrics data acquisition means comprises:
sound collecting means for collecting audio data of the music being reproduced;
transmitting means for transmitting the collected audio data to an external server; and
receiving means for receiving music specifying information of the music being reproduced, the music having been identified by the external server based on the collected audio data.

The lyrics voice output device according to claim 6, wherein
the receiving means receives, from the external server, music reproduction position information indicating the elapsed time, from the beginning of the music being reproduced, of the audio data transmitted to the external server by the transmitting means, and
the reproduction position determination means determines the reproduction position based on the music reproduction position information and the elapsed time from the time at which the transmitting means transmitted the audio data to the external server.

An interruption determination means for determining whether or not the reproduction of the music is interrupted,
8. The lyrics voice output device according to claim 1, wherein the output unit ends the output of the lyrics voice data when the reproduction of the music piece is interrupted.

The output means ends the output of the lyrics voice data when the lyrics data acquisition means acquires the music identification information of a music different from the music that has been reproduced until then. 9. The lyrics voice output device according to any one of items 8 to 8.

When the lyrics data acquisition means acquires the music identification information of a music different from the music that has been played so far, the output means continues to output the lyrics voice data corresponding to the different music. 10. The lyrics voice output device according to claim 1, wherein

A lyrics voice output method executed by a terminal device including a computer, the method comprising:
a lyrics data acquisition step of acquiring lyrics data of the music being reproduced;
a lyrics voice data generation step of generating lyrics voice data based on the lyrics data;
a lyrics voice data shortening step of shortening the time length of the lyrics voice data; and
an output step of outputting the lyrics voice data whose time length has been shortened in the lyrics voice data shortening step so that the output ends at a beat position of the music prior to the lyrics portion in the music being reproduced.

A program executed by a terminal device including a computer, the program causing the computer to function as:
lyrics data acquisition means for acquiring lyrics data of the music being reproduced;
lyrics voice data generation means for generating lyrics voice data based on the lyrics data;
lyrics voice data shortening means for shortening the time length of the lyrics voice data; and
output means for outputting the lyrics voice data whose time length has been shortened by the lyrics voice data shortening means so that the output ends at a beat position of the music prior to the lyrics portion in the music being reproduced.

A storage medium storing the program according to claim 12.