WO2016135920A1

WO2016135920A1 - Reproduction device, reproduction method, and program

Info

Publication number: WO2016135920A1
Application number: PCT/JP2015/055609
Authority: WO
Inventors: 克仁石岡; 知道木村; 啓太郎菅原
Original assignee: パイオニア株式会社
Priority date: 2015-02-26
Filing date: 2015-02-26
Publication date: 2016-09-01

Abstract

This reproduction device outputs, along with a music signal, a lyric voice signal expressing lyrics included in the music signal. More specifically, the reproduction device outputs the lyric voice signal such that the output timing of the lyric voice signal is a predetermined time before the output start timing of a lyric section in the music signal that corresponds to the lyric voice signal. In this way, the lyric voice signal corresponding to a lyric is output a predetermined time before starting to output that lyric in the music signal, and thus, a user can sing along with the music signal after listening to the lyric voice signal and grasping the lyrics.

Description

REPRODUCTION DEVICE, REPRODUCTION METHOD, AND PROGRAM

The present invention relates to a method of outputting lyrics information as music is played.

A karaoke apparatus that synthesizes and outputs lyrics data prior to a karaoke performance is known (for example, Patent Documents 1 and 2).

Patent Document 1: Japanese Patent Laid-Open No. 4-67467
Patent Document 2: Japanese Patent Laid-Open No. 10-63274

In the case of a karaoke device, the music to be played back does not include sung lyrics, so the lyric voice output by the prior art is not difficult to hear. However, when the user is listening to ordinary music rather than karaoke, outputting the lyrics by voice using the prior-art technique causes the output lyrics to overlap with the lyrics included in the original music, which may make them difficult to hear. Also, for example, when the user is listening to music while driving a vehicle, the lyric voice output by the prior-art method may overlap with a route guidance voice message from the in-vehicle navigation device and may be difficult to hear.

The above is one example of the problems to be solved by the present invention. An object of the present invention is to provide lyric audio that is easy to hear, so that a user can sing along while music including lyrics is being played.

The invention according to claim 1 is a playback device that outputs a music signal and a lyric voice signal representing lyrics of the music signal, the playback device including output means for outputting the lyric voice signal such that the output end timing of the lyric voice signal is a certain time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal.

The invention according to claim 4 is a playback method executed by a playback device that outputs a music signal and a lyric voice signal representing lyrics of the music signal, the playback method including an output step of outputting the lyric voice signal such that the output end timing of the lyric voice signal is a predetermined time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal.

The invention according to claim 5 is a program executed by a playback device that includes a computer and outputs a music signal and a lyric voice signal representing lyrics of the music signal, the program causing the computer to function as output means for outputting the lyric voice signal such that the output end timing of the lyric voice signal is a predetermined time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal.

A diagram showing the concept of the assist vocal.
A flowchart of the assist vocal process.
A flowchart of the speech information generation process.
An overview of the speech information generation process.
An example of lyric blocking.
An example of a speech insertion method.
An example of the speech enhancement process.
A configuration according to another example of the speech enhancement process.
A configuration according to another example of the speech enhancement process.
A block diagram showing the overall configuration of a music reproduction system.
A block diagram showing an example of the internal configuration of a terminal device.
A flowchart of the assist vocal process performed by the music reproduction system of the first example.
A flowchart of the assist vocal process performed by the music reproduction system of the second example.
A flowchart of an assist vocal process that reproduces only speech.
A diagram explaining a method of identifying music being reproduced by an external source.

In a preferred embodiment of the present invention, a playback device that outputs a music signal and a lyric voice signal representing the lyrics of the music signal includes output means for outputting the lyric voice signal such that the output end timing of the lyric voice signal is a certain time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal.

The above playback device outputs, together with the music signal, a lyric voice signal representing the lyrics included in the music signal. Here, the playback device outputs the lyric voice signal such that the output timing of the lyric voice signal is a certain time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal. As a result, the lyric voice signal corresponding to a lyric is output a certain time before output of that lyric in the music signal starts. Therefore, the user can listen to the lyric voice signal, grasp the lyrics, and then sing along with the music signal.

One aspect of the above playback device includes beat detecting means for detecting the beat positions of the music signal, and the output means matches the output end timing of the lyric voice signal with a beat position of the music signal. In this aspect, since the output of the lyric voice signal ends at a timing that coincides with a beat position of the music signal, the user can easily sing along with the music signal.

In another aspect of the above playback device, the output means matches the output start timing of the lyric voice signal with a beat position of the music signal. In this aspect, since the output of the lyric voice signal starts at a timing that coincides with a beat position of the music signal, the user can easily sing along with the music signal.

In another preferred embodiment of the present invention, a playback method executed by a playback device that outputs a music signal and a lyric voice signal representing the lyrics of the music signal includes an output step of outputting the lyric voice signal such that the output end timing of the lyric voice signal is a predetermined time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal. According to this method, since the lyric voice signal corresponding to a lyric is output a certain time before output of that lyric in the music signal starts, the user can listen to the lyric voice signal, grasp the lyrics, and then sing along with the music signal.

In another preferred embodiment of the present invention, a program executed by a playback device that includes a computer and outputs a music signal and a lyric voice signal representing lyrics of the music signal causes the computer to function as output means for outputting the lyric voice signal such that the output end timing of the lyric voice signal is a predetermined time before the output start timing of the lyric section in the music signal that corresponds to the lyric voice signal. By executing this program, the lyric voice signal corresponding to a lyric is output a certain time before output of that lyric in the music signal starts, so the user can listen to the lyric voice signal, grasp the lyrics, and then sing along with the music signal. This program can be stored in a storage medium and handled in that form.

Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings.

[1] Assist Vocal

[1.1] Concept of Assist Vocal

When a user who is driving a vehicle plays and listens to music in the car, the user may want to sing along with the song. However, since the user cannot look at lyric information while driving, the user cannot sing unless the lyrics of the song have been memorized.

In this embodiment, when a song containing lyrics is played, the lyrics contained in the song are output as an audio signal and conveyed to the user. Specifically, when a song stored in the memory of the terminal device is played, the lyrics included in the song are output as audio and conveyed to the user before those lyrics are played in the song. The user can thereby sing along with the song being played even while driving. Users other than the driver can also sing the song without looking at a lyric booklet.

The function of outputting the content of the lyrics and conveying it to the user before the timing at which those lyrics are reproduced in the music is called “assist vocal”. In this embodiment, it is assumed that the music to be played is not a karaoke track but ordinary music including lyrics.

Figure 1 shows the concept of assist vocals. FIG. 1 schematically shows one piece of music. The horizontal axis in FIG. 1 indicates time. One piece of music includes a lyrics portion divided into a plurality of blocks. The part of the lyrics included in the music to be played is called “vocal”. Also, in the music, the part other than vocals is called “interlude”. Therefore, usually one piece of music is composed of a plurality of interludes and a plurality of vocals.

In the example of FIG. 1, the music is composed of three vocals 1 to 3 and a plurality of interludes. It is assumed that the content (lyrics) of the vocal 1 is “Aiueo”, the content of the vocal 2 is “Kakikukeko”, and the content of the vocal 3 is “Sashisuseso”.

In a situation where such a music is being played back, in the present embodiment, the lyrics “Aiueo” corresponding to the vocal 1 are output as audio prior to the timing at which the vocal 1 in the music is played back. In the present specification, the lyric sound output by the assist vocal is called “speech” and is distinguished from “vocal” included in the music.

In the example of FIG. 1, speech 1 corresponding to vocal 1 is output prior to vocal 1. Similarly, speech 2 is output prior to vocal 2, and speech 3 is output prior to vocal 3.

The speech outputs only the vocal lyrics included in the song as an audio signal and basically does not include elements such as pitch and rhythm. Also, as will be described later, the speech is basically inserted into the interlude before the corresponding vocal, so its length is adjusted as necessary, and it is usually shorter than the time over which the corresponding vocal is played back in the song. In a typical example, the speech is a spoken reading of the corresponding vocal lyrics.

[1.2] Assist Vocal Processing

Next, assist vocal processing for outputting speech will be described. FIG. 2 is a flowchart of the assist vocal process. This process is executed by a terminal device mounted on the vehicle, typically a mobile terminal such as a smartphone, the details of which will be described later. In the following description, it is assumed that the terminal device executes the processing.

First, the terminal device determines whether or not the assist vocal is on (step S1). The assist vocal may be turned on and off manually by the user or automatically. In the manual case, when the user wants speech to be reproduced by the assist vocal, the user operates a predetermined button or the like to turn on the assist vocal, and the terminal device detects this operation. In the automatic case, the terminal device analyzes the user's voice using, for example, a microphone, and automatically sets the assist vocal to on when the user is singing or performing an action equivalent to singing. The method of automatically setting the assist vocal will be described later.

If the assist vocal is not set to on (step S1: No), the process ends. On the other hand, when the assist vocal is set to on (step S1: Yes), the terminal device identifies the music being reproduced (step S2). The music played in the car may be music stored in the terminal device after being downloaded from a server, music stored in a storage medium such as a CD or in the memory of an in-vehicle device, music being played from the radio, or the like. When music stored in the terminal device is being reproduced, the terminal device can easily identify the music being reproduced. On the other hand, when music stored in a storage medium such as a CD is being played or when music is being played from the radio, the terminal device collects, with a microphone, the music being played from the speaker in the car and transmits the resulting audio data to an external music search server. The music search server stores a large number of pieces of music data as a database, identifies the music that matches the audio data received from the terminal device, and transmits information indicating that music (for example, the music name and artist name; hereinafter referred to as “music identification information”) to the terminal device. In this way, the terminal device acquires the music identification information of the music currently being played.

Thus, when the music being played is specified, the terminal device executes a speech information generation process (step S3). FIG. 3 is a flowchart of the speech information generation process. FIG. 4 shows an outline of the speech information generation process.

In FIG. 3, the terminal device acquires the lyric data of the music identified in step S2 from an external server or the like (step S31). Here, the “lyric data” is information that defines which lyrics are reproduced at which timing in the music. Specifically, it is information in which lyric text data indicating the lyrics included in the music is associated with reproduction time data indicating the time at which those lyrics are reproduced (the elapsed time from the start of the song).

Next, the terminal device acquires music analysis data (step S32). The music analysis data is information indicating musical features such as beat positions and bar positions in the music and is generated based on the audio data of the reproduced music. Specifically, the terminal device has a built-in music analysis application, collects the music played from a vehicle speaker with a microphone to acquire audio data, and analyzes the audio data to acquire music analysis data such as beat positions. Note that the music analysis data may be acquired using an external music analysis device or server instead of incorporating the music analysis application in the terminal device.

Next, the terminal device performs lyrics block formation (step S33). Lyrics blocking is a process of blocking lyrics text data included in the lyrics data acquired in step S31, and one block corresponds to one speech. That is, lyric blocking is a process of dividing lyric text data into speech units.

In the example of FIG. 4, the terminal device has acquired “Aiueokakikukekosashisuseso” as the lyric text data and divides it into three blocks, “Aiueo”, “Kakikukeko”, and “Sashisuseso”, to generate block lyric data.

FIG. 5 shows examples of lyric blocking. FIG. 5A shows a first method. In this method, each section between the interludes included in the music is set as one block. The “interlude” is a part other than “vocal” in the music. Specifically, when the length It of a section other than vocal (non-vocal section) is longer than a predetermined length t1, the terminal device determines that the section is an interlude.

However, there are exceptional cases in which multiple blocks are combined into one block because of the length of an interlude. In the example shown in FIG. 5B, the length It2 of interlude 2, which immediately precedes vocal 3, is very short relative to the length Vt3 of vocal 3 (It2 < α1 · Vt3; α1 is an arbitrary coefficient), so it is difficult to output the speech for vocal 3 during interlude 2. In such a case, if the length It1 of the preceding interlude 1 is longer than a predetermined length, the terminal device treats vocal 2 and vocal 3 as one block. The speech corresponding to vocal 2 and vocal 3 is thus output during interlude 1.
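The first blocking method and its exception can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the specification: the `Vocal` record, the function name `block_lyrics`, and the parameter values are hypothetical, and the sketch merges a vocal into the previous block whenever the preceding gap is too short, without separately checking the length It1 of the earlier interlude.

```python
from dataclasses import dataclass


@dataclass
class Vocal:
    text: str     # lyric text of one sung phrase
    start: float  # start time in seconds from the beginning of the song
    end: float    # end time in seconds


def block_lyrics(vocals, t1, alpha1):
    """Group sung phrases into blocks, one block per speech.

    t1     -- a non-vocal gap longer than this counts as an interlude
    alpha1 -- an interlude shorter than alpha1 * (next vocal length) is
              treated as too short to hold that vocal's speech
    """
    blocks = []
    prev_end = 0.0
    for v in vocals:
        gap = v.start - prev_end        # non-vocal section length (It)
        vocal_len = v.end - v.start     # vocal length (Vt)
        is_interlude = gap > t1
        too_short = gap < alpha1 * vocal_len
        if blocks and (not is_interlude or too_short):
            blocks[-1].append(v)        # stays in, or merges into, the previous block
        else:
            blocks.append([v])          # a usable interlude starts a new block
        prev_end = v.end
    return blocks
```

For example, with `t1 = 1.0` and `alpha1 = 0.2`, a 1.5-second interlude before a 10-second vocal is judged too short (1.5 < 2.0), so that vocal is combined with the preceding one.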

Fig. 5 (C) shows the second method. In this method, the terminal device determines each block based on a break included in the lyrics data. That is, if the lyric text data included in the lyric data includes delimiter information in advance, the terminal device can block the lyric text data according to the delimiter.

Next, the terminal device performs lyrics speech (step S34). The block lyric data obtained by lyric blocking is text data indicating lyrics, and lyric speech is a process of converting block lyric data into audio data. Specifically, the terminal device incorporates text-to-speech (TTS: TextToSpeech) software, and converts each block lyrics data obtained in step S33 into speech data. As a result, as shown in FIG. 4, speech data 1 to 3 that are audio data are generated from each block lyrics data. Instead of incorporating TTS software in the terminal device, TTS conversion by an external server or the like may be used.

Next, the terminal device changes the speech length (step S35). The speech length change is a process for shortening the time length of each speech obtained by lyric speech so that it can be reproduced in a short time. As already described, each speech is reproduced in an interlude preceding the corresponding vocal. However, since there is a limitation on the time length of the interlude, it is necessary to reproduce the speech by shortening it. For this reason, the speech length is changed.

Basically, the playback time of each speech is shortened (the playback speed is increased) within a range in which the speech remains intelligible to a human listener. For example, when the time length of each speech obtained in step S34 (referred to as the “original speech length”) is “St” and the speech length conversion coefficient is “α2”, the speech length “Stv” after the speech length change is given by
Stv = St · α2 (α2 < 1.0) (1)
For example, if α2 = 0.7, each speech is reproduced in 70% of its original time, that is, its playback time is 30% shorter than the original.

In addition to the batch change described above, the playback time may be further shortened according to the duration of the interlude corresponding to each speech. In this case, even for speeches with the same number of characters or words, or with the same lyrics, the playback time varies depending on the position in the song (the length of the preceding interlude).
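As a rough illustration of equation (1) and the additional per-interlude shortening, the following Python sketch computes the changed speech length Stv. The function names and the `min_alpha` floor (a stand-in for "the range that a human can still follow") are assumptions introduced here, not values from the specification.

```python
def shortened_speech_length(st, alpha2=0.7, interlude_len=None, min_alpha=0.5):
    """Return the speech length Stv after the speech length change.

    st            -- original speech length St in seconds
    alpha2        -- speech length conversion coefficient of equation (1)
    interlude_len -- optional length of the interlude that must hold the speech
    min_alpha     -- hypothetical lower bound so the speech stays intelligible
    """
    if not 0.0 < min_alpha <= alpha2 < 1.0:
        raise ValueError("expected 0 < min_alpha <= alpha2 < 1.0")
    stv = st * alpha2                                   # equation (1): Stv = St * alpha2
    if interlude_len is not None:
        # Shorten further to fit the interlude, but never below the floor.
        stv = max(min(stv, interlude_len), st * min_alpha)
    return stv


def playback_rate(st, stv):
    """Speed-up factor needed to play a speech of length st in time stv."""
    return st / stv
```

With α2 = 0.7, a 4-second speech becomes about 2.8 seconds long, which corresponds to a playback rate of roughly 1.43 times the original speed.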

Next, the terminal device calculates the speech insertion timing (step S36). The terminal device inserts the speech corresponding to a given vocal before the playback timing of that vocal. In the example shown in FIG. 4, speech 1 corresponding to vocal 1 is inserted before the reproduction timing of vocal 1. Similarly, speech 2 corresponding to vocal 2 is inserted before the reproduction timing of vocal 2, and speech 3 corresponding to vocal 3 is inserted before the reproduction timing of vocal 3.

A specific example of a method for inserting speech is shown in FIG. FIG. 6 shows an example of the timing at which the speech 2 corresponding to the vocal 2 is inserted.

In Method 1, the speech ends a certain time before the start timing of the corresponding vocal. Specifically, as shown in FIG. 6, the speech 2 is inserted so as to end a certain time T2 before the reproduction start timing of the vocal 2. That is, the speech 2 ends a predetermined time T2 before the start of the reproduction of the vocal 2. In this case, the reproduction start timing of the speech 2 is determined according to the length of the speech 2. In Method 1, since a certain time is secured from the end of the speech reproduction until the corresponding vocal is reproduced, the user can sing the vocal portion with a margin.

In Method 2, the speech end timing is matched with the beat position of the music. Specifically, in the example of FIG. 6, the speech 2 is inserted so as to end N beats before the playback start timing of the vocal 2 (N is an arbitrary integer; N = 1 in this example). In this case, the reproduction start timing of the speech 2 is determined according to the length of the speech 2. The position of the beat of the music is acquired from the music analysis data described above.

In Method 3, both the speech playback start timing and playback end timing are matched with the beat position of the music. Specifically, in the example of FIG. 6, the playback start timing and playback end timing of the speech 2 are both made coincident with the third beat of the four beats.

As in Methods 2 and 3, if the speech end timing, or both the start and end timings, coincide with beat positions of the music, the speech is linked to the music, so that the user can easily sing along with the music.
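The three insertion methods can be sketched as follows. This is an illustrative Python sketch under assumptions: the function names are hypothetical, `beats` is assumed to be a sorted list of beat times (in seconds) taken from the music analysis data, and for Method 3 the start is simply snapped to the latest beat that still leaves room for the speech, which is one possible reading of "both timings coincide with beat positions".

```python
import bisect


def method1_timing(vocal_start, speech_len, t2):
    """Method 1: the speech ends a fixed time t2 before the vocal starts."""
    end = vocal_start - t2
    return end - speech_len, end


def _beat_before(beats, vocal_start, n):
    """Return the beat that lies n beats before the vocal start."""
    i = bisect.bisect_left(beats, vocal_start)  # first beat at or after the vocal start
    if n < 1 or i - n < 0:
        raise ValueError("not enough beats before the vocal")
    return beats[i - n]


def method2_timing(vocal_start, speech_len, beats, n=1):
    """Method 2: the speech ends on the beat n beats before the vocal."""
    end = _beat_before(beats, vocal_start, n)
    return end - speech_len, end


def method3_timing(vocal_start, speech_len, beats, n=1):
    """Method 3: both the start and the end of the speech fall on beats."""
    end = _beat_before(beats, vocal_start, n)
    j = bisect.bisect_right(beats, end - speech_len) - 1  # latest beat leaving enough room
    if j < 0:
        raise ValueError("no beat early enough to start the speech")
    return beats[j], end
```

In Methods 1 and 2 the start timing follows from the speech length, as described above. In the Method 3 sketch the slot between the two beats may be slightly longer than the speech, so the speech would need to be padded or time-scaled to fill it; that handling is left out here.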

As described above, the terminal device determines the speech insertion timing. Specifically, for each speech, the playback start timing and playback end timing are defined by the elapsed time from the beginning of the music. The playback start timing and playback end timing of each speech is stored as part of the speech information. That is, the speech information includes an audio signal corresponding to each speech (hereinafter also referred to as “speech signal”) and the reproduction start timing / reproduction end timing of each speech.

Next, the processing returns to the main routine shown in FIG. 2, and the terminal device acquires the current playback position of the music being played back (step S4). Specifically, the terminal device acquires the current reproduction position by counting the elapsed time from the reproduction start time of the music being reproduced.

Next, the terminal device performs the speech enhancement process (step S5). The speech enhancement process makes the speech easy to distinguish from the vocal included in the music and easy to hear; its details will be described later.

Next, the terminal device reproduces the speech based on the reproduction start timing / reproduction end timing of each speech included in the speech information and the current reproduction position (step S6). Specifically, the speech reproduction is started at the speech reproduction start timing, and the speech reproduction is terminated at the speech reproduction end timing. As a result, the corresponding speech is reproduced prior to the vocal in the music.
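The lookup in step S6 can be sketched as a simple comparison of the current playback position against the stored timings. The `SpeechEntry` record and `speech_to_play` function are hypothetical names used only for illustration, and the lyric text stands in for the speech signal that the speech information actually holds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechEntry:
    text: str     # stand-in for the speech signal of one block
    start: float  # reproduction start timing, seconds from the start of the music
    end: float    # reproduction end timing, seconds from the start of the music


def speech_to_play(speech_info: List[SpeechEntry], position: float) -> Optional[SpeechEntry]:
    """Return the speech that should be sounding at the given playback position."""
    for entry in speech_info:
        if entry.start <= position < entry.end:
            return entry
    return None  # no speech is scheduled at this position
```

A caller would evaluate this repeatedly with the position obtained in step S4, starting the speech when an entry is first returned and stopping it when `None` is returned again.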

Next, the terminal device determines whether or not the speech reproduction should be terminated (step S7). Examples of cases where speech reproduction should be terminated include a case where no speech information remains, a case where the music reproduction itself has ended, and a case where the assist vocal has been turned off by a user operation. If the speech reproduction should not be terminated (step S7: No), the process returns to step S4 to continue the speech reproduction. On the other hand, if the speech reproduction should be terminated (step S7: Yes), the assist vocal process ends.

[1.3] Automatic Assist Vocal Setting Method

Next, a method for automatically turning on the assist vocal in step S1 of the assist vocal process shown in FIG. 2 will be described.

As a basic method, the terminal device collects the voice uttered by the user with a microphone and automatically turns on the assist vocal when it determines that the user is singing along with the music or performing an action equivalent to singing. For example, if analysis of the voice data collected by the microphone shows that the user is humming, singing softly to himself or herself, or the like, the assist vocal is turned on. On the other hand, when the voice data is not singing but a conversation with a passenger, the assist vocal is not turned on. Even when the voice data partly includes humming, the assist vocal is not turned on if the voice data is mostly conversation.

Whether or not the user's voice included in the voice data is singing can be determined based on the presence or absence of rhythm and pitch in the voice data. For example, if the rhythm is regular or the change in pitch is large, the voice can be judged to be singing; if the rhythm is irregular and the change in pitch is small, it can be judged not to be singing (that is, to be conversation). Further, using the music analysis application described above, the voice may be judged to be singing when beats or bars can be extracted from the voice data and not to be singing when they cannot be extracted. Further, using the music search server or the music search function described above, the voice may be judged to be singing when a song can be identified from the voice data and not to be singing when no song can be identified.
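The rhythm and pitch criterion can be illustrated with a small heuristic. This is only a sketch under assumptions: it presumes that onset times (strictly increasing, in seconds) and pitch estimates (in Hz) have already been extracted from the microphone signal by some other means, and the function name and both thresholds are hypothetical values chosen for illustration.

```python
import math
import statistics


def looks_like_singing(onset_times, pitches_hz,
                       max_interval_cv=0.2, min_pitch_range_semitones=3.0):
    """Judge singing when the rhythm is regular or the pitch change is large."""
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    if len(intervals) < 2 or len(pitches_hz) < 2:
        return False  # too little data to judge
    # Regular rhythm: small coefficient of variation of inter-onset intervals.
    cv = statistics.pstdev(intervals) / statistics.mean(intervals)
    # Large pitch change: wide range between lowest and highest pitch, in semitones.
    pitch_range = 12.0 * math.log2(max(pitches_hz) / min(pitches_hz))
    return cv <= max_interval_cv or pitch_range >= min_pitch_range_semitones
```

Evenly spaced onsets are judged to be singing even on a single pitch, while irregular onsets with an almost flat pitch are judged to be conversation.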

Also, the terminal device may calculate the correlation between the collected voice data and the music being played and, when the correlation is a certain value or more, determine that the user is singing and turn on the assist vocal. In addition, when the terminal device has already acquired the lyric data of the song being played, it may determine that the user is singing when the correlation between the voice data collected by the microphone and the lyric data is a certain value or more. Further, based on the lyric data, when the user's voice is also detected at interlude positions of the music, where no lyrics should exist, the voice may be determined to be conversation.

Also, rhythm information collected by the microphone may be used. For example, if it is determined that the user is tapping the steering wheel with a hand or finger, or tapping the floor with a foot, in time with the rhythm of the music, the user may be regarded as performing an action equivalent to singing, and the assist vocal may be turned on. In this case, the correlation between the rhythm collected by the microphone and the rhythm of the music being played may be calculated, and the assist vocal may be turned on when the correlation is a certain value or more. Alternatively, without calculating the correlation with the rhythm of the music being played, the assist vocal may be turned on when the rhythm collected by the microphone repeats a constant pattern.

Furthermore, the user may be photographed with a camera that captures the interior of the vehicle, and the assist vocal may be turned on when the user is shaking his / her head in time with the music. In addition, the camera that captures the interior of the vehicle may be used to detect whether there is a passenger in the passenger seat or the rear seat, and the criterion for determining whether the user is singing or talking may be changed depending on whether a passenger is present.

In the above example, the assist vocal is turned on when it is determined that the user is singing. However, even when the user is singing, the assist vocal need not be turned on if it is determined that the user knows the lyrics and does not need the assist vocal to be played. Specifically, for example, when the correlation between the collected voice data and the music being played is a certain value or more and the correlation with the lyric data is also a certain value or more, the user is judged to know the lyrics, and the assist vocal is not turned on even though the user is singing.

In this case, however, the user may turn out not to know the lyrics partway through the song, so the speech information may be generated in advance and kept ready for output. If, after that, the correlation between the collected voice data and the music being played becomes less than a certain value, or the correlation with the lyric data becomes less than a certain value, it is determined that the user does not know the lyrics, and the assist vocal is output.

In the above example, the method of automatically turning on the assist vocal has been described, but the assist vocal can also be turned off automatically. While the assist vocal is on, it may be turned off automatically when it is determined that the user is neither singing along with the music nor performing an action equivalent to singing (humming, singing softly, or the like). Similarly, the assist vocal may be turned off automatically when a conversation is detected, or when it is determined that the user is not keeping the rhythm or is not shaking his / her head in time with the music.

Moreover, in the above example, the assist vocal is automatically turned on or off based on whether or not the user is singing or performing an action equivalent to singing. However, the automatic on or off setting may instead be made according to the structure of the song.
For example, for a user who wants to sing only the chorus of the song, the assist vocal may be automatically turned on when the chorus is played and automatically turned off when parts other than the chorus are played. Conversely, for a user who knows the chorus and wants to practice the other parts, the assist vocal may be automatically turned on when parts other than the chorus are played and automatically turned off when the chorus is played.

[1.4] Speech Enhancement Process

Next, the speech enhancement process executed in step S5 of the assist vocal process shown in FIG. 2 will be described. The speech enhancement process makes it easy for the user to distinguish speech from vocal and to hear the speech; several methods are described below.

[1.4.1] Processing when speech and vocal overlap

The speech is basically reproduced during the interlude immediately before the corresponding vocal and preferably does not overlap the vocal in time. The speech length changing process described above (step S35) is performed for this purpose. However, depending on the length of the speech and the length of the interlude, the speech may not be completely reproduced within the interlude even after the speech length is shortened. That is, when the speech is longer than the interlude, the speech and the vocal partially overlap when reproduced. In such a case, rather than reproducing the overlapping speech and vocal as they are, any of the following processes may be performed.

(1) Adjust the vocal level.

If speech and vocals overlap, one method is to lower the vocal volume level. FIG. 7A shows a case where the rear portion of the speech and the head portion of the vocal overlap, producing an overlapping portion X. In this case, the volume of the vocal is adjusted in the overlapping portion X. Specifically, the vocal volume is reduced to a level at which the speech can be heard, or to zero. Thereby, in the overlapping portion X, the reproduction of the speech is prioritized and the speech is easy to hear.

FIG. 7 (B) shows a case where the speech head portion and the rear portion of the previous vocal overlap, resulting in an overlap portion X. Also in this case, the vocal volume is adjusted in the overlapping portion X. Specifically, the vocal volume is reduced to a level where speech can be heard, or zero. Further, in the overlapping portion X, the volume level of the vocal may not be suddenly lowered, but the volume level may be gradually lowered by fading out the vocal. Thereby, in the overlapping part X, the reproduction of the speech is prioritized and the speech is easy to hear.

Specifically, when the vocal component and the performance component such as musical instruments are separated in the music signal, the above level adjustment may be performed by lowering the volume level of the vocal component. On the other hand, when the vocal part is mixed with the performance part such as musical instruments and the volume of the vocal alone cannot be adjusted, the volume level of the entire music signal may be reduced. In particular, the volume level may be lowered only for the components in the frequency band corresponding to the vocal (human voice).
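The vocal level adjustment in the overlapping portion X can be illustrated with a minimal sketch. It assumes that the vocal component is available as a separate list of samples and that the overlapping portion X is given as sample indices. The function names, the linear fade shape, and the parameters `floor` and `fade` are hypothetical and do not appear in the description.

```python
def duck_gains(n, start, end, floor=0.2, fade=0):
    """Return per-sample gains that lower the vocal inside [start, end).

    floor: gain applied in the overlapping portion X (0.0 mutes the vocal).
    fade:  number of samples over which the gain falls linearly to `floor`
           (0 means the level is lowered suddenly).
    """
    gains = [1.0] * n
    for i in range(max(start, 0), min(end, n)):
        if fade > 0 and i - start < fade:
            # Gradual fade-out instead of a sudden drop (assumed linear).
            t = (i - start + 1) / fade
            gains[i] = 1.0 + (floor - 1.0) * t
        else:
            gains[i] = floor
    return gains


def duck_vocal(vocal, start, end, floor=0.2, fade=0):
    """Apply the gains to the vocal samples; samples outside X are unchanged."""
    gains = duck_gains(len(vocal), start, end, floor, fade)
    return [s * g for s, g in zip(vocal, gains)]
```

When the vocal cannot be separated, the same gains could be applied to the entire music signal, or to a band-limited copy of it, as described above.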

(2) Adjust the speech level.

When the speech and the vocal overlap, another approach is to lower the volume level of the speech. FIG. 7C shows a case where the rear portion of the speech and the head portion of the vocal overlap, producing an overlapping portion X. In this case, the speech volume is adjusted in the overlapping portion X. Specifically, the speech volume is reduced or set to zero. Instead of reducing the speech volume suddenly, the speech may be faded out so that its volume decreases gradually. In this case, the speech cannot be heard in the overlapping portion X. However, when listening to a song that the user knows to some extent, the user generally does not remember the entire lyrics but can often sing from memory once the beginning of the lyrics is known. Therefore, as shown in FIG. 7C, as long as the head portion of the speech can be heard, it is acceptable for the rear portion of the speech to be difficult to hear. This technique is effective in such a case.
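As a rough illustration of the case of FIG. 7C, the following sketch first determines the overlapping portion X from the timings and then fades out the rear portion of the speech. It assumes that times are given in seconds, that the vocal lasts longer than the speech, and that the fade is linear. All names are hypothetical.

```python
def overlap_portion(speech_start, speech_len, vocal_start):
    """Return (start, end) of the overlapping portion X, or None if the
    speech ends before the vocal begins."""
    speech_end = speech_start + speech_len
    if speech_end <= vocal_start:
        return None
    return (max(speech_start, vocal_start), speech_end)


def fade_out_speech(speech, overlap_samples):
    """Linearly fade the last `overlap_samples` samples of the speech to zero,
    leaving the head portion of the speech untouched."""
    n = len(speech)
    k = min(overlap_samples, n)
    out = list(speech)
    for j in range(k):
        # The gain decreases step by step and reaches 0.0 at the end of X.
        gain = (k - 1 - j) / k
        out[n - k + j] = speech[n - k + j] * gain
    return out
```

For example, a 3-second speech starting at 10 s before a vocal that starts at 12 s gives an overlapping portion X from 12 s to 13 s, and only that final second of the speech is attenuated.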

[1.4.2] Processing to hear speech and vocals from different directions

Humans have the ability to distinguish sounds that arrive from different directions at the same time (the so-called cocktail party effect). This ability can be used to enable the user to distinguish between the speech and the vocal. This method is executed regardless of whether the speech and the vocal overlap in time.

(1) Method of adjusting phase with left and right speakers

FIG. 8A shows a configuration in which the phase of the speech output from the left and right speakers is inverted. The music signal of the left (L) channel is supplied to the adder 32, and the music signal of the right (R) channel is supplied to the adder 33. On the other hand, the speech signal is supplied to the adder 33 as it is, and is supplied to the adder 32 after its phase is inverted by the phase inverter 31. The output of the adder 32 is supplied to the left speaker 30L, and the output of the adder 33 is supplied to the right speaker 30R.

According to this configuration, the sound image of the music including the vocal is localized between the left and right speakers, whereas the sound image of the speech is localized around the user's ears, so that the user can easily distinguish between the speech and the vocal in the music. In the example of FIG. 8A, only the phase of the speech signal supplied to the left speaker 30L is inverted by the phase inverter 31, but instead only the phase of the speech signal supplied to the right speaker 30R may be inverted. Also, as long as there is a certain phase difference between the speech signals supplied to the left and right speakers, the sound image position of the speech can be made different from that of the music, so the speech signal supplied to one speaker does not necessarily have to be inverted (shifted by 180°). That is, it is only necessary to give a certain phase difference between the speech signal supplied to one speaker and the speech signal supplied to the other speaker.
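The signal flow of FIG. 8A can be sketched as follows, assuming the two music channels and the speech are lists of samples of equal length. The function name is hypothetical, and only the 180° inversion case is shown.

```python
def mix_phase_inverted_speech(music_l, music_r, speech):
    """Mix the speech into both channels with opposite polarity.

    Left output:  music L plus the inverted speech (phase inverter 31, adder 32).
    Right output: music R plus the speech as it is (adder 33).
    """
    left = [m - s for m, s in zip(music_l, speech)]
    right = [m + s for m, s in zip(music_r, speech)]
    return left, right
```

One consequence of this arrangement is that the inverted and non-inverted speech cancel if the two outputs are summed to mono, so the sketch is meaningful only for two-channel reproduction.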

(2) Method for Controlling Sound Image Localization

FIG. 8B shows a configuration in which the sound image of the speech can be set at an arbitrary position. The music signal of the left (L) channel is supplied to the adder 32, and the music signal of the right (R) channel is supplied to the adder 33. On the other hand, the speech signal is supplied to the adders 32 and 33 via the sound image localization control calculation unit 34 and the crosstalk cancellation unit 35. The sound image localization control calculation unit 34 convolves the speech signal with the transfer function between the target speaker position and the listening position (the user's position), and the crosstalk cancellation unit 35 performs processing for canceling the transfer function between the speakers outputting the music and the listening position. Accordingly, the sound image of the music is localized between the left and right speakers 30L and 30R, while the sound image of the speech is localized at the target speaker position, so that the user can easily distinguish between the speech and the vocal.
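The convolution performed by the sound image localization control calculation unit 34 can be illustrated with a small sketch. It assumes that the transfer functions from the target position to the left and right ears are available as short impulse responses (hypothetical inputs `ir_left` and `ir_right`). The crosstalk cancellation of unit 35 is not covered here.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def localize_speech(speech, ir_left, ir_right):
    """Convolve the speech with the left-ear and right-ear impulse responses.

    The two results are the speech components that would be passed on toward
    the adders 32 and 33 (after crosstalk cancellation, omitted here).
    """
    return convolve(speech, ir_left), convolve(speech, ir_right)
```

With a right-ear response that is delayed and attenuated relative to the left-ear response, the speech would be perceived toward the left.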

(3) Method of using a headrest speaker

When a headrest speaker is mounted on a vehicle seat in addition to the vehicle speakers, the music including the vocal can be output from the vehicle speakers, and the speech can be output from the headrest speaker. A configuration example in this case is shown in FIG.

The music signals of the left and right channels are supplied to the vehicle speakers 30L and 30R, respectively. The speech signal is supplied to the right headrest speaker 35R as it is, and is supplied to the left headrest speaker 35L after its phase is inverted by the phase inverter 31. In this case as well, since a phase difference is given to the speech signals supplied to the two headrest speakers 35L and 35R, the sound image of the speech is localized at a position different from the sound image of the music, and the user can easily distinguish between the speech and the vocal in the music. In this example as well, as in the example of FIG. 8A, it is sufficient to give a certain phase difference between the speech signal supplied to one headrest speaker and the speech signal supplied to the other headrest speaker.

When using the headrest speaker, the speech may be reproduced using the headrest speaker in the passenger seat instead of the headrest speaker in the driver seat. Further, when headrest speakers are mounted on a plurality of seats of the vehicle, it may be possible to select and set the necessity of speech reproduction for each seat. In this way, it is possible to set so that the speech is reproduced only from the headrest speaker in the seat of the passenger who wants to sing the music while listening to the speech.

Further, instead of providing the phase difference, the sound image of the speech may be localized at an arbitrary position by using the sound image localization control calculation unit 34 and the crosstalk cancellation unit 35 in the same manner as the processing described with reference to FIG. 8B. This makes it easy for the user to distinguish between the speech and the vocal.

[2] System Configuration

Next, a configuration example of a music playback system that realizes the above-described assist vocal will be described.

[2.1] First Embodiment

In the first embodiment, the assist vocal process is executed mainly on the terminal device side. FIG. 10 shows the overall configuration of the music playback system according to the first embodiment. In the music playback system of the first embodiment, a plurality of vehicles 1, a content provider 2, and a gate server 3 can communicate with each other via a network 4. The plurality of vehicles 1 can communicate with the content provider 2 and the gate server 3 via the network 4 by wireless communication.

The content provider 2 is a server of, for example, a music distributor, and provides music data, music metadata, lyrics data, and the like. The gate server 3 is a server that functions to realize the assist vocal according to the present embodiment. It acquires the music data, metadata, lyrics data, and the like of the necessary music from the content provider 2 and stores them in a database (not shown).

An example of the internal configuration of the vehicle 1 is shown in FIG. The vehicle 1 includes a terminal device 10, a music playback device 20, and a speaker 30.

The terminal device 10 is typically a mobile terminal such as a smartphone, and includes a communication unit 11, a control unit 12, a storage unit 13, a microphone 14, and an operation unit 15. The communication unit 11 communicates with the gate server 3 through the network 4. The control unit 12 includes a CPU and the like, and controls the entire terminal device 10.

The storage unit 13 is a memory such as a ROM or a RAM. It stores a program with which the control unit 12 executes various processes, and also functions as a work memory. When the control unit 12 executes the program stored in the storage unit 13, processing including the assist vocal process is executed. The storage unit 13 may also store music data of music saved by the user.

The microphone 14 collects sounds such as music being played in the car, singing by the user, conversation, etc., and generates sound data. The operation unit 15 is typically a touch panel or the like, and receives an operation and selection input by a user.

The music playback device 20 is a car audio, for example, and includes an amplifier. The speaker 30 is a speaker mounted on the vehicle. The music playback device 20 plays back music from the speaker 30 based on the music data supplied from the terminal device 10.

Another example of the internal configuration of the vehicle 1 is shown in FIG. In this example, the vehicle 1 includes a terminal device 10x. The terminal device 10x combines the functions of the terminal device 10, such as the portable terminal shown in FIG. 11A, and the music playback device 20, such as a car audio system. Like the terminal device 10, the terminal device 10x includes a communication unit 11, a control unit 12, a storage unit 13, a microphone 14, and an operation unit 15. It also includes a music playback unit 16 that corresponds to the music playback device 20. The terminal device 10x is connected to the speaker 30 and reproduces music from the speaker 30 based on music data.

Next, assist vocal processing by the music reproducing system of the first embodiment will be described. FIG. 12 is a flowchart of the assist vocal process according to the first embodiment. In this process, the assist vocal process is executed mainly by the terminal device 10 or 10x (hereinafter simply referred to as “terminal device 10”).

First, the gate server 3 connects to the content provider 2 via the network 4, acquires music data and lyrics data for a plurality of pieces of music, and stores them in an internal database (step S101).

The terminal device 10 receives designation of the music to be played by the operation of the operation unit 15 by the user (step S102), and transmits music designation information for designating the music to the gate server 3 (step S103). The gate server 3 acquires the song data and lyrics data of the song corresponding to the received song designation information from the database, and transmits it to the terminal device 10 (step S104).

Next, the terminal device 10 performs the processing of steps S105 to S109 using the received music data and lyrics data. Here, the processing in steps S105 to S109 is the same as that in steps S3 to S7 in FIG.

Thus, in the music reproducing system of the first embodiment, the terminal device 10 mounted on the vehicle 1 mainly executes the assist vocal process.

In the above example, the gate server 3 acquires the music data from the content provider 2 in step S101. However, when the music data is stored in the terminal device 10, the gate server 3 may acquire the music data from the terminal device 10. Further, when music data is stored in the database in the gate server 3, the music data may be acquired therefrom.

[2.2] Second Embodiment

In the second embodiment, a part of the assist vocal process is executed on the gate server 3 side. The overall configuration of the music playback system according to the second embodiment is the same as that of the first embodiment shown in FIG.

Next, assist vocal processing by the music reproducing system of the second embodiment will be described. FIG. 13 is a flowchart of the assist vocal process according to the second embodiment. In this process, the gate server 3 generates speech information, further generates music data with speech, and transmits it to the terminal device 10. The terminal device 10 receives and reproduces the music data with speech. This will be described in detail below.

First, the gate server 3 connects to the content provider 2 via the network 4, acquires music data and lyrics data for a plurality of pieces of music, and stores them in an internal database (step S201). The gate server 3 then generates speech information for each piece of music based on the acquired music data and lyrics data (step S202). This speech information generation process is the same as step S3 in FIG.

When the speech information has been generated, the gate server 3 adds the speech to the music data and generates music data with speech (step S203). Specifically, based on the generated speech information, the gate server 3 combines the speech signal corresponding to each speech with the music data at the timing calculated by the process of step S36 in FIG. 3, generates the music data with speech, and stores it in the database. In other words, the music data with speech is data that, when simply reproduced as it is, reproduces the speech in addition to the music.
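The combining in step S203 can be pictured with a minimal sketch. It assumes that the music data is a list of samples and that each speech has already been synthesized and assigned a start position in samples by the timing calculation of step S36. The function name and the `(start_index, samples)` clip format are hypothetical.

```python
def add_speech_to_music(music, speech_clips):
    """Return music data with speech.

    music:        list of music samples
    speech_clips: list of (start_index, samples) pairs, one per speech
    """
    out = list(music)
    for start, clip in speech_clips:
        for k, s in enumerate(clip):
            idx = start + k
            # Speech samples that fall outside the music are dropped.
            if 0 <= idx < len(out):
                out[idx] += s
    return out
```

Because the speech is mixed in advance, the terminal device only needs an ordinary player to reproduce the result.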

The terminal device 10 receives designation of the music to be played through the user's operation of the operation unit 15 (step S204), and transmits music designation information for designating the music to the gate server 3 (step S205). The gate server 3 transmits the music data with speech corresponding to the received music designation information to the terminal device 10 (step S206).

Next, the terminal device 10 reproduces the received music data with speech (step S207). Thereby, the speech is reproduced at an appropriate timing during the reproduction of the music. Next, the terminal device 10 determines whether or not the music reproduction should be terminated (step S208). When the music has been played to the end, or when playback should be terminated, such as when the user has stopped playing (step S208: Yes), the terminal device 10 finishes playing. On the other hand, if the reproduction of the music should not be terminated (step S208: No), the process returns to step S207, and the reproduction of the music data with speech is continued.

Thus, in the music playback system of the second embodiment, the music data with speech is generated on the gate server 3 side and provided to the terminal device 10. By having the terminal device 10 reproduce the received music data with speech, the user can listen to music including the speech.

In the above example, the gate server 3 acquires the music data from the content provider 2 in step S201. However, when the music data is stored in the terminal device 10, the gate server 3 may acquire the music data from the terminal device 10. Further, when music data is stored in the database in the gate server 3, the music data may be acquired therefrom.

[3] Assist Vocal that Reproduces Only Speech

In the above-described assist vocal process, the music being reproduced by the terminal device 10 is reproduced with speech added. However, it would be convenient if speech could also be added to music reproduced from a source other than the terminal device 10, such as an in-car radio or a CD (hereinafter referred to as an "external source"). In this case, the terminal device 10 basically generates the speech information by the above-described method, and only needs to reproduce the speech at a timing corresponding to the reproduction position of the music reproduced from the external source.

FIG. 14 shows a flowchart of assist vocal processing in this case. First, the terminal device 10 collects music reproduced from an external source by the microphone 14 to acquire reproduced music data (step S151), and transmits this to the gate server 3 (step S152).

The gate server 3 receives the reproduced music data from the terminal device 10, and specifies the corresponding music and its reproduction position (step S153). Specifically, the gate server 3 includes a music search unit having the function of the music search server described above, specifies the music based on the reproduced music data, and specifies the reproduction position corresponding to that portion of the reproduced music data. Then, the gate server 3 transmits the lyrics data and the reproduction position information to the terminal device 10 together with the music title and artist name of the specified music (step S154).

The terminal device 10 generates speech information using the received lyric data (step S155). Note that the speech information is generated by the same method as described with reference to FIG. In addition, the terminal device 10 can acquire music analysis data by analyzing the reproduction music data acquired with the microphone 14 (process of step S32 of FIG. 3).

Next, the terminal device 10 calculates the current playback position of the music based on the playback position information acquired from the gate server 3 (step S156). This method will be described later. Next, the terminal device 10 performs speech enhancement processing (step S157), and reproduces speech at an appropriate timing according to the music being reproduced by the external source (step S158). As a result, the speech is reproduced in accordance with the music being reproduced from the external source.

Then, the terminal device 10 determines whether or not to end the speech reproduction (step S159). If the reproduction is not to be ended, the process returns to step S156 and continues. On the other hand, when the reproduction of the music from the external source has finished, when the music being reproduced has changed to another piece of music, when there is no more speech to be reproduced, or the like (step S159: Yes), the process ends.

Next, with reference to FIG. 15, a method for specifying the current reproduction position of the music in step S156 will be described. The reproduced music data transmitted from the terminal device 10 to the gate server 3 is actually data of a plurality of audio frames. That is, the terminal device 10 collects the music reproduced by the external source with the microphone 14 and sequentially transmits it to the gate server 3 as a plurality of audio frames.

In the example of FIG. 15, the terminal device 10 sequentially transmits audio frames n, (n + 1), (n + 2), ... to the gate server 3 as the reproduced music data. At this time, the terminal device 10 stores the time at which the reproduced music data was first transmitted, that is, the time at which the audio frame n was transmitted in the example of FIG. 15 (hereinafter referred to as the "reference time t0").

The music search unit of the gate server 3 refers to information on a large number of pieces of music stored in the database, and specifies the music based on the plurality of received audio frames. In the example of FIG. 15, it is assumed that the music search unit of the gate server 3 has specified the music based on the audio frames n to (n + 4). In this case, in addition to the music title, artist name, and the like, the gate server 3 transmits to the terminal device 10, as the reproduction position information, the reproduction time tn of the audio frame n received from the terminal device 10, measured from the beginning of the music. That is, the reproduction position information transmitted from the gate server 3 to the terminal device 10 in step S154 of FIG. 14 is the elapsed time from the beginning of the music to the audio frame n that the terminal device 10 first transmitted to the gate server 3. Therefore, in step S156, the terminal device 10 calculates the elapsed time Δt from the reference time t0 stored in advance to the present, and adds it to the reproduction time tn. Since the reproduction time tn transmitted from the gate server 3 is the time from the beginning of the music to the audio frame n, and the elapsed time Δt is the time from the audio frame n to the present, the current reproduction position (reproduction time) Tc is calculated by the following equation.

Tc = tn + Δt (2)

As described above, by providing a music search function in the gate server 3 and specifying the music and its reproduction position based on the reproduced music data, it is possible to reproduce the speech in accordance with the music being reproduced from the external source. Further, an external music search server may be used instead of providing the gate server 3 with a music search function.
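Equation (2) can be illustrated with a short sketch of the bookkeeping on the terminal device side. The class name, the method names, and the use of a monotonic clock are assumptions made for illustration. The description specifies only that the reference time t0 is stored when the audio frame n is transmitted and that Δt is the time elapsed since t0.

```python
import time


class PlaybackPositionTracker:
    """Tracks the current reproduction position Tc = tn + Δt."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the sketch can be tested
        self._t0 = None       # reference time t0 (first frame transmitted)
        self._tn = None       # reproduction time tn reported by the server

    def mark_first_frame_sent(self):
        """Store the reference time t0 when audio frame n is transmitted."""
        self._t0 = self._clock()

    def set_reported_position(self, tn):
        """Store the reproduction time tn (seconds from the start of the music)."""
        self._tn = tn

    def current_position(self):
        """Return Tc in seconds, or None until both t0 and tn are known."""
        if self._t0 is None or self._tn is None:
            return None
        delta_t = self._clock() - self._t0
        return self._tn + delta_t
```

For example, if the server reports tn = 42.0 s and 7.5 s have passed since the audio frame n was transmitted, the current reproduction position is 49.5 s.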

In step S159, the process may be ended when one piece of music finishes, or it may be continued when another piece of music is reproduced after the first one finishes. That is, the speech reproduction may be continued as long as the transmission of the reproduced music data from the terminal device 10 to the gate server 3 continues. Thereby, even if the music reproduced from the external source changes, the speech reproduction can be continued, following the new music.

The present invention can be used for an apparatus for playing music.

DESCRIPTION OF SYMBOLS

1 Vehicle
2 Content provider
3 Gate server
4 Network
10, 10x Terminal device
12 Control unit
13 Storage unit
14 Microphone
20 Music playback device
30 Speaker

Claims

1. A playback device that outputs a music signal and a lyrics audio signal representing lyrics of the music signal, the playback device comprising output means for outputting the lyrics audio signal so that an output end timing of the lyrics audio signal is a predetermined time before an output start timing of a lyrics section corresponding to the lyrics audio signal in the music signal.

2. The playback device according to claim 1, further comprising beat detecting means for detecting a beat position of the music signal, wherein the output means matches the output end timing of the lyrics audio signal with a beat position of the music signal.

3. The playback device according to claim 2, wherein the output means matches an output start timing of the lyrics audio signal with a beat position of the music signal.

4. A playback method executed by a playback device that outputs a music signal and a lyrics audio signal representing lyrics of the music signal, the playback method comprising an output step of outputting the lyrics audio signal so that an output end timing of the lyrics audio signal is a predetermined time before an output start timing of a lyrics section corresponding to the lyrics audio signal in the music signal.

5. A program executed by a playback device that includes a computer and outputs a music signal and a lyrics audio signal representing lyrics of the music signal, the program causing the computer to function as output means for outputting the lyrics audio signal so that an output end timing of the lyrics audio signal is a predetermined time before an output start timing of a lyrics section corresponding to the lyrics audio signal in the music signal.

6. A storage medium storing the program according to claim 5.