WO2015060340A1 - Singing voice synthesis - Google Patents

Singing voice synthesis

Info

Publication number
WO2015060340A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
pitch
singing
volume
singing voice
Prior art date
Application number
PCT/JP2014/078080
Other languages
French (fr)
Japanese (ja)
Inventor
土屋 豪
毅彦 川原
純也 浦
Original Assignee
Yamaha Corporation (ヤマハ株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2015060340A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals

Definitions

  • The present invention relates to an apparatus and method for synthesizing a singing voice, and further to a non-transitory computer-readable storage medium storing a program executable by a processor to realize the method.
  • The following technique is known for converting a singer's (user's) singing voice into another person's singing voice: formant sequence data obtained when a specific person (for example, the original singer) sings is stored in advance as source data, and when the singing voice of the singer (user) is converted, a singing voice is synthesized by shaping formants based on the original singer's formant sequence in accordance with the pitch and volume of the singer's (user's) voice (see, for example, Patent Document 1).
  • The present invention has been made in view of the above circumstances, and its purpose is to provide an unprecedented experience in technology that generates, from an input voice (for example, a user's singing voice), a singing voice having a voice quality different from that input voice.
  • To achieve this, a singing voice synthesizing apparatus includes: a pitch detection unit that detects the pitch of an input voice; a volume detection unit that detects the volume of the input voice; and a voice synthesis unit that synthesizes a singing voice based on lyrics data supplied as the performance progresses, controlling the pitch and volume of the synthesized singing voice according to the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.
  • According to this apparatus, the pitch and volume of the synthesized singing voice are controlled according to the pitch and volume detected from the input voice, and the synthesized singing voice is not tied to the way any original singer sings. Since the singing voice is synthesized with a voice quality different from that of the singer (user) while reflecting the pitch and volume of the singer's (user's) own singing, the singer's expressive range is expanded and a new kind of singing experience becomes possible.
  • The voice synthesis unit may synthesize artificial singing voice corresponding to the characters of the lyrics data using speech segment data stored in a library.
  • The voice synthesis unit may synthesize the singing voice at, for example, the same pitch as that detected by the pitch detection unit, or at a pitch shifted from the detected pitch by a predetermined relationship. Likewise, it may synthesize the singing voice at the same volume as that detected by the volume detection unit, or at a volume having a predetermined relationship to the detected volume; alternatively, it may synthesize at a volume corresponding to the detected volume only when the detected volume exceeds a threshold.
  • The apparatus may further include a sound source unit that generates an accompaniment sound as the performance progresses, and an output unit that acoustically outputs the accompaniment sound and the singing voice.
  • The voice synthesis unit may synthesize the singing voice according to the utterance timing of the lyrics data. Alternatively, it may shift the utterance timing of the lyrics data according to the volume detected by the volume detection unit. With this configuration, the singer can control, to some extent, the timing at which the lyrics are synthesized, rather than following the utterance timing defined by the lyrics data, which makes it possible to improvise (ad-lib) the timing of the synthesized singing.
  • The present invention can be embodied not only as a singing voice synthesizing apparatus but also as a computer-implemented method, and as a non-transitory computer-readable storage medium storing a program executable by a processor to realize the method.
  • FIG. 1 is a functional block diagram showing the configuration of the singing voice synthesizing apparatus 10 according to the first embodiment.
  • As shown in this figure, the singing voice synthesizing apparatus 10 is, for example, a notebook or tablet computer, and includes a voice input unit 102, a pitch detection unit 104, a volume detection unit 108, an operation unit 112, a control unit 120, a database 130, a voice synthesis unit 140, a sound source unit 160, and speakers 172 and 174.
  • Of these, the voice input unit 102, the operation unit 112, the voice synthesis unit 140, and the speakers 172 and 174 are implemented in hardware, while the pitch detection unit 104, the volume detection unit 108, the control unit 120, the database 130, and the sound source unit 160 are implemented by a CPU (Central Processing Unit, not shown) executing a preinstalled application program.
  • In practice, the singing voice synthesizing apparatus 10 also has a display unit so that the user can check the status and settings of the apparatus.
  • Although details are omitted, the voice input unit 102 includes a microphone that converts the singing voice of the singer (user) into an electrical singing voice signal, an LPF (Low-Pass Filter) that cuts the high-frequency components of the converted signal, and an A/D converter that converts the filtered singing voice signal into a digital signal.
  • The pitch detection unit 104 performs frequency analysis, for example an FFT (Fast Fourier Transform), on the singing voice signal (input voice) converted into a digital signal, and outputs pitch data indicating the pitch (frequency) obtained by the analysis in almost real time.
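The patent names only "frequency analysis (FFT)" and leaves the rest of the pitch detector open. A minimal sketch of FFT-based pitch estimation, assuming NumPy and simple spectral peak-picking (both are illustrative choices, not the claimed implementation), might look like:

```python
import numpy as np

def detect_pitch(frame, sample_rate):
    """Estimate the pitch of one frame of the digitized singing-voice
    signal by locating the dominant peak of its FFT magnitude spectrum."""
    windowed = frame * np.hanning(len(frame))        # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]                # frequency of strongest bin

# A pure 440 Hz test tone should be detected near 440 Hz
# (accuracy here is limited by the FFT bin width, ~10.8 Hz).
sr = 44100
t = np.arange(4096) / sr
pitch = detect_pitch(np.sin(2 * np.pi * 440.0 * t), sr)
```

A production detector would refine the peak (e.g. with autocorrelation or parabolic interpolation) to get below bin resolution; the sketch only shows the FFT step the text mentions.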
  • The volume detection unit 108 detects the amplitude envelope of the singing voice signal, for example by filtering the digitized signal with a low-pass filter, and outputs volume data indicating the volume of the singing voice almost in real time.
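The envelope follower described here could be sketched as rectification followed by a one-pole low-pass filter; the filter type and 30 Hz cutoff are illustrative assumptions, since the patent only says "low-pass filter":

```python
import numpy as np

def volume_envelope(signal, sample_rate, cutoff_hz=30.0):
    """Track the amplitude envelope of a singing-voice signal by
    full-wave rectifying it and smoothing with a one-pole low-pass."""
    alpha = 1.0 - np.exp(-2 * np.pi * cutoff_hz / sample_rate)
    env = np.empty_like(signal)
    level = 0.0
    for i, x in enumerate(np.abs(signal)):   # full-wave rectification
        level += alpha * (x - level)         # one-pole low-pass smoothing
        env[i] = level
    return env

sr = 8000
t = np.arange(sr) / sr
loud = np.sin(2 * np.pi * 220 * t)           # full-scale tone
quiet = 0.1 * loud                           # the same tone, 20 dB quieter
env_loud = volume_envelope(loud, sr)[-1]
env_quiet = volume_envelope(quiet, sr)[-1]
```

The final envelope values track the two input levels, which is all the downstream threshold comparison (described later for step Sa13) needs.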
  • The operation unit 112 receives operations by the singer, for example an operation for selecting a song to sing, and supplies information indicating the operation to the control unit 120.
  • The database 130 stores music data for a plurality of songs. The music data for one song consists of accompaniment data that defines the song's accompaniment in one or more tracks, and lyrics data indicating the song's lyrics.
  • During a performance, the control unit 120 functions as a sequencer: it interprets the accompaniment data in the music data read from the database 130 and supplies musical tone information, which defines the musical tones to be generated, to the sound source unit 160 in time-series order as the performance progresses from its start.
  • In this embodiment, accompaniment data conforming to the MIDI standard is used: it is defined as a sequence of events and durations, where a duration indicates the time interval between events. The control unit 120 therefore supplies tone information indicating the content of each event to the sound source unit 160 every time the time indicated by the preceding duration elapses.
  • In this way, the control unit 120 advances the performance of the song by interpreting the accompaniment data and supplying musical tone information to the sound source unit 160. The control unit 120 also accumulates the durations from the start of the performance; from this accumulated value it can determine the progress of the performance, that is, which part of the song is currently being played.
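The duration accumulation described above can be sketched as follows; the event names and tick values are hypothetical, since the patent only specifies the MIDI-style event/duration structure:

```python
def play_events(events):
    """Walk a MIDI-style event list (each entry is a duration in ticks
    followed by an event) and pair each event with the accumulated tick
    count -- the value the control unit uses to know how far the
    performance has progressed."""
    elapsed = 0
    timeline = []
    for duration, event in events:
        elapsed += duration          # integrate durations since the start
        timeline.append((elapsed, event))
    return timeline

# Hypothetical three-event accompaniment track.
track = [(0, "note_on C4"), (480, "note_off C4"), (480, "note_on E4")]
timeline = play_events(track)
```

Comparing the accumulated value against a target tick is then enough to decide "which part of the song is being played".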
  • The sound source unit 160 synthesizes a musical tone signal representing the accompaniment sound according to the musical tone information supplied from the control unit 120. (Because it is not always necessary to output an accompaniment sound, the sound source unit 160 is not essential.) The musical tone signal output from the sound source unit 160 is converted into an analog signal by a D/A conversion unit (not shown) and then output acoustically through the speaker 174.
  • As the performance progresses, the control unit 120 supplies musical tone information to the sound source unit 160 and also supplies lyrics data to the voice synthesis unit 140.
  • The voice synthesis unit 140 synthesizes the singing voice according to the lyrics data supplied from the control unit 120, the pitch data supplied from the pitch detection unit 104, and the volume data supplied from the volume detection unit 108, and outputs it as a singing voice signal. The singing voice signal output from the voice synthesis unit 140 is converted into an analog signal by a D/A conversion unit (not shown) and then output acoustically through the speaker 172.
  • FIG. 2 is a diagram showing an example of lyrics data.
  • In this example, the lyrics data of the song "Sakura" is shown together with its melody (the score displayed above the lyrics). Note that the copyright protection period of "Sakura" has already expired under Articles 51 and 57 of the Copyright Act of Japan.
  • The lyrics data consists of a character string in which the characters of the lyrics to be sung are arranged in order from the start of the performance, together with information defining the utterance timing of each character or character group (that is, of each syllable) constituting the lyrics. It is sufficient that the lyrics data contains the characters of the lyrics divided in time so that their temporal arrangement can be identified. The lyrics data may also include information that associates each character of the lyrics with a note of the melody, that is, with the timing at which it is to be sung and the pitch at which it is to be sung.
  • In FIG. 2, one note is assigned to each of the lyric characters 51 to 57 (only characters 51 to 57 are labeled in the figure; the rest are omitted), but depending on the song, a plurality of notes may be assigned to one character or character group (that is, to one syllable), or a plurality of characters or character groups (that is, a plurality of syllables) may be assigned to one note.
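One way to picture the lyrics data described above is as a list of syllable events, each carrying its characters, utterance timing, and note pitch. The field names, tick values, and note numbers below are hypothetical; the patent does not fix a concrete format:

```python
from dataclasses import dataclass

@dataclass
class LyricEvent:
    """One syllable of the lyrics with its note assignment."""
    syllable: str     # character or character group to sing
    tick: int         # utterance timing within the performance
    midi_note: int    # pitch at which the lyrics data says to sing it

# Hypothetical opening of "Sakura": one note per syllable here, but the
# structure also allows several syllables per note or notes per syllable.
sakura = [
    LyricEvent("sa", 0, 69),
    LyricEvent("ku", 480, 69),
    LyricEvent("ra", 960, 71),
]

# Which syllables are due by tick 480 of the performance?
due = [e.syllable for e in sakura if e.tick <= 480]
```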
  • When the performance reaches a singing timing, the control unit 120 supplies the lyric character or character group corresponding to that note (that is, the characters constituting the syllable) and the pitch of the note to the voice synthesis unit 140. Because the control unit 120 accumulates the durations as described above, it can determine whether the accumulated value has reached the value corresponding to a singing timing.
  • When no accompaniment sound is output (when accompaniment data is not used), the progress of the performance cannot be determined from the accumulated durations of the accompaniment data. In that case, the singing timings of the lyrics may themselves be defined, like the accompaniment data, by events (lyric-singing events) and durations indicating the time intervals between them, and whether a singing timing has arrived can be judged from these.
  • The voice synthesis unit 140 synthesizes the characters of the lyrics data supplied from the control unit 120 using speech segment data registered in a library (not shown). In this library, speech segment data defining the waveforms of various speech units that serve as the raw material of singing voice, such as single phonemes and transitions from one phoneme to another, are registered in advance. Specifically, the voice synthesis unit 140 converts the phoneme sequence indicated by the characters of the supplied lyrics data into a speech unit sequence, selects the speech segment data corresponding to those units from the library, connects them to one another, converts the pitch of the connected segment data according to the designated pitch, and synthesizes a singing voice signal representing the singing voice.
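The phoneme-to-unit expansion step can be sketched as below. The "#" silence marker and diphone-style naming are assumptions for illustration; the patent does not specify the library's unit inventory or naming:

```python
def to_unit_sequence(phonemes):
    """Expand a syllable's phoneme list into the speech-unit sequence the
    synthesizer would look up in the library: a transition unit into each
    phoneme, followed by the steady phoneme itself."""
    units = []
    prev = "#"                         # '#' marks preceding silence
    for ph in phonemes:
        units.append(f"{prev}-{ph}")   # transition (e.g. silence-to-/s/)
        units.append(ph)               # stationary part of the phoneme
        prev = ph
    return units

# The syllable "sa" expands to silence->s, s, s->a, a.
units = to_unit_sequence(["s", "a"])
```

The selected units would then be concatenated and resampled or otherwise pitch-shifted to the designated pitch, which the sketch leaves out.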
  • In this example the singing voice is output from the speaker 172 and the accompaniment sound from the separate speaker 174, but the singing voice and the accompaniment sound may instead be mixed and output from the same speaker.
  • When a song is selected, the control unit 120 reads the corresponding music data from the database 130. It interprets the accompaniment data in that music data and supplies musical tone information for the accompaniment to the sound source unit 160, which synthesizes the musical tone signal; in parallel, in synchronization with the progress of the performance, it supplies the lyrics data of the music data to the voice synthesis unit 140, which synthesizes the singing voice signal.
  • That is, when a performance is started, the singing voice synthesizing apparatus 10 executes two processes independently of each other: first, a musical tone synthesis process that synthesizes musical tone signals as the performance progresses, and second, a singing voice synthesis process that synthesizes the singing voice from the lyrics data supplied as the performance progresses.
  • The musical tone synthesis process is one in which the control unit 120 supplies musical tone information as the performance progresses and the sound source unit 160 synthesizes a musical tone signal based on that information. This process itself is well known (see, for example, JP-A-7-199975), so its description is omitted here.
  • When a song is selected via the operation unit 112, the control unit 120 automatically starts supplying the accompaniment data and lyrics data of that song; this serves as the instruction to start the performance. If the performance of another song is already in progress when a song is selected, however, the control unit 120 waits until that song ends before starting the performance of the selected song.
  • FIG. 3 is a flowchart showing the singing voice synthesis process.
  • This singing voice synthesis process is executed by the control unit 120 and the voice synthesis unit 140.
  • In this singing voice synthesis process, the control unit 120 first determines whether the current stage of the performance is a singing timing (step Sa11). If it is not (if the determination in step Sa11 is "No"), the control unit 120 returns to step Sa11; in other words, the process waits at step Sa11 until the performance reaches a singing timing. If the performance has reached a singing timing (if the determination in step Sa11 is "Yes"), the control unit 120 supplies the relevant lyrics data, that is, the data defining the characters and pitch to be sung at that timing, to the voice synthesis unit 140 (step Sa12).
  • When the lyrics data is supplied from the control unit 120, the voice synthesis unit 140 synthesizes voice based on it, but controls the pitch and volume as follows (step Sa13). If the volume indicated by the volume data supplied from the volume detection unit 108 is at or below a threshold, the voice synthesis unit 140 synthesizes the characters of the lyrics data at the pitch given by the lyrics data and at the volume indicated by the volume data, and outputs the result as a singing voice signal. This threshold is a small value, so when the volume indicated by the volume data is at or below it, the singing voice signal, even if output from the speaker 172, is at a level that is negligible to the ear.
  • On the other hand, if the volume exceeds the threshold, the voice synthesis unit 140 changes the pitch of the lyrics data supplied from the control unit 120 to the pitch indicated by the pitch data supplied from the pitch detection unit 104, synthesizes the characters of the lyrics data at the volume indicated by the volume data supplied from the volume detection unit 108, and outputs the result as a singing voice signal. The singing voice signal heard from the speaker 172 is therefore the characters of the lyrics data synthesized at the pitch sung by the singer and with volume changes that follow the changes in the singer's volume.
  • In step Sa14, the control unit 120 determines whether any lyrics data remains to be sung. If some remains (if the determination in step Sa14 is "No"), the control unit 120 returns to step Sa11, so that steps Sa12 and Sa13 are executed when the performance reaches the next singing timing. When no lyrics data remains (if the determination in step Sa14 is "Yes"), the control unit 120 ends the singing voice synthesis process.
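The loop of steps Sa11 to Sa14, including the volume-threshold branch of step Sa13, can be condensed into a sketch. The threshold value, pitches, and volumes below are hypothetical, and real-time waiting is replaced by iterating over precollected per-timing measurements:

```python
def singing_synthesis(lyric_events, frames, threshold=0.05):
    """Sketch of the FIG. 3 loop: at each singing timing, take the
    syllable and notated pitch from the lyrics data (Sa12), then let the
    detected pitch and volume of the input voice override them when the
    detected volume exceeds the threshold (Sa13), until no lyrics data
    remains (Sa14)."""
    output = []
    for (syllable, note_pitch), (sung_pitch, volume) in zip(lyric_events, frames):
        if volume <= threshold:
            # inaudibly quiet input: keep the notated pitch, negligible level
            output.append((syllable, note_pitch, volume))
        else:
            # audible input: follow the singer's pitch and volume instead
            output.append((syllable, sung_pitch, volume))
    return output

events = [("sa", 440.0), ("ku", 440.0)]          # (syllable, notated pitch)
frames = [(452.0, 0.8), (430.0, 0.01)]           # (detected pitch, detected volume)
result = singing_synthesis(events, frames)
```

The first syllable follows the singer's 452 Hz because the volume is above threshold; the second keeps the notated pitch at a negligible level.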
  • FIG. 4 is a diagram showing a specific synthesis example of singing voice. This figure is an example in the case where “Sakura” (see FIG. 2) is selected as a song sung by the singer.
  • In this case, the singing voice shown in (c) of the figure is output. That is, when the singer sings with rising volume at a timing slightly delayed from the start of "sa" (lyric character 51) relative to the progress of the performance, the voice synthesis unit 140 adjusts the amplitude of the singing voice signal according to the volume supplied from the volume detection unit 108, so the "sa" (reference numeral 61) of the singing voice shown in (c) does not fall on the correct singing timing defined by the lyrics data in (a) (lyric character 51).
  • Note that the singing voice synthesizing apparatus 10 of the first embodiment uses only the singer's pitch and volume when synthesizing the singing voice. Therefore, even if the singer (user) sings in a scat or humming style, for example "Ah, Ah" instead of "Sakura, Sakura", the singing voice synthesized by the apparatus still carries the correct lyrics "Sakura, Sakura".
  • In the first embodiment, a singing voice with a voice quality different from the singer's is output while reflecting the singer's intentions (pitch and volume), so the singer's expressive range is expanded and a new kind of singing experience becomes possible. Note, however, that only the pitch and volume of the singing are reflected; the singer's actual singing voice itself is not used at all.
  • <Second Embodiment> The second embodiment described below is configured so that a chorus is formed from the singer's actual singing voice together with synthesized singing voices. It can be summarized as follows: the singer's actual singing voice serves as the root, while voices a third above and a fifth above the root are synthesized, so that a triad sounds even though the singer is singing alone.
  • FIG. 5 is a functional block diagram showing the configuration of the singing voice synthesizing apparatus 10 according to the second embodiment.
  • The singing voice synthesizing apparatus 10 shown in this figure differs from the first embodiment shown in FIG. 1 in that pitch conversion units 106a and 106b are provided, two voice synthesis units 140a and 140b are provided, and a mixer 150 is provided. The description of the second embodiment therefore focuses on these differences.
  • The pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch in a predetermined relationship to it, for example a third above. Similarly, the pitch conversion unit 106b converts the pitch into, for example, a fifth above. A third above the root may be a minor third or a major third, and a fifth above the root may be a perfect fifth, a diminished fifth, or an augmented fifth.
  • For this reason, the pitch conversion units 106a and 106b may tabulate in advance the converted pitch for each root pitch, and convert the pitch indicated by the pitch data supplied from the pitch detection unit 104 by referring to this table.
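The interval conversion can be sketched with semitone offsets under equal temperament (an assumption; the patent only names the intervals, and which third or fifth applies for a given root would come from the precomputed table):

```python
# Semitone offsets for the intervals the text mentions.
INTERVALS = {
    "minor_third": 3,
    "major_third": 4,
    "diminished_fifth": 6,
    "perfect_fifth": 7,
    "augmented_fifth": 8,
}

def shift_pitch(freq_hz, semitones):
    """Convert a detected pitch to one a fixed interval above it;
    in equal temperament one semitone is a frequency ratio of 2**(1/12)."""
    return freq_hz * 2.0 ** (semitones / 12.0)

# From an A4 root (440 Hz): a major third (~554.4 Hz) and a perfect
# fifth (~659.3 Hz) above, as unit 106a / 106b might supply.
third = shift_pitch(440.0, INTERVALS["major_third"])
fifth = shift_pitch(440.0, INTERVALS["perfect_fifth"])
```

A lookup table keyed on the quantized root pitch, as the paragraph suggests, avoids recomputing the ratio per frame and lets the choice between minor/major third (or the three fifths) depend on the root.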
  • The voice synthesis units 140a and 140b are functionally the same as the voice synthesis unit 140 of the first embodiment and receive the same lyrics data from the control unit 120, but the voice synthesis unit 140a is given the pitch converted by the pitch conversion unit 106a, while the voice synthesis unit 140b is given the pitch converted by the pitch conversion unit 106b.
  • The mixer 150 mixes the singing voice signal from the voice input unit 102, the singing voice signal from the voice synthesis unit 140a, and the singing voice signal from the voice synthesis unit 140b. The mixed singing voice signal is converted into an analog signal by a D/A converter (not shown) and then output acoustically through the speaker 172.
  • FIG. 6 is a diagram showing a specific synthesis example of the singing voice according to the second embodiment.
  • This is an example in which "Sakura" (see FIG. 2) is selected as the song, and the singer, listening to the accompaniment as the performance progresses, sings the lyrics indicated by reference numerals 71, 72, 73, ... at the pitches indicated by the keyboard in the left column of the figure, that is, at the pitches and singing timings of the score (lyrics data) shown in the upper part of the figure. In this case, the voice synthesis unit 140a synthesizes voice at a pitch a third above the pitch of the singer's singing, as indicated by reference numerals 61a, 62a, 63a, ..., and the voice synthesis unit 140b synthesizes voice at a pitch a fifth above the pitch of the singer's singing, as indicated by reference numerals 61b, 62b, 63b, ....
  • For example, reference numeral 61a is a minor third above reference numeral 71, and reference numeral 61b is a major third above reference numeral 61a, so reference numerals 71, 61a and 61b form a minor triad. Reference numerals 72, 62a and 62b likewise form a minor triad. Reference numeral 63a is a minor third above reference numeral 73, and reference numeral 63b is a minor third above reference numeral 63a, so reference numerals 73, 63a and 63b form a diminished triad. In this way, when the singer sings at a volume exceeding the threshold and at the pitches and timings shown in the figure, the speaker 172 outputs singing voices forming a triad with the singer's own singing as the root.
  • Note that the voice synthesis is not limited to two systems: a single system converting to one pitch in a predetermined relationship may be used, or three or more systems may be used. In the second embodiment, the singer's own singing voice and the singing voices from the voice synthesis units 140a and 140b are mixed and output from the speaker 172, while the accompaniment sound from the sound source unit 160 is output from the separate speaker 174.
  • The pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch in a predetermined relationship, and this relationship may be changed by an instruction from the control unit 120 or the operation unit 112. The same applies to the pitch conversion unit 106b: the pitch relationship into which it converts may be changed by an instruction from the control unit 120 or the operation unit 112.
  • <Third Embodiment> In the first embodiment, when the performance reaches a singing timing, the data (characters, pitch) to be sung at that timing is supplied from the lyrics data to the voice synthesis unit 140; from the singer's point of view, the timing of the synthesized lyrics could not be controlled. In the third embodiment described below, by contrast, the singer can control the timing of the synthesized lyrics to some extent.
  • FIG. 7 is a functional block diagram showing the configuration of the singing voice synthesizing apparatus 10 according to the third embodiment.
  • The singing voice synthesizing apparatus 10 shown in this figure differs from the first embodiment shown in FIG. 1 in that the volume data output from the volume detection unit 108 is supplied to the control unit 120 as well as to the voice synthesis unit 140. The description of the third embodiment therefore focuses on this difference.
  • In the third embodiment, the control unit 120 supplies the lyrics data corresponding to the next note to the voice synthesis unit 140, triggered by the volume indicated by the volume data supplied from the volume detection unit 108 exceeding a threshold, or by the temporal change in that volume exceeding a predetermined value. That is, even if the performance has not yet reached the singing timing of that lyrics data, the control unit 120 supplies the lyrics data corresponding to the next note to the voice synthesis unit 140 when, for example, the singer's singing volume exceeds the threshold.
  • A specific synthesis example of the singing voice according to the third embodiment will now be described with reference to FIG. 4. Suppose the singer selects "Sakura" as the song to sing and, listening to the accompaniment, sings at the volume shown in (b) of the figure as the performance progresses. In the third embodiment, a singing voice is then output as shown in (d) of the figure.
  • For example, the control unit 120 supplies the lyrics data of the next "sa" (reference numeral 54) to the voice synthesis unit 140 in response to the change in the volume data supplied from the volume detection unit 108, so that "sa" (reference numeral 64) is synthesized at a timing earlier than the singing timing defined by the lyrics data.
  • The supply of lyrics data was triggered here by the volume indicated by the volume data exceeding a threshold, or by the temporal change in the volume exceeding a predetermined value; it may instead be triggered when the slope (acceleration) of the temporal change of the volume exceeds a predetermined value.
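The trigger conditions just described can be sketched over a sequence of volume frames; the threshold and slope limit values are hypothetical, as the patent does not fix them:

```python
def lyric_trigger(volumes, threshold=0.3, slope_limit=0.15):
    """Return the frame indices at which the next syllable would be
    released: when the volume crosses the threshold upward, or when its
    frame-to-frame change (slope) exceeds slope_limit."""
    triggers = []
    prev = 0.0
    for i, v in enumerate(volumes):
        crossed = prev <= threshold < v        # upward threshold crossing
        steep = (v - prev) > slope_limit       # sudden swell in volume
        if crossed or steep:
            triggers.append(i)
        prev = v
    return triggers

# A quiet passage, a sudden swell, a dip, then another rise.
hits = lyric_trigger([0.0, 0.05, 0.5, 0.55, 0.1, 0.35])
```

Each returned index is a point where the control unit would hand the next note's lyrics data to the voice synthesis unit ahead of (or at) its notated timing.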
  • Alternatively, the pitch data output from the pitch detection unit 104 may also be supplied to the control unit 120, and while, for example, the pitch indicated by that pitch data continues unchanged, the control unit 120 may withhold the next lyrics data from the voice synthesis unit 140 for a predetermined time (or until the volume drops). With this configuration, the singer can have the current syllable of the synthesized singing voice sustained beyond the timing defined by the lyrics data.
  • In the third embodiment, the singer can thus control the timing of the synthesized lyrics to some extent, rather than following the timing defined by the lyrics data, and can therefore improvise (ad-lib) changes to the timing of the synthesized singing.
  • The third embodiment is not restricted to being combined with the first embodiment; it may also be combined with the second embodiment, which mixes the singer's own singing with the synthesized singing.
  • In the embodiments, the control unit 120 supplies the lyrics data (characters and pitches) corresponding to a singing timing to the voice synthesis unit 140 when the performance reaches that timing; however, the control unit 120 need not supply the pitch to the voice synthesis unit 140. This is because the voice synthesis unit 140 outputs substantially no singing voice signal when the volume indicated by the volume data is at or below the threshold, and when the volume exceeds the threshold it replaces the pitch of the lyrics data with the pitch indicated by the pitch data output from the pitch detection unit 104.
  • In other words, the voice synthesis unit 140 may synthesize the characters of the lyrics data supplied from the control unit 120 at the pitch indicated by the pitch data of the input voice and at a volume corresponding to the input voice, whenever the volume indicated by the volume data of the input voice exceeds the threshold.
  • In the embodiments, MIDI data is used as the accompaniment data, but the present invention is not limited to this; for example, the musical tone signal may be obtained by playing back a compact disc. In that case, elapsed-time information and remaining-time information can be used to determine the progress of the performance, and the control unit 120 need only supply the lyrics data to the voice synthesis unit 140 (140a, 140b) according to the progress so determined.
  • the voice input unit 102 is configured to input a singer's singing with a microphone and convert it into a singing voice signal.
  • the singing voice signal (input voice) is input or inputted in some form. Any configuration can be used.
  • the voice input unit 102 may be configured to input a singing voice signal processed by another processing unit, a singing voice signal supplied (or transferred) from another device, or simply a singing voice. It may be an input interface circuit that receives a signal and transfers it to a subsequent stage.
  • the input voice is not limited to a voice uttered by the user of the singing voice synthesizing apparatus, but may be a voice uttered by another person (a friend or a third party).
  • the pitch detection unit 104, the pitch conversion units 106a and 106b, and the volume detection unit 108 are implemented in software, but may be implemented in hardware. Likewise, the voice synthesis unit 140 (140a, 140b) may be implemented in software. Furthermore, in addition to controlling the pitch and volume of the synthesized singing voice according to the pitch and volume of the input voice, other voice elements such as timbre may be controlled according to the pitch and/or volume of the input voice.
  • the processor according to the present invention is not limited to a processor that executes a software program, such as the CPU described in the above embodiments; it may be a processor that executes a microprogram, such as a DSP, or a processor built from dedicated hardware circuits (an integrated circuit or a group of discrete circuits) so as to realize the desired processing functions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

A user arbitrarily sings the melody of a given song with the lyrics of that song, made-up lyrics, scat syllables, etc. A pitch detection unit (104) receives the voice of the singing user as input voice and detects its pitch. A volume detection unit (108) detects the volume of the input voice. A voice synthesis unit (140) synthesizes an artificial singing voice on the basis of lyrics data supplied according to the progress of the performance, and controls the pitch and volume of the synthesized singing voice according to the detected pitch and volume of the input voice. The synthesized singing voice is acoustically output. An artificial singing voice is thus generated according to the pitch the user vocalizes.

Description

Singing voice synthesis
The present invention relates to an apparatus and a method for synthesizing a singing voice, and further to a non-transitory computer-readable storage medium storing a program executable by a processor for implementing the method.
Conventionally, the following technique is known for converting a singer's (user's) singing voice into another person's singing voice: formant sequence data obtained when a specific person (for example, the original singer) sang is stored in advance as source data, and when the singer's (user's) singing voice is converted, a singing voice is synthesized by shaping formants based on the original singer's formant sequence in accordance with the pitch and volume of the singer's (user's) singing voice (see, for example, Patent Document 1).
JP-A-10-268895
In the above technique, since formants based on the original singer's formant sequence data are shaped, the influence of the original singer's way of singing inevitably remains in the output singing voice. As a result, the singing user cannot obtain a sufficient or varied singing experience. On the other hand, it is also known to synthesize a singing voice artificially by voice synthesis technology. However, conventional artificial singing voice synthesis techniques synthesize a singing voice corresponding to lyrics data that the user enters with a keyboard or the like. The user therefore cannot intervene in the artificial singing voice synthesis with his or her own voice, and the sense of participation is poor.
The present invention has been made in view of the above circumstances, and an object thereof is to provide an unprecedented experience in the technology of generating, from an input voice (for example, a user's singing voice), a singing voice having a voice quality different from that of the input voice. Another object is to prevent the synthesized singing voice from being influenced by the original singer's way of singing.
To achieve the above objects, a singing voice synthesizing apparatus according to the present invention includes: a pitch detection unit that detects the pitch of an input voice; a volume detection unit that detects the volume of the input voice; and a voice synthesis unit that synthesizes a singing voice based on lyrics data supplied in accordance with the progress of a performance, the voice synthesis unit controlling the pitch and volume of the synthesized singing voice in accordance with the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.
According to this configuration, while a singing voice is artificially synthesized based on the lyrics data, the pitch and volume of the synthesized singing voice are controlled in accordance with the pitch and volume detected from the input voice. There is therefore no notion of an original singer's way of singing, and the synthesized singing voice is not influenced by it. Furthermore, since the singing voice is synthesized with a voice quality different from the singer's (user's) while reflecting the pitch and volume of the singer's (user's) singing, the singer can broaden his or her singing expression and enjoy a new singing experience.
In one embodiment, the voice synthesis unit may synthesize an artificial singing voice corresponding to the characters of the lyrics data using voice segment data stored in a library.
The voice synthesis unit may synthesize the singing voice, for example, at the same pitch as the pitch detected by the pitch detection unit, or at a pitch shifted in a predetermined relationship from the detected pitch. Likewise, the voice synthesis unit may synthesize the singing voice at the same volume as the volume detected by the volume detection unit, at a volume having a predetermined relationship with the detected volume, or in accordance with the detected volume when it exceeds a threshold.
In one embodiment, the apparatus may further include a sound source unit that generates an accompaniment sound in accordance with the progress of the performance, and an output unit that acoustically outputs the accompaniment sound and the singing voice. According to this configuration, the singing voice synthesized by the voice synthesis unit and the accompaniment sound corresponding to the progress of the performance are output, allowing the singer to experience a new way of singing. The output unit may further output the input voice acoustically.
In an embodiment, the voice synthesis unit may synthesize the singing voice in accordance with the utterance timing defined in the lyrics data. Alternatively, the voice synthesis unit may synthesize the singing voice while changing the singing timing of the lyrics data in accordance with the volume detected by the volume detection unit. According to the latter configuration, the singer can control the synthesized lyric voice to some extent, rather than it following exactly the utterance timing defined by the lyrics data. This makes it possible to vary the timing of the synthesized singing in an improvised (ad-lib) manner.
The present invention can be embodied not only as a singing voice synthesizing apparatus, but also as a computer-implemented method, and further as a non-transitory computer-readable storage medium storing a program executable by a processor for carrying out the method.
FIG. 1 is a functional block diagram showing the configuration of a singing voice synthesizing apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing lyrics data and the like in the singing voice synthesizing apparatus.
FIG. 3 is a flowchart showing a singing voice synthesis process in the singing voice synthesizing apparatus.
FIG. 4 is a diagram showing an output example of a singing voice in the singing voice synthesizing apparatus.
FIG. 5 is a functional block diagram showing the configuration of a singing voice synthesizing apparatus according to a second embodiment of the present invention.
FIG. 6 is a diagram showing an output example of a singing voice in the singing voice synthesizing apparatus.
FIG. 7 is a functional block diagram showing the configuration of a singing voice synthesizing apparatus according to a third embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
<First Embodiment>
FIG. 1 is a functional block diagram showing the configuration of a singing voice synthesizing apparatus 10 according to the first embodiment. In this figure, the singing voice synthesizing apparatus 10 is a computer such as a notebook or tablet computer and includes a voice input unit 102, a pitch detection unit 104, a volume detection unit 108, an operation unit 112, a control unit 120, a database 130, a voice synthesis unit 140, a sound source unit 160, and speakers 172 and 174. Among these functional blocks, the voice input unit 102, the operation unit 112, the voice synthesis unit 140, and the speakers 172 and 174, for example, are implemented in hardware, while the pitch detection unit 104, the volume detection unit 108, the control unit 120, the database 130, and the sound source unit 160 are implemented by a CPU (Central Processing Unit, not shown) executing a pre-installed application program. Although not shown, the singing voice synthesizing apparatus 10 also has a display unit that allows the user to check the status and settings of the apparatus.
Although details are omitted, the voice input unit 102 includes a microphone that converts the singing voice of a singer (user) into an electrical singing voice signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted singing voice signal, and an A/D converter that converts the filtered singing voice signal into a digital signal.
The pitch detection unit 104 performs frequency analysis on the singing voice signal (input voice) converted into a digital signal, and outputs, in near real time, pitch data indicating the pitch (frequency) obtained by the analysis. For the frequency analysis, FFT (Fast Fourier Transform) or other known methods can be used.
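Since the description leaves the frequency analysis to FFT "or other known methods," here is a minimal, illustrative sketch of one such known method: frame-based pitch estimation by time-domain autocorrelation. This is not the patent's implementation; the function name, frame size, and search range are assumptions for illustration only.

```python
import math

def detect_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of one frame by finding the
    lag (within the vocal range) that maximizes the autocorrelation."""
    n = len(samples)
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic test frame: a 220 Hz sine sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 220 * t / rate) for t in range(1024)]
pitch = detect_pitch(frame, rate)  # close to 220 Hz (8000/36 ≈ 222.2)
```

In a real-time system this function would run repeatedly on short overlapping frames, producing the stream of pitch data that the pitch detection unit 104 supplies.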
The volume detection unit 108 detects the amplitude envelope of the singing voice signal, for example by filtering the digitized singing voice signal with a low-pass filter, and outputs, in near real time, volume data indicating the volume of the singing voice. The operation unit 112 receives operations by the singer, for example an operation of selecting the song to sing, and supplies information indicating the operation to the control unit 120. The database 130 stores music data for a plurality of songs. The music data for one song consists of accompaniment data that defines the accompaniment sound of the song in one or more tracks, and lyrics data indicating the lyrics of the song.
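One common way to realize the low-pass-filter approach described here is to rectify the signal and smooth it with a one-pole filter. The following sketch illustrates the idea; the smoothing coefficient is an assumed value, not one given in the patent.

```python
def volume_envelope(samples, alpha=0.05):
    """Rectify the signal, then smooth with a one-pole low-pass filter
    to track the amplitude envelope in (near) real time."""
    env, out = 0.0, []
    for s in samples:
        env += alpha * (abs(s) - env)  # y[n] = y[n-1] + alpha * (|x[n]| - y[n-1])
        out.append(env)
    return out

# A burst of full-scale samples followed by silence: the envelope rises
# toward 1.0 during the burst, then decays back toward 0.0.
signal = [1.0] * 200 + [0.0] * 200
env = volume_envelope(signal)
```

The resulting envelope values are the "volume data" that can then be compared against the threshold described later.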
In addition to managing the database 130, the control unit 120 functions as a sequencer while a performance is in progress. Functioning as a sequencer, the control unit 120 interprets the accompaniment data in the music data read from the database 130 and supplies musical tone information, which defines the musical tones to be generated, to the sound source unit 160 in chronological order as the performance progresses from its start. Here, accompaniment data conforming, for example, to the MIDI standard is used. When conforming to the MIDI standard, the accompaniment data is defined by combinations of events and durations indicating the time intervals between events. The control unit 120 therefore supplies the musical tone information indicating the content of an event to the sound source unit 160 each time the time indicated by a duration elapses. In other words, the control unit 120 advances the performance of the song by interpreting the accompaniment data and supplying musical tone information to the sound source unit 160. When interpreting the accompaniment data, the control unit 120 also accumulates the durations from the start of the performance. From this accumulated value, the control unit 120 can grasp the progress of the performance, that is, which part of the song is being played.
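The duration accumulation described above can be sketched as follows (event strings and tick values are hypothetical placeholders; real MIDI parsing is omitted): converting (duration, event) pairs into absolute times yields both the playback schedule and, at any moment, the accumulated value that tells which part of the song is playing.

```python
def schedule(track):
    """Turn (duration, event) pairs into (absolute_time, event) pairs by
    accumulating the durations from the start of the performance."""
    elapsed, out = 0, []
    for duration, event in track:
        elapsed += duration          # accumulated duration = performance progress
        out.append((elapsed, event))
    return out

# Toy track: durations in ticks, events as placeholder strings.
track = [
    (0, "note_on C4"), (480, "note_off C4"),
    (0, "note_on E4"), (480, "note_off E4"),
]
timeline = schedule(track)
```

Comparing the current accumulated value against such a timeline is what lets the control unit 120 decide when each event (and, analogously, each singing timing) has arrived.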
The sound source unit 160 synthesizes a musical tone signal representing the accompaniment sound in accordance with the musical tone information supplied from the control unit 120. In the present embodiment, the accompaniment sound does not necessarily have to be output, so the sound source unit 160 is not essential. The musical tone signal output from the sound source unit 160 is converted into an analog signal by a D/A converter (not shown) and is then acoustically converted and output by the speaker 174.
In addition to supplying musical tone information to the sound source unit 160, the control unit 120 supplies lyrics data to the voice synthesis unit 140 as the performance progresses. The voice synthesis unit 140 synthesizes a singing voice in accordance with the lyrics data supplied from the control unit 120, the pitch data supplied from the pitch detection unit 104, and the volume data supplied from the volume detection unit 108, and outputs it as a singing voice signal. The singing voice signal output from the voice synthesis unit 140 is converted into an analog signal by a D/A converter (not shown) and is then acoustically converted and output by the speaker 172.
FIG. 2 is a diagram showing an example of lyrics data. In this example, the lyrics data of the song "Sakura" is shown together with its melody (the score displayed above the lyrics). Note that the copyright protection period of the song "Sakura" has already expired under Articles 51 and 57 of the Copyright Act of Japan.
As shown in the figure, the lyrics data includes a character string in which the character information of the lyrics to be sung is arranged in order from the start of the performance, and information defining the utterance timing of each character or character group constituting the lyrics (that is, the utterance timing of each syllable). As an example, the lyrics data may include character information indicating the lyrics, divided in time so that the temporal placement of each character or character group can be identified. The lyrics data may also include information associating each character of the lyrics with a note of the melody, that is, with the singing timing and the pitch at which the lyrics are to be sung. In the example of FIG. 2, one note is assigned to each of the lyric characters 51 to 57 (characters 51 through 57 are shown in the figure; the rest are omitted), but depending on the song (lyrics), multiple notes may be assigned to one character or character group (one syllable), or multiple characters or character groups (multiple syllables) may be assigned to one note. When the progress of the performance reaches the singing timing (utterance timing) indicated by a note, the control unit 120 supplies the voice synthesis unit 140 with data indicating the lyric character or character group corresponding to that note (that is, the characters constituting the syllable) and the pitch of the note.
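As a concrete (hypothetical) illustration of lyrics data of this kind, each syllable can be stored with its singing timing and the pitch of its assigned note. The field names, tick values, and MIDI note numbers below are assumptions for illustration, not the patent's actual format.

```python
# One entry per syllable: character(s), singing timing (ticks from the
# start of the performance), and the pitch of the assigned note.
lyrics_data = [
    {"chars": "sa", "onset": 0,   "pitch": 69},  # A4
    {"chars": "ku", "onset": 480, "pitch": 69},  # A4
    {"chars": "ra", "onset": 960, "pitch": 71},  # B4
]

def lyric_at(lyrics_data, elapsed):
    """Return the syllable whose singing timing equals the current
    accumulated performance time, or None if no timing has arrived."""
    for entry in lyrics_data:
        if entry["onset"] == elapsed:
            return entry
    return None
```

With such a structure, the control unit can look up, at each tick of progress, whether a syllable and its note pitch are due to be handed to the voice synthesis unit.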
Whether the progress of the performance has reached a singing timing can be determined by the control unit 120 if the accumulated duration value used in interpreting the accompaniment data is associated in advance with the singing timings of the lyrics data: the control unit 120 checks whether the accumulated value has reached the value associated with a singing timing. On the other hand, when no accompaniment sound is output (when no accompaniment data is used), the progress of the performance cannot be grasped from the accumulated durations of the accompaniment data. In that case, the singing timings of the lyrics may be defined, like the accompaniment data, by events (lyric singing events) and durations indicating the time intervals between them, and whether a singing timing has arrived may be determined by whether an event to be sung has arrived in the lyrics data.
In FIG. 1, the voice synthesis unit 140 synthesizes the characters of the lyrics data supplied from the control unit 120 using voice segment data registered in a library (not shown). In this library, voice segment data defining the waveforms of various voice segments that serve as the material of the singing voice, such as single phonemes and transitions from one phoneme to another, is registered in advance. Specifically, the voice synthesis unit 140 converts the phoneme sequence indicated by the characters of the supplied lyrics data into a sequence of voice segments, selects the voice segment data corresponding to these segments from the library, concatenates them, and converts the pitch of each piece of concatenated voice segment data to the designated pitch, thereby synthesizing a singing voice signal representing the singing voice. The pitch and volume of the singing voice in the voice synthesis unit 140 are described later.
In this embodiment, the singing voice is output from the speaker 172 and the accompaniment sound from the speaker 174 separately, but the singing voice and the accompaniment sound may instead be mixed and output from the same speaker.
Next, the operation of the singing voice synthesizing apparatus 10 according to this embodiment will be described. In this apparatus, when the singer operates the operation unit 112 to select a desired song, the control unit 120 reads the music data corresponding to the song from the database 130. It interprets the accompaniment data of the music data and supplies the musical tone information of the accompaniment sounds to be synthesized to the sound source unit 160, causing the sound source unit 160 to synthesize the musical tone signal; meanwhile, it supplies the lyrics data of the music data to the voice synthesis unit 140 as the performance progresses, causing the voice synthesis unit 140 to synthesize the singing voice signal. That is, when a performance starts in the singing voice synthesizing apparatus 10, two processes run independently of each other: first, a musical tone synthesis process that synthesizes the musical tone signal as the performance progresses, and second, a singing voice synthesis process that synthesizes the singing voice by supplying lyrics data as the performance progresses. Of these, the musical tone synthesis process, in which the control unit 120 supplies musical tone information as the performance progresses while the sound source unit 160 synthesizes the musical tone signal based on that information, is itself well known (see, for example, JP-A-7-199975). Its details are therefore omitted, and only the singing voice synthesis process is described below.
When a song is selected via the operation unit 112, the control unit 120 automatically starts supplying the accompaniment data and lyrics data of the song, which constitutes an instruction to start the performance of the song. However, even when a song is selected, if another song is still being performed, the control unit 120 makes the selected song wait until the other song ends.
FIG. 3 is a flowchart showing the singing voice synthesis process. This process is executed by the control unit 120 and the voice synthesis unit 140. When the performance starts, the control unit 120 first determines whether the progress stage of the performance is a singing timing (step Sa11).
If it determines that the progress stage of the performance is not a singing timing (if the result of step Sa11 is "No"), the control unit 120 returns the procedure to step Sa11. In other words, it waits at step Sa11 until the progress stage of the performance reaches a singing timing. On the other hand, if it determines that the progress stage of the performance has reached a singing timing (if the result of step Sa11 is "Yes"), the control unit 120 supplies the lyrics data, that is, the data defining the character and pitch to be sung at that singing timing, to the voice synthesis unit 140 (step Sa12).
When lyrics data is supplied from the control unit 120, the voice synthesis unit 140 synthesizes a voice based on that lyrics data, controlling the pitch and volume as follows (step Sa13). If the volume indicated by the volume data supplied from the volume detection unit 108 is at or below a threshold, the voice synthesis unit 140 synthesizes the characters of the lyrics data at the pitch of the lyrics data and at the volume indicated by the volume data supplied from the volume detection unit 108, and outputs the result as a singing voice signal. This threshold is a small value; therefore, when the volume indicated by the volume data is at or below the threshold, even if the singing voice signal is output from the speaker 172, it is output at a level that is negligible to the ear.
On the other hand, when lyrics data is supplied from the control unit 120 and the volume indicated by the volume data is greater than the threshold, the voice synthesis unit 140 changes the pitch of the lyrics data supplied from the control unit 120 to the pitch indicated by the pitch data supplied from the pitch detection unit 104, synthesizes the characters of the lyrics data at the volume indicated by the volume data supplied from the volume detection unit 108, and outputs the result as a singing voice signal. The singing voice signal heard from the speaker 172 is therefore a synthesis of the characters of the lyrics data at the pitch at which the singer sang, with volume changes that follow the changes in the volume at which the singer sang.
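The pitch and volume decision of step Sa13 amounts to the following rule. The threshold value and the function signature are illustrative assumptions; the description only says the threshold is "a small value."

```python
THRESHOLD = 0.05  # "a small value" per the description; exact value assumed

def synthesis_pitch_and_volume(lyric_pitch, input_pitch, input_volume):
    """Step Sa13: at or below the threshold, keep the pitch from the lyrics
    data (the output is inaudibly quiet anyway); above it, replace it with
    the pitch detected from the singer's input voice. The detected volume
    is used in both cases, so the output follows the singer's dynamics."""
    if input_volume <= THRESHOLD:
        return lyric_pitch, input_volume
    return input_pitch, input_volume
```

For example, with a quiet input the score pitch is kept, while a loud input pulls the synthesized pitch to the singer's detected pitch.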
Meanwhile, after supplying the lyrics data whose singing timing has arrived to the voice synthesis unit 140, the control unit 120 determines whether there is no lyrics data to be sung next (step Sa14). If there is (if the result of step Sa14 is "No"), the control unit 120 returns the procedure to step Sa11. As a result, the processing of steps Sa12 and Sa13 is executed when the progress of the performance reaches the next singing timing. Finally, if there is no data to be sung next (if the result of step Sa14 is "Yes"), the control unit 120 ends the singing voice synthesis process.
FIG. 4 is a diagram showing a specific synthesis example of a singing voice, for the case where the singer has selected "Sakura" (see FIG. 2) as the song to sing. When the singer, listening to the accompaniment and following the progress of the performance, sings at the volume shown in (b), this embodiment outputs the singing voice as shown in (c). That is, when the singer raises the volume at a timing slightly later than the beginning of "sa" (lyric character 51) relative to the progress of the performance, the voice synthesis unit 140 adjusts the amplitude of the singing voice signal to that volume only once the volume indicated by the volume data supplied from the volume detection unit 108 exceeds the threshold; the "sa" (61) of the singing voice shown in (c) therefore does not follow the correct singing timing defined by the lyrics data (lyric character 51) in (a).
 Likewise, when the singer lowers the volume between "ku" (lyric character 52) and "ra" (lyric character 53) relative to the progress of the performance (or moves the microphone of the voice input unit 102 away from the mouth), a gap appears between "ku" (symbol 62) and "ra" (symbol 63-1) in the singing voice of (c). When the singer lowers the volume in the middle of "ra" (lyric character 53), "ra" in the singing voice of (c) is, for the same reason, split into symbols 63-1 and 63-2. Note that the later "ra" (symbol 63-2) is written as "ra" for convenience of explanation, but is actually heard as "a", the vowel of "ra".
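 The gating behavior just described can be sketched in a few lines of Python. The threshold value and the frame volumes below are illustrative assumptions, not figures taken from the specification.

```python
THRESHOLD = 0.1  # assumed minimum input volume for the synthesizer to sound

def gate_amplitude(volume_frames, threshold=THRESHOLD):
    """Follow the input volume while it exceeds the threshold;
    output silence (0.0) otherwise."""
    return [v if v > threshold else 0.0 for v in volume_frames]

# A late entry and a mid-note fade produce a delayed onset and a gap,
# analogous to symbols 61 and 62/63-1 in FIG. 4(c).
print(gate_amplitude([0.0, 0.05, 0.4, 0.6, 0.05, 0.0, 0.5]))
# -> [0.0, 0.0, 0.4, 0.6, 0.0, 0.0, 0.5]
```

 Because the output amplitude simply tracks the gated input volume, any frame where the singer is too quiet yields silence, which is exactly why the onsets and gaps in FIG. 4(c) deviate from the score.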
 The example of FIG. 4 has been explained from the viewpoint of how the singing voice is synthesized in response to the volume at which the singer sings. The example does not show at what pitch the singing voice is synthesized in response to the pitch at which the singer sings, but this requires no particularly detailed explanation: the voice is synthesized at the pitch of the input voice produced by the singer (user).
 The singing synthesis apparatus 10 of the first embodiment uses only the singer's pitch and volume when synthesizing the singing voice. Therefore, even if the singer (user) does not sing the lyrics "Sakura, Sakura..." but instead sings scat- or humming-style syllables such as "ah, ah, ah...", or sings nonsense lyrics, the singing voice synthesized by the singing synthesis apparatus 10 carries the correct lyrics "Sakura, Sakura...".
 When formant sequence data as described in the background art is used, data must be collected from the original singer's performance. Moreover, because the formants based on the formant sequence data are shaped according to the pitch and volume at which the singer sings, the result inevitably reflects the original singer's style of singing. In contrast, the present embodiment synthesizes the singing voice from a library of speech segments, so it is not influenced by how any model person sings, and no model person needs to sing the song in the first place; in addition, the singing voice can be synthesized faithfully to the pitch and volume at which the singer (user) actually sang on the spot. Furthermore, according to the present embodiment, a singing voice synthesized with a voice quality different from the singer's own is output while reflecting the singer's intention (pitch and volume), so the singer's expressive range is extended and a new singing experience is provided.
<Second Embodiment>
 The first embodiment synthesizes the singing voice by reflecting the pitch and volume of the singer's singing; information other than pitch and volume, in short the singer's actual singing voice itself, is not used at all. In contrast, the second embodiment described next is configured so that the singer's actual singing voice and the synthesized singing voice form a chorus. In outline, the second embodiment takes, for example, the singer's actual singing voice as the root note and synthesizes one voice a third above that root and another a fifth above it, so that the singer harmonizes in a triad even though singing alone.
 FIG. 5 is a functional block diagram showing the configuration of the singing synthesis apparatus 10 according to the second embodiment. The apparatus shown in this figure differs from the first embodiment of FIG. 1 in that pitch conversion units 106a and 106b are provided, that two voice synthesis units 140a and 140b are provided, and that a mixer 150 is provided. The description of the second embodiment therefore focuses on these differences.
 The pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch in a predetermined relationship to it, for example a pitch a third above, and supplies the result to the voice synthesis unit 140a. The pitch conversion unit 106b converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch in another predetermined relationship, for example a pitch a fifth above, and supplies the result to the voice synthesis unit 140b. Note that a third above the root may be a minor third or a major third, and a fifth above the root may be a perfect fifth, a diminished fifth, or an augmented fifth. Which interval applies is determined by the pitch of the root (and the key signature), so the pitch conversion units 106a and 106b may, for example, hold a precomputed table mapping each root pitch to its converted pitch and convert the pitch indicated by the pitch data supplied from the pitch detection unit 104 by referring to that table.
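 Such a conversion table can be sketched as follows. The choice of C major and MIDI note numbers is an illustrative assumption (the patent specifies neither); the offsets are the ordinary diatonic intervals of that key.

```python
# Semitone offset to the diatonic third above, keyed by pitch class in C major.
THIRD_ABOVE = {0: 4, 2: 3, 4: 3, 5: 4, 7: 4, 9: 3, 11: 3}   # C D E F G A B
# Semitone offset to the diatonic fifth above (B receives a diminished fifth).
FIFTH_ABOVE = {0: 7, 2: 7, 4: 7, 5: 7, 7: 7, 9: 7, 11: 6}

def transpose(midi_note, table):
    """Look up the diatonic offset for the note's pitch class and apply it."""
    return midi_note + table[midi_note % 12]

# Root E4 (MIDI 64): the third above is G4 (67) and the fifth above is B4 (71),
# so the three voices together form an E minor triad.
print(transpose(64, THIRD_ABOVE), transpose(64, FIFTH_ABOVE))
# -> 67 71
```

 The table lookup is what lets the same "a third above" instruction yield a minor third on some scale degrees and a major third on others, as the paragraph above requires.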
 The voice synthesis units 140a and 140b are functionally equivalent to the voice synthesis unit 140 of the first embodiment and receive the same lyric data from the control unit 120, but the voice synthesis unit 140a is given the pitch converted by the pitch conversion unit 106a, while the voice synthesis unit 140b is given the pitch converted by the pitch conversion unit 106b. The mixer 150 mixes the singing voice signal from the voice input unit 102, the singing voice signal from the voice synthesis unit 140a, and the singing voice signal from the voice synthesis unit 140b. The mixed singing voice signal is converted into an analog signal by a D/A conversion unit (not shown) and then acoustically converted and output by the speaker 172.
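 The mixer's role can be sketched as plain sample-wise summation with hard clipping; the list-based buffers here are an illustrative stand-in for real PCM audio frames.

```python
def mix(*signals):
    """Mix equal-length sample sequences by summation, clamped to [-1, 1]."""
    return [max(-1.0, min(1.0, sum(samples))) for samples in zip(*signals)]

singer = [0.2, 0.5, -0.3]    # voice input unit 102
harmony3 = [0.1, 0.4, -0.2]  # voice synthesis unit 140a (third above)
harmony5 = [0.1, 0.3, -0.1]  # voice synthesis unit 140b (fifth above)
print(mix(singer, harmony3, harmony5))
```

 A production mixer would also apply per-channel gains before summing; the clamp merely prevents the summed signal from exceeding full scale.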
 FIG. 6 shows a specific example of singing voice synthesis according to the second embodiment. The figure illustrates the case where the singer selects "Sakura" (see FIG. 2) and, while listening to the accompaniment and following the progress of the performance, sings the lyrics indicated by symbols 71, 72, 73, ... at the pitches indicated by the keyboard in the left column of the figure, that is, at the pitches and singing timings of the score (lyric data) shown in the upper part of the figure. In this case, the voice synthesis unit 140a synthesizes a voice a third above the sung pitch, as indicated by symbols 61a, 62a, 63a, ..., and the voice synthesis unit 140b synthesizes a voice a fifth above the singer's sung pitch, as indicated by symbols 61b, 62b, 63b, ....
 In the example of FIG. 6, symbol 61a is a minor third above symbol 71 in C major, and symbol 61b is a major third above symbol 61a, so symbols 71, 61a, and 61b form a minor triad. Symbols 72, 62a, and 62b likewise form a minor triad. Symbol 63a is a minor third above symbol 73, and symbol 63b is a minor third above symbol 63a, so symbols 73, 63a, and 63b form a diminished triad. Thus, when the singer sings at a volume exceeding the threshold and at the pitches and timings of the score shown in the figure, the speaker 172 outputs a singing voice harmonized as a triad whose root is the singer's own singing.
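 The interval arithmetic above can be checked with a small classifier. The concrete MIDI notes used (E4 for symbol 71, B4 for symbol 73) are assumptions chosen to match the C-major description, not values given in the figure.

```python
def triad_quality(root, third, fifth):
    """Classify a three-note chord by its stacked semitone intervals."""
    lower, upper = third - root, fifth - third
    return {(3, 4): "minor", (4, 3): "major",
            (3, 3): "diminished", (4, 4): "augmented"}.get((lower, upper), "other")

# E-G-B (symbols 71, 61a, 61b): minor third then major third -> minor triad.
print(triad_quality(64, 67, 71))
# B-D-F (symbols 73, 63a, 63b): two stacked minor thirds -> diminished triad.
print(triad_quality(71, 74, 77))
```

 The classifier makes the point of the paragraph concrete: the triad quality is fully determined by the two stacked intervals, which is why the table-driven conversion of units 106a and 106b must pick its offsets per scale degree.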
 Thus, according to the second embodiment, the singer can harmonize despite singing alone, further expanding the singer's expressive range. The pitch conversion described above is merely one example: the conversion may produce something other than a chord, or may be an octave transposition. The voice synthesis units are not limited to two systems; a single unit converting to one pitch in a predetermined relationship may be used, or three or more units may be provided.
 In the second embodiment, the singer's singing voice and the singing voices of the voice synthesis units 140a and 140b are mixed and output from the speaker 172, while the accompaniment sound from the sound source unit 160 is output from a separate speaker 174; however, the singing voice and the accompaniment sound may instead be mixed and output from a single speaker. In other words, it does not matter whether the output unit that outputs the singing voice and the accompaniment sound consists of separate speakers or the same speaker. Furthermore, although the pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch in a predetermined relationship to it, that relationship may be made changeable by an instruction from the control unit 120 or the operation unit 112. The same applies to the pitch conversion unit 106b: the pitch relationship it applies may be made changeable by an instruction from the control unit 120 or the operation unit 112.
<Third Embodiment>
 In the first embodiment, when the progress of the performance reaches a singing timing, the portion of the lyric data to be sung at that timing (characters and pitch) is supplied to the voice synthesis unit 140, so from the singer's point of view the timing of the synthesized lyrics cannot be controlled. In contrast, in the third embodiment described next, the singer can control the timing of the synthesized lyrics to some extent.
 FIG. 7 is a functional block diagram showing the configuration of the singing synthesis apparatus 10 according to the third embodiment. The apparatus shown in this figure differs from the first embodiment of FIG. 1 in that the volume data output from the volume detection unit 108 is supplied to the control unit 120 as well as to the voice synthesis unit 140. The description of the third embodiment therefore focuses on this difference.
 In the third embodiment, the control unit 120 supplies the lyric data corresponding to the next note to the voice synthesis unit 140 when triggered by the volume indicated by the volume data supplied from the volume detection unit 108 exceeding a threshold, or by the temporal change of that volume exceeding a predetermined value. That is, when the volume of the singer's singing exceeds the threshold or the like, the control unit 120 supplies the lyric data corresponding to the next note to the voice synthesis unit 140 even if the progress of the performance has not yet reached the singing timing of that lyric data.
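 This trigger condition can be sketched as follows; the threshold and change-limit constants are illustrative assumptions rather than values from the specification.

```python
VOLUME_THRESHOLD = 0.3  # assumed absolute-volume trigger level
DELTA_THRESHOLD = 0.25  # assumed frame-to-frame volume-rise trigger level

def should_advance(prev_volume, volume,
                   threshold=VOLUME_THRESHOLD, delta=DELTA_THRESHOLD):
    """True when the singer's volume itself, or its rate of increase,
    signals the onset of the next syllable."""
    return volume > threshold or (volume - prev_volume) > delta

print(should_advance(0.05, 0.35))  # loud enough on its own: advance
print(should_advance(0.05, 0.31))  # also a steep rise: advance
print(should_advance(0.20, 0.25))  # quiet and gradual: hold
```

 In the paragraph that follows, a third disjunct could be added for the slope (acceleration) of the volume change; it would compare a second difference of the volume trace against another assumed limit.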
 A specific example of singing voice synthesis according to the third embodiment will now be described. As in the first embodiment, consider the case shown in FIG. 4(a) where the singer selects "Sakura" and, while listening to the accompaniment and following the progress of the performance, sings at the volume shown in FIG. 4(b); in the third embodiment the singing voice is then output as shown in FIG. 4(d).
 The characteristic behavior of the third embodiment is as follows. When the singer, relative to the progress of the performance, lowers the volume in the middle of "ra" (lyric 53) and then raises it again before the next "sa" (lyric 54) (that is, when the temporal change of the volume exceeds the predetermined value), the control unit 120 supplies the lyric data of the next "sa" (symbol 54) to the voice synthesis unit 140 in response to the change in the volume data supplied from the volume detection unit 108. As a result, "sa" (symbol 64) is synthesized earlier than the singing timing defined by the lyric data. The reading of the lyric data for the next note may be triggered not only by the volume indicated by the volume data exceeding the threshold or by the temporal change of that volume exceeding the predetermined value, but also by the slope (acceleration) of that temporal change exceeding a predetermined value.
 Incidentally, when the singer sustains a lyric at roughly the same pitch and roughly the same volume for longer than the timing defined by the lyric data, the lyric can be considered to be deliberately prolonged for expressive effect. To handle this case, the configuration indicated by the broken lines in FIG. 7 may be used: the pitch data output from the pitch detection unit 104 is supplied to the control unit 120 as well as to the voice synthesis unit 140, and when the pitch indicated by that pitch data remains constant within a predetermined range and the volume indicated by the volume data supplied from the volume detection unit 108 also remains constant within a predetermined range, the control unit 120 withholds the next lyric data from the voice synthesis unit 140 for a predetermined time (or until the volume falls), even if the next singing timing has arrived. With this configuration, the singer can have the desired lyric synthesized for longer than the timing defined by the lyric data.
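 The hold condition can be sketched as follows; the tolerance values, and the use of fractional-semitone pitch traces, are illustrative assumptions.

```python
PITCH_TOL = 0.5   # assumed pitch stability band, in semitones
VOLUME_TOL = 0.05  # assumed volume stability band

def is_sustained(pitches, volumes,
                 pitch_tol=PITCH_TOL, volume_tol=VOLUME_TOL):
    """True when both recent pitch and volume traces are nearly constant,
    i.e. the singer is deliberately holding the current syllable."""
    steady = lambda xs, tol: max(xs) - min(xs) <= tol
    return steady(pitches, pitch_tol) and steady(volumes, volume_tol)

# A held note: pitch wobble under half a semitone, volume nearly flat -> hold.
print(is_sustained([60.0, 60.1, 60.2], [0.5, 0.52, 0.49]))
# A fresh attack: the volume excursion breaks the hold condition.
print(is_sustained([60.0, 60.1, 60.2], [0.5, 0.2, 0.6]))
```

 In the broken-line configuration of FIG. 7, the control unit would evaluate a condition like this over a short window of recent frames before deciding whether to defer the next lyric.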
 Thus, according to the third embodiment, the singer can control the timing of the synthesized lyrics to some extent rather than being bound to the timing defined by the lyric data, making it possible to vary the timing of the synthesized singing improvisationally (ad lib). The third embodiment is not limited to combination with the first embodiment; it may also be combined with the second embodiment, in which the singer's own singing and the synthesized singing are mixed.
<Applications and Modifications>
 The present invention is not limited to the first to third embodiments described above; various applications and modifications, for example those described below, are possible. One or more of the following aspects may be selected arbitrarily and combined as appropriate.
 In the first (and second) embodiment, the control unit 120 supplies the lyric data (characters and pitch) corresponding to a singing timing to the voice synthesis unit 140 when the progress of the performance reaches that timing; of these, however, the pitch need not be supplied to the voice synthesis unit 140. The reason is that the voice synthesis unit 140 outputs substantially no singing voice signal while the volume indicated by the volume data is at or below the threshold, and when the volume exceeds the threshold, it uses not the pitch of the lyric data but the pitch indicated by the pitch data output from the pitch detection unit 104. Even in a configuration in which the control unit 120 does not supply the pitch of the lyrics, the voice synthesis unit 140 need only synthesize the characters of the lyric data supplied from the control unit 120, at the pitch indicated by the pitch data of the input voice and at an amplitude according to its volume, when the volume indicated by the volume data of the input voice exceeds the threshold.
 Although MIDI data is used as the accompaniment data in each embodiment, the present invention is not limited to this. For example, the musical tone signal may be obtained by playing back a compact disc. In this configuration, elapsed time information or remaining time information can be used to track the progress of the performance, so the control unit 120 need only supply the lyric data to the voice synthesis unit 140 (140a, 140b) in accordance with the progress of the performance as determined from the elapsed time or remaining time information.
 In each embodiment, the voice input unit 102 receives the singer's singing through a microphone and converts it into a singing voice signal; however, any configuration in which a singing voice signal (input voice) is input in some form will do. For example, the voice input unit 102 may receive a singing voice signal processed by another processing unit or supplied (or transferred) from another apparatus, or may simply be an input interface circuit that receives a singing voice signal and forwards it downstream. The input voice is also not limited to one uttered by the user of the singing synthesis apparatus; it may be uttered by another person (a friend or a third party).
 In each embodiment, the pitch detection unit 104, the pitch conversion units 106a and 106b, and the volume detection unit 108 are implemented in software, but they may be implemented in hardware. The voice synthesis unit 140 (140a, 140b) may likewise be implemented in software. Moreover, control is not limited to the pitch and volume of the synthesized singing voice according to the pitch and volume of the input voice; other voice attributes, such as timbre, may also be controlled according to the pitch and/or volume of the input voice.
 The processor in the present invention is not limited to a processor that can execute a software program, such as the CPU described in the above embodiments; it may be a processor that can execute a microprogram, such as a DSP, or a processor constituted by a dedicated hardware circuit (an integrated circuit or a group of discrete circuits) so as to realize the intended processing function.

Claims (9)

  1.  A singing synthesis apparatus comprising:
     a pitch detection unit that detects a pitch of an input voice;
     a volume detection unit that detects a volume of the input voice; and
     a voice synthesis unit that synthesizes a singing voice based on lyric data supplied according to progress of a performance, the voice synthesis unit controlling a pitch and a volume of the singing voice according to the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.
  2.  The singing synthesis apparatus according to claim 1, wherein
     the lyric data has information defining utterance timings of the lyrics, and
     the voice synthesis unit synthesizes the singing voice according to the utterance timings of the lyric data.
  3.  The singing synthesis apparatus according to claim 1 or 2, wherein
     the voice synthesis unit changes an utterance timing of the singing voice synthesized based on the lyric data according to the volume detected by the volume detection unit.
  4.  The singing synthesis apparatus according to any one of claims 1 to 3, further comprising:
     a sound source unit that generates an accompaniment sound according to the progress of the performance; and
     an output unit that acoustically outputs the accompaniment sound and the singing voice.
  5.  The singing synthesis apparatus according to any one of claims 1 to 4, wherein
     the voice synthesis unit converts the pitch of the singing voice to be synthesized into a pitch having a given interval with respect to the pitch detected by the pitch detection unit, and
     the input voice is acoustically output together with the synthesized singing voice.
  6.  The singing synthesis apparatus according to any one of claims 1 to 5, wherein the voice synthesis unit synthesizes the singing voice based on speech segment data corresponding to characters of the lyric data.
  7.  The singing synthesis apparatus according to any one of claims 1 to 5, wherein the input voice is a voice uttered by a user.
  8.  A computer-implemented method comprising:
     detecting a pitch of an input voice;
     detecting a volume of the input voice; and
     synthesizing a singing voice based on lyric data supplied according to progress of a performance, while controlling a pitch and a volume of the singing voice according to the detected pitch and the detected volume.
  9.  A non-transitory computer-readable storage medium storing a program executable by a processor to perform a method comprising:
     detecting a pitch of an input voice;
     detecting a volume of the input voice; and
     synthesizing a singing voice based on lyric data supplied according to progress of a performance, while controlling a pitch and a volume of the singing voice according to the detected pitch and the detected volume.
PCT/JP2014/078080 2013-10-23 2014-10-22 Singing voice synthesis WO2015060340A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013219805A JP2015082028A (en) 2013-10-23 2013-10-23 Singing synthetic device and program
JP2013-219805 2013-10-23

Publications (1)

Publication Number Publication Date
WO2015060340A1 true WO2015060340A1 (en) 2015-04-30

Family

ID=52992930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/078080 WO2015060340A1 (en) 2013-10-23 2014-10-22 Singing voice synthesis

Country Status (2)

Country Link
JP (1) JP2015082028A (en)
WO (1) WO2015060340A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6801766B2 (en) * 2019-10-30 2020-12-16 カシオ計算機株式会社 Electronic musical instruments, control methods for electronic musical instruments, and programs

Citations (5)

Publication number Priority date Publication date Assignee Title
JPH10268895A (en) * 1997-03-28 1998-10-09 Yamaha Corp Voice signal processing device
JP2002202788A (en) * 2000-12-28 2002-07-19 Yamaha Corp Method for synthesizing singing, apparatus and recording medium
JP2006030609A (en) * 2004-07-16 2006-02-02 Yamaha Corp Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program
JP2006119674A (en) * 2006-01-30 2006-05-11 Yamaha Corp Singing composition method and system, and recording medium
JP2013195928A (en) * 2012-03-22 2013-09-30 Yamaha Corp Synthesis unit segmentation device


Cited By (10)

Publication number Priority date Publication date Assignee Title
JP2017167411A (en) * 2016-03-17 2017-09-21 ヤマハ株式会社 Voice synthesis method and voice synthesis control device
WO2017159083A1 (en) * 2016-03-17 2017-09-21 ヤマハ株式会社 Sound synthesis method and sound synthesis control device
CN107025902A (en) * 2017-05-08 2017-08-08 腾讯音乐娱乐(深圳)有限公司 Data processing method and device
CN107025902B (en) * 2017-05-08 2020-10-09 腾讯音乐娱乐(深圳)有限公司 Data processing method and device
CN110741430A (en) * 2017-06-14 2020-01-31 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN110741430B (en) * 2017-06-14 2023-11-14 雅马哈株式会社 Singing synthesis method and singing synthesis system
CN110390922A (en) * 2018-04-16 2019-10-29 卡西欧计算机株式会社 Electronic musical instrument, the control method of electronic musical instrument and storage medium
CN110390922B (en) * 2018-04-16 2023-01-10 卡西欧计算机株式会社 Electronic musical instrument, control method for electronic musical instrument, and storage medium
JP2020086113A (en) * 2018-11-26 Daiichikosho Co., Ltd. Karaoke system and karaoke device
JP7117228B2 2018-11-26 Daiichikosho Co., Ltd. Karaoke system and karaoke machine

Also Published As

Publication number Publication date
JP2015082028A (en) 2015-04-27

Similar Documents

Publication Publication Date Title
WO2015060340A1 (en) Singing voice synthesis
JP2014501941A (en) Music content production system using client terminal
JP6784022B2 (en) Speech synthesis method, speech synthesis control method, speech synthesis device, speech synthesis control device and program
JP2011048335A (en) Singing voice synthesis system, singing voice synthesis method and singing voice synthesis device
JP2016177276A (en) Pronunciation device, pronunciation method, and pronunciation program
JP2013045082A (en) Musical piece generation device
JPH11184490A (en) Singing synthesizing method by rule voice synthesis
JP4844623B2 (en) Choral synthesis device, choral synthesis method, and program
JP4038836B2 (en) Karaoke equipment
JP4304934B2 (en) Choral synthesis device, choral synthesis method, and program
JP6044284B2 (en) Speech synthesizer
JP6171393B2 (en) Acoustic synthesis apparatus and acoustic synthesis method
JP2003015672A (en) Karaoke device having range of voice notifying function
JP4433734B2 (en) Speech analysis / synthesis apparatus, speech analysis apparatus, and program
JP5018422B2 (en) Harmony sound generator and program
JP4180548B2 (en) Karaoke device with vocal range notification function
JP2014098802A (en) Voice synthesizing apparatus
JP2022065554A (en) Method for synthesizing voice and program
JP5106437B2 (en) Karaoke apparatus, control method therefor, and control program therefor
WO2023233856A1 (en) Sound control device, method for controlling said device, program, and electronic musical instrument
JP7509127B2 (en) Information processing device, electronic musical instrument system, electronic musical instrument, syllable progression control method and program
JP6144593B2 (en) Singing scoring system
JP2002221978A (en) Vocal data forming device, vocal data forming method and singing tone synthesizer
JP7158331B2 (en) karaoke device
JP2009244790A (en) Karaoke system with singing teaching function

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14855205

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: PCT application non-entry in European phase

Ref document number: 14855205

Country of ref document: EP

Kind code of ref document: A1