WO2019003350A1

WO2019003350A1 - Singing sound generation device, method and program

Info

Publication number: WO2019003350A1
Application number: PCT/JP2017/023786
Authority: WO
Inventors: 一輝柏瀬; 桂三濱野
Original assignee: ヤマハ株式会社
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2019-01-03
Also published as: CN110709922B; CN110709922A; JP6809608B2; JPWO2019003350A1

Abstract

Provided is a singing sound generation device capable of defining the sound production pitch of the singing sound to be generated at a period that corresponds to the syllable to be produced. A CPU 10 obtains a sound production or sound production removal instruction specifying pitch, determines the determination duration T according to the obtained syllable information, defines a single sound production pitch after the determination duration T elapses on the basis of the obtained sound production or sound production removal instruction, and generates a singing sound on the basis of the obtained syllable information and the defined sound production pitch.

Description

Sing sound generation apparatus and method, program

The present invention relates to a singing sound generation apparatus, method and program for generating a singing sound based on a pronunciation instruction.

2. Description of the Related Art Conventionally, there has been known an apparatus that uses voice synthesis technology to synthesize and sing a song according to a performer's performance (Patent Document 1). This device updates the singing position in the lyrics indicated by the lyrics data according to the performance. That is, according to each performance operation, this device reads the lyrics in advance in the order defined in the lyric data, and generates a single singing voice at the pitch designated by the performance.

Patent No. 4735544 gazette

By the way, in operation of performance operation elements, such as a keyboard, a plurality of operation elements may be operated due to a user's mistouch, and a plurality of pitches may be designated. In the above-described conventional apparatus, when a plurality of sound generations are instructed by mistouch, there is a possibility that the lyrics may be read out unnecessarily. If a singing voice is generated and output corresponding to each of a plurality of instructed pitches, there is a possibility that the audience may clearly recognize a mistouch.

An object of the present invention is to provide a singing sound generating apparatus, method, and program capable of determining the sound production pitch of a generated singing sound in a period corresponding to a syllable to be pronounced.

In order to achieve the above object, according to the present invention, a syllable acquisition unit for acquiring syllable information indicating one syllable to be pronounced, and a determination unit for determining a standby time according to syllable information acquired by the syllable acquisition unit And the instruction acquiring unit acquires an instruction for sound generation based on the instruction acquiring unit for acquiring the instruction for sound generation or cancellation of sound generation designating the pitch and the instruction for sound generation or cancellation of the pronunciation acquired by the instruction acquiring unit. And a determination unit for determining a single tone pitch after a lapse of the waiting time determined by the determination unit, syllable information acquired by the syllable acquisition unit, and the tone pitch determined by the determination unit. And a generation unit configured to generate a singing sound based on the above.

In addition, the code in the said parenthesis is an illustration.

According to the present invention, the tone pitch of the generated singing sound can be determined in a period corresponding to the syllable to be produced.

It is a schematic diagram of a song sound production | generation apparatus. It is a block diagram of an electronic musical instrument. It is a flowchart which shows an example of the flow of a process in case a performance is performed. It is a figure which shows an example of lyric text data. It is a figure which shows an example of the kind of voice | phonetic segment data. It is a schematic diagram of phoneme type information. It is a figure which shows the volume envelope with respect to the elapsed time at the time of producing a syllable. It is a flowchart of an output sound generation process.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a schematic view of a singing sound generating apparatus according to an embodiment of the present invention. The singing sound generating apparatus is configured as an electronic musical instrument 100 which is a keyboard instrument as an example, and has a main body 30 and a neck 31. The main body portion 30 has a first surface 30a, a second surface 30b, a third surface 30c, and a fourth surface 30d. The first surface 30a is a keyboard mounting surface on which a keyboard section KB composed of a plurality of keys is disposed. The second surface 30 b is the back surface.

Hooks

36 and 37 are provided on the second surface 30 b. A strap (not shown) can be placed between the

hooks

36 and 37, and the player usually puts the strap on his shoulder and performs performance such as operating the keyboard KB. Therefore, at the time of use with shoulders, particularly when the scale direction (key arrangement direction) of the keyboard KB is in the left-right direction, the first surface 30a and the keyboard KB face the listener side, the third surface 30c, the fourth The faces 30d face generally downward and upward, respectively. The neck portion 31 is extended from the side of the main body 30. The neck portion 31 is provided with various operators including the advance operator 34 and the return operator 35. A display unit 33 composed of liquid crystal or the like is disposed on the fourth surface 30 d of the main body 30.

The electronic musical instrument 100 is a musical instrument that simulates singing in response to an operation on a performance operator. Here, the song simulation is to output a voice simulating a human voice by song synthesis. White keys and black keys are arranged in the order of pitches of the keys of the keyboard section KB, and the keys are associated with different pitches. When playing the electronic musical instrument 100, the user presses a desired key on the keyboard KB. The electronic musical instrument 100 detects a key operated by the user, and produces a singing sound of a pitch corresponding to the operated key. The order of the syllables of the singing voice to be pronounced is predetermined.

FIG. 2 is a block diagram of the electronic musical instrument 100. As shown in FIG. The electronic musical instrument 100 includes a central processing unit (CPU) 10, a timer 11, a read only memory (ROM) 12, a random access memory (RAM) 13, a data storage unit 14, a performance control 15, and other operations. A slave 16, a parameter value setting operator 17, a display unit 33, a sound source 19, an effect circuit 20, a sound system 21, a communication I / F (Interface), and a bus 23. The CPU 10 is a central processing unit that controls the entire electronic musical instrument 100. The timer 11 is a module that measures time. The ROM 12 is a non-volatile memory that stores control programs and various data. The RAM 13 is a volatile memory used as a work area of the CPU 10 and various buffers. The display unit 33 is a display module such as a liquid crystal display panel or an organic EL (Electro-Luminescence) panel. The display unit 33 displays the operation state of the electronic musical instrument 100, various setting screens, a message for the user, and the like.

The performance operator 15 is a module that mainly accepts a performance operation that designates a pitch. In the present embodiment, the keyboard portion KB, the advance operator 34, and the return operator 35 are included in the performance operator 15. As an example, when the performance operation element 15 is a keyboard, the performance operation element 15 may be a note on / note off based on sensor on / off corresponding to each key, a key depression strength (speed, velocity), etc. Output performance information. This performance information may be in the form of a MIDI (musical instrument digital interface) message. The other operator 16 is, for example, an operation module such as an operation button or an operation knob for performing settings other than performance, such as settings relating to the electronic musical instrument 100. The parameter value setting operation unit 17 is an operation module such as an operation button or an operation knob that is mainly used to set parameters for the attribute of the singing voice. Examples of this parameter include harmonics, brightness, resonance, and gender factor. The harmony is a parameter for setting the balance of the harmonic component contained in the voice. Brightness is a parameter for setting the tone of the voice and gives a tone change. The resonance is a parameter for setting timbre and strength of singing voice and musical instrument sound. The gender element is a parameter for setting formants, and changes the thickness and texture of the voice in a feminine or male manner. The external storage device 3 is, for example, an external device connected to the electronic musical instrument 100, and is, for example, a device that stores audio data. The communication I / F 22 is a communication module that communicates with an external device. The bus 23 transfers data between the units in the electronic musical instrument 100.

The data storage unit 14 stores singing data 14a. The song data 14a includes lyric text data, a phonological information database, and the like. The lyrics text data is data describing the lyrics. In the lyric text data, the lyrics of each song are described divided in syllable units. That is, the lyric text data has character information obtained by dividing the lyrics into syllables, and the character information is also information for display corresponding to the syllables. Here, the syllable is a group of sounds output in response to one performance operation. The phonological information database is a database storing speech segment data (syllable information). The voice segment data is data indicating a waveform of voice, and includes, for example, spectrum data of a sample string of the voice segment as waveform data. The speech segment data includes segment pitch data indicating the pitch of the waveform of the speech segment. The lyrics text data and the speech segment data may each be managed by a database.

The sound source 19 is a module having a plurality of tone generation channels. Under the control of the CPU 10, one sound generation channel is assigned to the sound source 19 in accordance with the user's performance. In the case of producing a singing voice, the sound source 19 reads voice segment data corresponding to a performance from the data storage unit 14 in the assigned tone generation channel to generate singing voice data. The effect circuit 20 applies the acoustic effect designated by the parameter value setting operator 17 to the singing voice data generated by the sound source 19. The sound system 21 converts the singing sound data processed by the effect circuit 20 into an analog signal by a digital / analog converter. Then, the sound system 21 amplifies the singing sound converted into the analog signal and outputs it from a speaker or the like.

FIG. 3 is a flowchart showing an example of the flow of processing when the electronic musical instrument 100 performs a performance. Here, the processing in the case where the user performs the selection of the musical composition and the performance of the selected musical composition will be described. Further, in order to simplify the description, a case where only a single sound is output will be described even if a plurality of keys are simultaneously operated. In this case, only the highest pitch among the pitches of keys operated simultaneously may be processed, or only the lowest pitch may be processed. The processing described below is realized, for example, by the CPU 10 executing a program stored in the ROM 12 or the RAM 13 and functioning as a control unit that controls various components provided in the electronic musical instrument 100.

When the power is turned on, the CPU 10 waits until an operation of selecting a song to be played is received from the user (step S101). Note that if there is no song selection operation even after a certain time has elapsed, the CPU 10 may determine that a song set by default has been selected. When the CPU 10 receives the selection of the song, it reads the lyric text data of the song data 14a of the selected song. Then, the CPU 10 sets the cursor position at the top syllable described in the lyric text data (step S102). Here, the cursor is a virtual index indicating the position of the syllable to be pronounced next. Next, the CPU 10 determines whether note-on has been detected based on the operation of the keyboard section KB (step S103). When the note-on is not detected, the CPU 10 determines whether the note-off is detected (step S107). On the other hand, when note-on is detected, that is, when a new key depression is detected, the CPU 10 stops the output of the sound if the sound is being output (step S104). Next, the CPU 10 executes an output sound generation process for producing a singing sound according to note-on (step S105).

This output sound generation process is briefly described. First, the CPU 10 reads voice segment data of a syllable corresponding to a cursor position, and outputs a sound of a waveform indicated by the read voice segment data at a pitch corresponding to note-on. Specifically, the CPU 10 obtains the difference between the pitch indicated by the segment pitch data included in the voice segment data and the pitch corresponding to the operated key, and the waveform data is obtained by the frequency corresponding to this difference. The spectral distribution shown is moved in the frequency axis direction. Thus, the electronic musical instrument 100 can output a singing sound at the pitch corresponding to the operated key. Next, the CPU 10 updates the cursor position (read position) (step S106), and advances the process to step S107.

Here, the determination of the cursor position and the sounding of the singing voice according to the processes of steps S105 and S106 will be described using a specific example. The details of the output sound generation process of step S105 will also be described with reference to FIG. First, updating of the cursor position will be described. FIG. 4 is a diagram showing an example of lyrics text data. In the example of FIG. 4, the lyrics of the five syllables c1 to c5 are described in the lyrics text data. Each character "ha", "ru", "yo", "ko", "i" indicates one Japanese hiragana character and each character corresponds to one syllable. The CPU 10 updates the cursor position in syllable units. For example, when the cursor is located at the syllable c3, the voice segment data corresponding to "Y" is read from the data storage unit 14, and the singing voice of "Y" is pronounced. When the sound generation of "Yo" is completed, the CPU 10 moves the cursor position to the next syllable c4. Thus, the CPU 10 sequentially moves the cursor position to the next syllable in response to the note-on.

Next, the pronunciation of the singing sound will be described. FIG. 5 is a diagram showing an example of the type of speech segment data. The CPU 10 extracts speech segment data corresponding to syllables from the phonological information database in order to pronounce syllables corresponding to the cursor position. There are two types of phonetic segment data: phoneme chain data and stationary partial data. The phoneme chain data is data indicating a speech segment when the pronunciation changes, such as "silence (#) to consonant", "consonant to vowel", "vowel to consonant or vowel (of the next syllable)" . The steady part data is data indicating a speech segment when the pronunciation of the vowel continues. For example, when the cursor position is set to “ha” of syllable c 1, the sound source 19 includes voice chain data “# -h” corresponding to “silence → consonant h”, “consonant h → vowel a The voice chain data “ha” corresponding to “” and the stationary partial data “a” corresponding to “vowel a” are selected. Then, when the performance is started and the key depression is detected, the CPU 10 operates the singing voice based on the voice chain data “# -h”, the voice chain data “ha,” and the steady part data “a”. Output according to the pitch according to, the velocity according to the operation. Thus, the determination of the cursor position and the sounding of the singing sound are performed.

When the note-off is detected in step S107 of FIG. 3, if the sound is being output, the CPU 10 stops the output of the sound (step S108), and the process proceeds to step S109. On the other hand, if note-off is not detected, the CPU 10 advances the process to step S109. In step S109, the CPU 10 determines whether the performance has ended. Then, the CPU 10 returns the process to step S103 when the performance has not ended. On the other hand, when the performance is ended, if the sound is being outputted, the CPU 10 stops the output of the sound (step S110), and the processing shown in FIG. 3 is ended. Note that the CPU 10 determines whether the performance has ended, for example, whether the last syllable of the selected song has been pronounced, or whether the operation to end the performance has been performed by the other operating element 16 or the like. It can be determined based on

FIG. 6 is a schematic view of phoneme type information. The ROM 12 stores phoneme type information Q shown in FIG. The phoneme type information Q designates the type of each phoneme that can be included in the singing voice. Specifically, the phoneme type information Q distinguishes each phoneme constituting the speech segment applied to the speech synthesis process into a first type q1 and a second type q2. Here, the vowel start delay amount differs depending on the syllable. The vowel start delay amount is a delay amount from the start of sound production of the syllable to the start of sound production of the vowel in the syllable, and can also be referred to as the duration (consonant section length) of the consonant located immediately before the vowel. For example, vowels themselves (a [a], i [i], u [M], e [o], [o]) have a vowel start delay of 0 (in [] conformed to X-SAMPA Phonetic notation). The first type q1 is a type of a phoneme having a relatively large vowel start delay (for example, a phoneme whose vowel start delay exceeds a predetermined threshold), and the second type q2 is a vowel start delay having a first type q1 The type of a relatively small phoneme (for example, a phoneme whose vowel start delay amount falls below a threshold) as compared to the phoneme of. For example, half vowels (/ w /, / y /), nasal sounds (/ m /, / n /), tears (/ ts /), frictional noises (/ s /, / f /), roars (/ kja /, Consonants of / kju /, / kjo /, etc. are classified into the first category q1, and vowels (/ a /, / i /, / u /), flowing sounds (/ r /, / l /), plosives ( Phonemes such as / t /, / k /, / p / are classified into the second type q2. For example, a double vowel in which two vowels are made continuous is classified into the first type q1 when there is an accent in the rear vowel, and classified into a second type q2 when there is an accent in the forward vowel. You may adopt the treatment of doing.

The CPU 10 refers to the phoneme type information Q, and specifies the phoneme type corresponding to the syllable (the first phoneme when composed of a plurality of phonemes) specified by the read syllable information. For example, the CPU 10 determines which of the first type q1, the second type q2 and the vowel corresponds to the first phoneme of the syllable. The first phoneme can be obtained from phoneme chain data in speech segment data. In the song data 14a, phoneme type information shown in FIG. 6 may be associated with each of a plurality of syllable information. In that case, the CPU 10 may specify the phoneme type corresponding to the syllable specified by the read syllable information by the phoneme type information associated with the syllable information.

As described in FIG. 8, when the phoneme type can be specified (extracted), the CPU 10 determines the determination time width T based on the phoneme type (for example, determined according to the vowel start delay amount). Further, when the phoneme type can not be specified, the CPU 10 determines the determination time width T based on the waveform data of the volume envelope indicated by the read syllable information. Here, in the case where the phoneme type can not be specified (extracted), for example, the phoneme type information Q is not stored in the electronic musical instrument 100, and the phoneme type information is not associated with the read syllable information. Case is applicable. Besides, in the case where the phoneme type can not be specified (extracted), the phoneme type information Q is stored, but the phoneme type corresponding to the read syllable information is not registered in the phoneme type information Q, and A case where phoneme type information is not associated with the output syllable information corresponds to the case.

FIG. 7 is a diagram showing a volume envelope with respect to an elapsed time when producing syllables. When the phoneme type can not be specified, the CPU 10 determines the determination time width T based on, for example, the time from the rising of the waveform of the volume envelope indicated by the read syllable information to the peak. The time from the rise time point t1 to the peak time point t3 of the waveform is tP. The CPU 10 determines the time from the time point t1 to the time point t2 corresponding to a predetermined ratio (for example, 70%) of the time tP as the judgment time width T.

FIG. 8 is a flowchart of the output sound generation process executed in step S105 of FIG. First, the CPU 10 reads syllable information (speech segment data) of the syllable corresponding to the cursor position (step S201). The syllable corresponding to the cursor position is the syllable to be pronounced this time. Therefore, the CPU 10 acquires syllable information indicating one syllable to be pronounced from among the plurality of syllable information in a predetermined order. Next, the CPU 10 determines whether the phoneme type can be identified from the read syllable information (step S202). Here, as described above, the CPU 10 determines whether the phoneme type corresponding to the syllable specified by the read syllable information is registered in the phoneme type information Q (FIG. 6) or that the phoneme type information corresponds to the syllable information. If it is assigned, it is determined that the phoneme type can be identified. Even when the phoneme type information Q can not be referred to due to some circumstances, it corresponds to the case where the phoneme type corresponding to the syllable specified by the syllable information is not registered in the phoneme type information Q.

As a result of the determination, when the phoneme type can be specified, the CPU 10 specifies syllable information (step S203), and determines the determination time width T based on the specified syllable information (step S204). For example, the CPU 10 determines the determination time width T in accordance with the vowel start delay amount of the syllable to be pronounced (the first phoneme). Specifically, when the phoneme type is vowel, the CPU 10 determines the determination time width T to be 0 since the vowel start delay amount is 0. The CPU 10 determines the determination time width T to be a predetermined value when the phoneme type is the second type q2, and since the vowel start delay amount is relatively large when the phoneme type is the first type q1, the determination time width T The value is determined to be larger than the predetermined value. Thereafter, the process proceeds to step S206.

As a result of the determination in step S202, when the phoneme type can not be identified from the read syllable information, the CPU 10 determines the determination time width T based on the waveform data acquired from the read syllable information (step S205). . That is, as described above, the CPU 10 determines, as the determination time width T, a predetermined ratio (for example, 70%) of the time tP from the rising to the peak in the waveform (FIG. 7) of the volume envelope indicated by the syllable information. . In addition, a predetermined ratio is not limited to the value of an illustration. Further, the determination time width T may be shorter than the time tP, and a value obtained by subtracting a predetermined time from the time tP may be determined as the determination time width T. Thereafter, the process proceeds to step S206.

In step S206, the CPU 10 calculates a determination timing for determining a mistouch based on the note-on detection timing in step S103 and the determined determination time width T. The determination time width T (standby time) is a period provided for determination of an erroneous operation, and the point in time when the determination time width T has elapsed from the note-on detection timing is the determination timing. Note that, when note-on is detected in step S103, clocking is started. When an operation to specify a plurality of pitches is performed between the detection timing of the note-on and the determination timing, the CPU 10 can determine that there is a mistouch. Next, the CPU 10 extracts waveform data from the read syllable information (step S207). Next, the CPU 10 generates and outputs a sound of a waveform indicated by the extracted waveform data at a pitch corresponding to note-on. Specifically, the CPU 10 obtains the difference between the pitch indicated by the segment pitch data included in the speech segment data and the pitch corresponding to the note-on, and the spectrum indicated by the waveform data by the frequency corresponding to this difference. Move the distribution in the frequency axis direction. Thus, the electronic musical instrument 100 can output a singing sound at the pitch corresponding to the operated key.

Next, the CPU 10 secures a storage area for storing performance information in the RAM 13 (or the data storage unit 14) (step S209). This storage area is an area for storing information (note-on, note-off) indicating the performance operation of the performance operation element 15 performed until the determination timing comes. The performance operation of the performance operation element 15 corresponds to a sound generation or an instruction to release the sound generation with a specified pitch, and the CPU 10 corresponds to an instruction acquisition unit for acquiring this instruction. Next, the CPU 10 stores information (pitch and timing) on the note-on detected in step S103 in the storage area (step S210). Then, the CPU 10 determines whether or not the determination timing has come (step S211).

As a result of the determination, if the determination timing has not arrived, the CPU 10 determines whether a new performance operation (note on or note off) has been detected (step S212). Then, if a new performance operation is not detected, the CPU 10 returns the process to step S211 (step S212). When a new performance operation is detected, the CPU 10 stores performance information indicating the new performance operation in the storage area (step S213), and returns the process to step S211. Therefore, each time a new pronunciation instruction or a pronunciation cancellation instruction is detected, the information is accumulated.

As a result of the determination in step S211, when the determination timing has come, the CPU 10 advances the process to step S214. In steps S214 to S217, the CPU 10 executes a process of determining a single tone pitch based on the note on detected in step S103 and the note on or note off detected until the determination timing comes. Do. First, the CPU 10 determines, based on the performance information stored in the storage area, whether to immediately stop the output of sound (step S214). Specifically, the CPU 10 determines that the sound output should be immediately stopped when there is no key (not pressed and released) key in the note on state. When it is determined that the output of the sound should be immediately stopped (step S215: YES), the CPU 10 stops the sound being output (step S216), and ends the process shown in FIG. On the other hand, when it is determined that the output of the sound should not be stopped immediately, the CPU 10 detects a pitch to be output based on the performance information stored in the storage area (step S217). This determines a single pitch to be output.

Here, the detection of the pitch to be output will be specifically described. First, in step S103, the instruction (note-on) for the sound generation detected in the case where the instruction for sound generation for any pitch is not maintained (all keys are in the key release state), and the note-on pitch is Call it the first pitch. Therefore, in step S208, generation of the singing sound is started at the first pitch. In addition, after the sound generation instruction to specify the first pitch, before the determination timing arrives, the sound generation instruction (note-on) to specify the "second pitch" different from the first pitch is Consider the case.

In general, a plurality of patterns requiring correction of a mistouch at the time of playing are assumed. Here, the first pattern and the second pattern will be described as an example. The first pattern is a pattern in which, in order to operate a desired key (for example, C3), another key (for example, D3) is pressed and operated. In order to correct a mistouch, it is assumed that the user cancels the first pressing operation (note on) in a short time (note off) and presses the desired key again (note on). Then, the user normally maintains the operation for a desired length of time after pressing the desired key. The second pattern is a pattern in which an adjacent key (for example, D3) and a desired key are pressed at the same time to operate a desired key (for example, C3). In order to correct a mistouch, it is assumed that the user releases only the wrong operation key (note off) while maintaining the operation on the desired key among the operations (note on) for the two keys pressed first. Be done.

The CPU 10 refers to the performance information stored in the storage area, and when note-on for designating a pitch other than the first note-on is detected by the arrival of the determination timing, it is pressed at the time of arrival of the determination timing. It is determined that the key maintaining the key state is the desired key. Then, the CPU 10 detects a pitch (for example, C3) corresponding to the key determined as the desired key as a pitch to be output. In addition, there may be a mistouch in which the desired key is pressed at first, and then the other key is temporarily pressed and released while the desired key is kept pressed. In this case, the pitch of the first note-on is detected as the pitch to be output. It is also conceivable that the key pressed first is released and two or more keys different from the first key are pressed and the two or more keys are pressed at the arrival of the determination timing. In this case, the pitch of the key pressed last among the two or more keys in the pressed state may be detected as the pitch to be output.

In the case of mistouch, it is assumed that it is rare to operate a key spatially separated from the desired key. Therefore, the operation to be subjected to the mistouch determination may be limited to the operation of the adjacent key. In this case, non-adjacent key operations may be treated as newly performed normal operations. That is, the operation of the remote key may be treated as the operation detected in the process of step S103. The above-described method of determining the desired key is an example, and the CPU 10 may determine the desired key with reference to any information such as note-on, note-off, velocity included in the performance information.

Next, the CPU 10 determines whether to correct the currently output pitch (step S218). Specifically, the CPU 10 checks whether the pitch generated in step S208 and the pitch detected in step S217 do not match, and if the two do not match, corrects the pitch being output. Determine that it should be. When the CPU 10 determines that the tone pitch being output should not be corrected, the CPU 10 ends the process shown in FIG. In this case, no pitch correction is made. On the other hand, when it is determined that the pitch being output should be corrected, the CPU 10 adjusts (changes) the pitch of the sound being output to the pitch detected in step S217 (step S219) , And end the process shown in FIG. For example, the CPU 10 changes the pitch by pitch shift, and in this pitch shift, the spectrum distribution indicated by the waveform data is moved in the frequency axis direction by a frequency corresponding to the difference in pitch to be shifted. The CPU 10 may change the pitch gradually, for example, in units of 20 cents.

As described above, according to the process shown in FIG. 8, new syllables are not read out even if there is note-on until the determination timing comes. That is, there is no possibility that all of the erroneous operation correction operations are reflected in the reading of the lyrics and the unintended lyrics are not read. In particular, the process of determining a single tone pitch (steps S214 to S217) can be summarized as follows.

First, a sound generation instruction for instructing a second pitch different from the first pitch is acquired between the sound generation instruction for instructing the first pitch and the arrival of the determination timing, and the second When an instruction of sound generation instructing a pitch is maintained at the determination timing, the second pitch is detected as a pitch to be output. In this case, the sound production pitch of the generated singing sound is corrected from the first pitch to the second pitch (S219). As a result, within the range of the determination time width T, pitch correction can be performed by re-operation. Since the pitch can be corrected in a short time without muting without changing the syllable (i.e., the lyrics) to be produced in response to the mistouch, it is difficult for the user to feel that there is a mistouch.

On the other hand, even if the sound generation instruction instructing the second pitch different from the first pitch is acquired between the sound generation instruction instructing the first pitch and the arrival of the determination timing. If the sound generation instruction for instructing the first pitch is maintained at the arrival of the determination timing and the sound generation instruction for instructing the second pitch is not maintained, then the sound pitch of the generated singing sound is corrected I will not. As a result, even if a mistake operation is performed within the range of the determination time width T, the original pitch is maintained if it is eliminated before the determination timing comes.

According to the present embodiment, CPU 10 determines determination time width T according to the acquired syllable information, and based on the acquired pronunciation or pronunciation cancellation instruction, a single signal is output after determination time width T has elapsed. The tone pitch is determined, and a singing voice is generated based on the acquired syllable information and the determined tone pitch. Thus, the tone pitch of the generated singing sound can be determined in a period corresponding to the syllable to be pronounced.

In particular, the CPU 10 determines the determination time width T based on the phoneme type indicated by the acquired syllable information or determines based on the waveform of the volume envelope indicated by the acquired syllable information. As a result, for example, by lengthening the correction opportunity of the mistake operation with respect to syllable information having a thin pitch sense, it is possible to pronounce a regular pitch with small discomfort. That is, a syllable having a small vowel or vowel start delay amount is likely to be prominent in pitch correction for correcting a mistouch. On the other hand, the consonant section has a faint sense of pitch compared to the vowel section. Therefore, when determining the determination time width T based on the phoneme type, the CPU 10 determines the determination time width T according to the vowel start delay amount. That is, for a phoneme having a large vowel start delay amount, the CPU 10 sets the determination time width T relatively long to a phoneme having a small vowel start delay amount. As a result, it is possible to secure a long chance of correcting the mistouch for syllables having a large vowel start delay amount, while making the mistouch inconspicuous.

In addition, when determining the determination time width T based on the waveform of the sound volume envelope indicated by the syllable information (FIG. 7), the CPU 10 determines the time shorter than the time tP from the rising of the waveform to the peak as the determination time width T decide. As a result, it is possible to correct the generated pitch before the sound production volume rises sufficiently, and make it possible to make the mistouch inconspicuous.

Also, the CPU 10 determines a single tone pitch based on the first note-on and the note-on or note-off detected until the determination timing comes. Thus, it is possible to correct the pitch by redoing the operation before the arrival of the determination timing, and it is possible to avoid that the pitch is corrected one by one by a temporary mistake operation.

The determination time width T is not limited to defining the absolute time. For example, it may be a relative time according to the tempo. When the determination time width T is determined based on the phoneme type, it is not limited to two steps such as the first type q1 and the second type q2, but the determination time width of a value different for each phoneme type or each phoneme type group T may be determined.

In the present embodiment, although the case where the performance operation element 15 is a keyboard has been described, the performance operation element 15 may have a shape in which strings are arranged side by side like a guitar. In addition, the instruction for sound generation or sound generation cancellation designating the pitch is not limited to the configuration in which the input is made with an operator such as a key. Further, the performance control 15 may be a keyboard, a string, or a plurality of buttons for pitch designation displayed on the touch panel. For example, when a pitch is designated by a touch operation on the application, a mistake operation is also assumed in which the pitch changes while the note-on operation continues. In addition, the performance control 15 may be one in which operation receiving units for inputting a plurality of different pitches are spatially arranged side by side. Moreover, although the case where the data of MIDI format were handled was demonstrated, it is not restricted to this. For example, the CPU 10 may analyze voice data such as a microphone input and extract a timing of sound generation and a pitch to obtain an instruction of sound generation or sound generation cancellation. Therefore, the apparatus to which the singing sound generating apparatus of the present invention is applied is not limited to the keyboard instrument.

Further, in the present embodiment, Japanese lyrics are exemplified as the lyrics to be sung, but the present invention is not limited to this, and other languages may be used. One letter and one syllable do not necessarily correspond. For example, in the case of "da" (da) having a cloud point, two letters "ta" (ta) and "" "correspond to one syllable. For example, when the English lyrics are "september", it becomes three syllables of "sep" "tem" "ber". Although "sep" is one syllable, three characters "s" "e" "p" correspond to one syllable. Each time the user manipulates the performance operation element 15, the CPU 10 sequentially pronounces each syllable at the pitch of the operated key.

Although the present invention has been described in detail based on its preferred embodiments, the present invention is not limited to these specific embodiments, and various embodiments within the scope of the present invention are also included in the present invention. included.

Note that the storage medium storing the control program represented by the software for achieving the present invention may be read out to the present instrument to achieve the same effect, in which case, the storage medium is read from the storage medium. The program code itself implements the novel functions of the present invention, and the non-transitory computer readable recording medium storing the program code constitutes the present invention. Also, the program code may be supplied via a transmission medium or the like, in which case the program code itself constitutes the present invention. In addition to ROMs, floppy disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, magnetic tapes, non-volatile memory cards, etc. can be used as storage media in these cases. The “non-transitory computer readable recording medium” is a volatile memory (for example, a server or client internal to the computer system when the program is transmitted via a network such as the Internet or a communication line such as a telephone line) It also includes one that holds a program for a fixed time, such as a dynamic random access memory (DRAM).

10 CPU (syllable acquisition unit, determination unit, instruction acquisition unit, determination unit, generation unit)
100 electronic musical instruments

Claims

A syllable acquisition unit for acquiring syllable information indicating one syllable to be pronounced;
A determination unit that determines a standby time according to syllable information acquired by the syllable acquisition unit;
An instruction acquisition unit that acquires a sound generation or sound release instruction that specifies a pitch;
A single tone pitch after the waiting time determined by the determining unit has elapsed after the instruction acquiring unit has acquired the sounding instruction based on the sounding or releasing instruction acquired by the instruction acquiring unit. And a finalization unit to determine
And a generation unit configured to generate a singing sound based on syllable information acquired by the syllable acquisition unit and the tone pitch determined by the determination unit.
The generation unit starts generation of the singing sound at the first pitch based on the sound generation instruction instructing the first pitch when no sound generation instruction of any pitch is maintained. Between the sound generation instruction instructing the first pitch and the elapse of the waiting time, the sound generation instruction instructing the second pitch different from the first pitch is acquired, and the second When the sound generation instruction for instructing the pitch is maintained at the timing when the waiting time has elapsed, the sound production pitch of the singing voice to be generated is corrected from the first pitch to the second pitch. The song sound production | generation apparatus of Claim 1.
The generation unit starts generation of the singing sound at the first pitch based on the sound generation instruction instructing the first pitch when no sound generation instruction of any pitch is maintained. Even in the case where the sound generation instruction for instructing the second pitch different from the first pitch is acquired between the sound generation instruction for instructing the first pitch and the elapse of the waiting time. When the instruction of sound generation instructing the first pitch is maintained and the instruction of sound generation instructing the second pitch is not maintained at the timing when the standby time has elapsed, the song to be generated The song sound generation device according to claim 1, wherein the sound production pitch of the sound is not corrected.
The singing sound generating apparatus according to any one of claims 1 to 3, wherein the determination unit determines the waiting time based on a phoneme type indicated by the acquired syllable information.
The song sound generation device according to any one of claims 1 to 3, wherein the determination unit determines the standby time based on a waveform of a volume envelope indicated by the acquired syllable information.
The singing voice generating device according to any one of claims 1 to 5, wherein the syllable acquiring unit acquires syllable information indicating the one syllable in a predetermined order from a plurality of syllable information.
A syllable acquisition step of acquiring syllable information indicating one syllable to be pronounced;
A determination step of determining a waiting time according to syllable information acquired by the syllable acquisition step;
An instruction acquiring step of acquiring an instruction of pronunciation or pronunciation cancellation with a designated pitch;
A single sounding sound is generated after an elapse of the waiting time determined by the determination step after the instruction of the sound generation is acquired by the instruction acquisition step based on the instruction of the sound generation or the sound generation cancellation acquired by the instruction acquisition step. A determination step to determine the height;
A generation step of generating a singing sound based on syllable information acquired in the syllable acquiring step and the tone pitch decided in the deciding step.
A syllable acquisition step of acquiring syllable information indicating one syllable to be pronounced;
A determination step of determining a waiting time according to syllable information acquired by the syllable acquisition step;
An instruction acquiring step of acquiring an instruction of pronunciation or pronunciation cancellation with a designated pitch;
A single sounding sound is generated after an elapse of the waiting time determined by the determination step after the instruction of the sound generation is acquired by the instruction acquisition step based on the instruction of the sound generation or the sound generation cancellation acquired by the instruction acquisition step. A determination step to determine the height;
A program for causing a computer to execute a generation step of generating a singing sound based on syllable information acquired in the syllable acquisition step and the sound production pitch determined in the determination step.