JP5949607B2 - Speech synthesizer - Google Patents

Speech synthesizer

Info

Publication number
JP5949607B2
Authority
JP
Japan
Prior art keywords
retake
singing
editing
user
sequence data
Prior art date
Legal status
Active
Application number
JP2013052758A
Other languages
Japanese (ja)
Other versions
JP2014178512A (en)
Inventor
入山 達也 (Tatsuya Iriyama)
Original Assignee
ヤマハ株式会社 (Yamaha Corporation)
Priority date
Filing date
Publication date
Application filed by ヤマハ株式会社 (Yamaha Corporation)
Priority to JP2013052758A
Publication of JP2014178512A
Application granted
Publication of JP5949607B2
Application status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS
    • G10H 2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/315: Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H 2250/455: Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis

Description

  The present invention relates to a speech synthesis technique for electrically synthesizing speech.

  As an example of this type of speech synthesis technique, there is a singing synthesis technique that electrically synthesizes a singing voice based on information indicating the note string constituting the melody of a song (that is, information indicating the prosodic change of the melody: hereinafter, music information) and information indicating the lyrics to be uttered in accordance with each note (information indicating the phoneme sequence constituting the lyrics: hereinafter, lyric information) (see, for example, Patent Documents 1 to 3). In recent years, application software that allows a general-purpose computer such as a personal computer to perform such singing synthesis has come into wide distribution. One example of this kind of application software combines a singing synthesis database, which stores waveform data of various phonemes extracted from the voices of voice actors and singers, with a singing synthesis program.

  The singing synthesis program is a program that causes a computer to execute processing for generating waveform data representing the sound waveform of a singing voice: it reads from the singing synthesis database the waveform data of the phonemes specified by the lyric information, pitch-converts them to the pitches specified by the music information, and concatenates them in order of pronunciation. In addition, to obtain a natural singing voice close to a human one, some singing synthesis programs allow fine specification of various voice parameters, such as the velocity and volume with which the lyrics are uttered, in addition to the phoneme sequence constituting the lyrics and the pitches at which the lyrics are pronounced.
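
  The pipeline just described, reading stored phoneme waveforms, pitch-converting them, and concatenating them in order of pronunciation, can be outlined as follows. This is an illustrative sketch only, not the patented implementation; the database layout and function names are assumptions.

```python
# Illustrative sketch of the concatenative singing synthesis pipeline:
# look up phoneme waveforms, pitch-shift each to the note's pitch,
# then join them in order of pronunciation.
from dataclasses import dataclass

@dataclass
class Note:
    phoneme: str      # from the lyric information
    pitch_hz: float   # from the music information

def synthesize(notes, phoneme_db, shift_pitch):
    """phoneme_db: dict phoneme -> waveform samples (hypothetical database);
    shift_pitch: callable(waveform, target_hz) -> waveform."""
    voice = []
    for note in notes:
        unit = phoneme_db[note.phoneme]          # read stored waveform data
        unit = shift_pitch(unit, note.pitch_hz)  # convert to specified pitch
        voice.extend(unit)                       # concatenate in order
    return voice
```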

Patent Document 1: WO 2007/010680
Patent Document 2: JP 2005-181840 A
Patent Document 3: JP 2002-268664 A

  When recording a singer's voice for a CD release or the like, a "retake" may be performed in which, at the direction of a recording director or the like, the singer sings again until the result is satisfactory and all or part of the singing voice is re-recorded. In such a retake, the recording director or the like designates the time section to be retaken (hereinafter, retake section) and the desired singing mode (for example, "more softly" or "make the lyrics clearer"), and the singer sings again, by trial and error, until the singing mode indicated by the director or the like is realized.

  Needless to say, in singing synthesis it is preferable that a singing voice in the singing mode desired by the user of the singing synthesis program be synthesized. In singing synthesis, the singing mode of the synthesized voice can be changed by editing the various parameters that define the utterance mode, just as a retake changes a human performance. However, from a general user's standpoint it is often unclear which parameter to edit, and how, in order to realize a mode such as "more softly", so the desired singing mode cannot easily be achieved. The same applies when a voice other than a singing voice, such as a voice reading a literary work aloud or a guidance voice for various kinds of guidance, is electrically synthesized based on information about the prosodic change of the voice to be synthesized (information corresponding to the music information in singing synthesis) and information representing its utterance content (information corresponding to the lyric information in singing synthesis). In the following, re-synthesizing speech so that a desired utterance (a desired singing, in the case of singing synthesis) is realized is also referred to as a retake.

  The present invention has been made in view of the above problems, and an object of the present invention is to provide a technique that enables a retake of synthesized speech without directly editing the various parameters representing the speech utterance mode.

  In order to solve the above problems, the present invention provides a speech synthesizer that synthesizes speech according to sequence data including a plurality of types of parameters representing speech utterance modes, comprising: retake means that allows the user to designate a retake section in which speech is to be re-synthesized, edits the parameters in the retake section, among the parameters included in the sequence data, by a predetermined editing process, and generates sequence data representing the retake result; and selection support means that presents the sound represented by the sequence data generated by the retake means and allows the user to select retake re-execution or retake completion.

  According to such a speech synthesizer, when the user designates a retake section in which speech is to be re-synthesized, the parameters included in the sequence data of the retake section are edited by a predetermined editing process, and the sound represented by the edited sequence data is presented to the user. The user can instruct completion of the retake if the synthesized speech presented in this way is in the desired utterance mode, and can instruct re-execution of the retake if it differs from the desired one; the synthesized speech can thus be retaken without directly editing the various parameters. Only one type of editing process may be prepared, or a plurality of types may be prepared. When a plurality of types of editing processes are determined in advance, the selection support means may present to the user the result of editing by each of the plurality of types of editing processes and have the user select the one in the desired utterance mode (that is, instruct retake completion). In this case, if the user selects none of the editing results, it may be treated as an instruction to re-execute the retake, and the processing by the retake means may be performed again with the strength of the editing processes adjusted.
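
  The retake cycle described above (edit the designated section, present the result, let the user choose re-execution or completion, and adjust the editing strength on re-execution) can be outlined roughly as below; all names are hypothetical.

```python
# Hypothetical sketch of the retake cycle: edit the parameters in the
# user-designated section, present the result, and repeat with adjusted
# strength until the user instructs retake completion.
def retake_loop(sequence_data, retake_section, edit, present, user_accepts):
    strength = 1.0
    while True:
        candidate = edit(sequence_data, retake_section, strength)
        present(candidate)       # synthesize and play the edited result
        if user_accepts():       # retake completion
            return candidate
        strength *= 1.5          # retake re-execution: adjust edit strength
```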

  A specific example of such a speech synthesizer is a singing synthesizer that synthesizes a singing voice based on music information and lyric information. Another specific example is a speech synthesizer that electrically synthesizes a voice other than a singing voice, such as a voice reading a literary work aloud or a guidance voice for various kinds of guidance, based on information indicating the prosodic change of the voice to be synthesized and information representing its utterance content. As another aspect of the present invention, there is also conceivable a mode of providing a program that causes a computer to function as: speech synthesis means for performing speech synthesis in accordance with sequence data including a plurality of types of parameters representing speech utterance modes; retake means for allowing the user to designate a retake section in which speech is to be re-synthesized, editing the parameters in the retake section, among the parameters included in the sequence data, by a predetermined editing process, and generating sequence data representing the retake result; and selection support means for presenting the sound represented by each set of sequence data generated by the retake means and allowing the user to select retake re-execution or retake completion.

  In a more preferred mode, there are a plurality of types of editing processes, each grouped by the voice utterance mode it realizes (in the case of singing synthesis, a singing mode such as "softly" or "clear consonant"). The retake means causes the user to designate, together with the retake section, the voice utterance mode in the retake section, and generates sequence data representing the retake result by the editing process corresponding to the voice utterance mode designated by the user. According to such an aspect, the user can retake the synthesized singing voice without directly editing the various parameters, simply by designating the desired utterance mode and the retake section and instructing the retake.

  In another preferred aspect, the speech synthesizer further includes pre-evaluation means for excluding from the objects to be presented by the selection support means any voice that is synthesized according to sequence data edited by an editing process but differs little from the voice synthesized according to the sequence data before editing. Although details are given later, some of the above editing processes are phoneme-dependent and have little effect on specific phonemes. According to this aspect, editing results that have little effect, due to phoneme dependency or the like, can be excluded from presentation to the user.

  As another more preferable aspect, the speech synthesizer may further include: a table that stores, in association with each other, processing content data representing the processing content of an editing process and priority data representing a priority with which that editing process is used; and an evaluation unit that receives, for each set of sequence data generated by the retake means, the user's evaluation value for the sound represented by that sequence data and updates, according to the evaluation value, the priority data associated with the processing content data representing the processing content of the editing process used to generate the sequence data. The selection support means may then present the sounds represented by the sequence data generated by the retake means in descending order of priority. Even among editing processes realizing the same utterance mode, the evaluation of the editing result often differs depending on the user's preference. Such an aspect makes it possible to reflect the user's preference as to which editing process is used in realizing a given utterance mode, and to present retake results in an order according to that preference.

FIG. 1 is a diagram showing a configuration example of the singing synthesis apparatus 10A of the first embodiment of the present invention. FIG. 2 is a diagram showing an example of the input screen displayed on the display unit of the user I/F unit 120 of the singing synthesis apparatus 10A. FIG. 3 is a diagram showing an example of the retake support screen displayed on the display unit of the user I/F unit 120 of the singing synthesis apparatus 10A. FIG. 4 is a diagram showing an example of the retake support table 144c stored in the nonvolatile storage unit 144 of the singing synthesis apparatus 10A. FIG. 5 is a flowchart showing the flow of the processing executed by the control unit 110 in accordance with the singing synthesis program 144a stored in the nonvolatile storage unit 144. FIG. 6 is a diagram showing an example of the singing synthesis sequence data generated by the control unit 110. FIG. 7 is a diagram showing an example of the editing processing in the embodiment. FIG. 8 is a diagram for explaining the effect of the editing processing. FIG. 9 is a diagram showing a configuration example of the singing synthesis apparatus 10B of the second embodiment of the present invention. FIG. 10 is a flowchart showing the flow of the processing executed by the control unit 110 of the singing synthesis apparatus 10B in accordance with the singing synthesis program 144d.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(A: First Embodiment)
FIG. 1 is a diagram showing a configuration example of the singing synthesis apparatus 10A according to the first embodiment of the present invention. Like a conventional singing synthesis apparatus, the singing synthesis apparatus 10A is an apparatus that electrically generates waveform data of a singing voice from music information representing the note string constituting the melody of the song whose singing voice is to be synthesized and lyric information representing the lyrics to be sung in accordance with each note. As shown in FIG. 1, the singing synthesis apparatus 10A includes a control unit 110, a user I/F unit 120, an external device I/F unit 130, a storage unit 140, and a bus 150 that mediates data exchange among these components.

  The control unit 110 is, for example, a CPU (Central Processing Unit). The control unit 110 reads and executes the singing synthesis program 144a stored in the storage unit 140 (more precisely, the nonvolatile storage unit 144), and functions as the control center of the singing synthesis apparatus 10A. The processing executed by the control unit 110 in accordance with the singing synthesis program 144a is described later.

  The user I/F unit 120 provides various user interfaces for allowing the user to use the singing synthesis apparatus 10A. The user I/F unit 120 includes a display unit for displaying various screens and an operation unit for allowing the user to input various data and instructions (neither is shown in FIG. 1). The display unit is composed of a liquid crystal display and its drive circuit, and displays images representing various screens under the control of the control unit 110. The operation unit includes a keyboard having numerous keys such as numeric keys and cursor keys, and a pointing device such as a mouse. When the user performs some operation on the operation unit, the operation unit supplies data representing the operation content to the control unit 110 via the bus 150, whereby the user's operation content is conveyed to the control unit 110.

  Examples of screens displayed on the display unit of the user I/F unit 120 include an input screen for allowing the user to input music information and lyric information, and a retake support screen for supporting the retake of the synthesized singing voice. FIG. 2 is a diagram showing an example of the input screen. As shown in FIG. 2, the input screen has two areas, area A01 and area A02. In area A01, an image simulating a piano roll is displayed; in this image, the vertical axis (the key arrangement direction of the piano roll) represents pitch and the horizontal axis represents time. The user inputs the information on a note (pitch, sound generation start time, and note duration) by drawing a rectangle R1 in area A01 with a mouse or the like at the position corresponding to the desired pitch and sounding time, and can input lyric information by entering into the rectangle R1 the hiragana or phonetic symbols representing the phonemes to be pronounced in accordance with the note. Further, the time variation of the pitch can be designated by drawing a pitch curve PC under the rectangle R1 with a mouse or the like.

  Area A02 is an area for allowing the user to specify the values and time variation of parameters, such as velocity (shown as "VEL" in FIG. 2) and volume (shown as "DYN" in FIG. 2), that represent the voice utterance mode but belong to neither the music information nor the lyric information. FIG. 2 illustrates the case of specifying velocity. The user designates the character string corresponding to the desired parameter with a mouse or the like and draws a graph representing the value of the parameter (graphs G1 and G2 in the example shown in FIG. 2), whereby its time variation can be specified.

  When a time section to be retaken is designated by dragging with the mouse or the like on the input screen shown in FIG. 2, the retake support screen shown in FIG. 3(a) is displayed on the display unit. FIG. 3(a) illustrates the case where the third and fourth measures are designated as the retake section. The user viewing the retake support screen can display the singing mode designation menu M1 by clicking the instruction button B1 with the mouse, and can indicate the desired singing mode by selecting it from the plurality of singing modes displayed in the menu M1 ("soft", "hard", "clear consonant", and "clear vowel" in the example shown in FIG. 3). The designation of a singing mode is not limited to per-note designation; it may span several notes. For example, as shown in FIG. 3(b), when the singing mode "expansively" is selected, a button B2 for designating the strength of the instruction is displayed; when this button B2 is clicked, a graph curve GP for designating the time variation of the instruction strength is displayed, and the user can input the strength of the instruction by deforming the graph curve GP with the mouse or the like.

  It goes without saying that the synthesized singing voice can also be retaken by directly editing the various parameters through operations on the input screen (see FIG. 2). In particular, a user familiar with singing synthesis can freely realize a desired singing mode by finely adjusting the values of the various parameters. For a general user, however, it is often unclear which parameter to edit, and how, in order to realize the desired singing mode. A feature of the singing synthesis apparatus 10A of the present embodiment is that even a general user who does not know which parameters to edit can easily perform a retake simply by designating a retake section and then designating a singing mode on the retake support screen.

  The external device I/F unit 130 is a collection of various input/output interfaces such as a USB (Universal Serial Bus) interface and a NIC (Network Interface Card). When an external device is connected to the singing synthesis apparatus 10A, it is connected to a suitable one of the input/output interfaces included in the external device I/F unit 130. An example of an external device connected to the external device I/F unit 130 is a sound system that reproduces sound according to waveform data. In the present embodiment, lyric information and music information are input to the singing synthesis apparatus 10A via the user I/F unit 120, but these pieces of information may instead be input via the external device I/F unit 130. Specifically, a storage device such as a USB memory storing the music information and lyric information of the song to be synthesized may be connected to the external device I/F unit 130, and the control unit 110 may be caused to read the information from the storage device.

  The storage unit 140 includes a volatile storage unit 142 and a nonvolatile storage unit 144. The volatile storage unit 142 is configured by, for example, a RAM (Random Access Memory). The volatile storage unit 142 is used by the control unit 110 as a work area when executing various programs. The non-volatile storage unit 144 is configured by a non-volatile memory such as a hard disk or a flash memory. The nonvolatile storage unit 144 stores a program and data for causing the control unit 110 to realize functions unique to the singing voice synthesizing apparatus 10A of the present embodiment.

  An example of a program stored in the nonvolatile storage unit 144 is the singing synthesis program 144a. The singing synthesis program 144a causes the control unit 110 to execute processing for generating waveform data representing a synthesized singing voice based on music information and lyric information, as in the conventional singing synthesis technique, and in addition causes the control unit 110 to execute retake support processing. Examples of data stored in the nonvolatile storage unit 144 include screen format data (not shown in FIG. 1) defining the formats of the various screens, the singing synthesis database 144b, and the retake support table 144c. The singing synthesis database 144b is not described in detail because it does not differ in particular from the singing synthesis database of a conventional singing synthesis apparatus.

FIG. 4 is a diagram illustrating an example of the retake support table 144c.
As shown in FIG. 4, the retake support table 144c stores, in association with each singing mode identifier (character string information representing a singing mode) that can be designated on the retake support screen (see FIG. 3), processing content data representing a plurality of types of editing processing capable of realizing that singing mode. In the example shown in FIG. 4, processing content data representing three types of editing processing are stored in association with one singing mode identifier: "(Method A): decrease the velocity (in other words, lengthen the duration of the consonant)", "(Method B): raise the volume of the consonant", and "(Method C): lower the pitch of the consonant".

  The reason that a plurality of types of editing processing are associated with one singing mode, as shown in FIG. 4, is that which of the plurality of editing contents is most effective in realizing that singing mode can vary with the context and type of the phonemes included in the retake section. For example, if a consonant of the lyrics included in the retake section is "s", then since the consonant "s" has no pitch, (Method C) has no effect, and (Method A) and (Method B) are considered effective. If a consonant of the lyrics included in the retake section is "t", (Method B) is considered effective, and if a consonant of the lyrics included in the retake section is "d", all of (Method A), (Method B), and (Method C) are considered effective.
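
  For illustration, the association that the retake support table 144c establishes might be modeled as a plain mapping, as in the following sketch; the layout and identifiers are assumptions, since FIG. 4 is not reproduced here.

```python
# A sketch of the retake support table 144c as a plain mapping (layout
# assumed): each singing-mode identifier is associated with several
# candidate editing processes.
RETAKE_SUPPORT_TABLE = {
    "clear consonant": [
        "method_a",  # decrease velocity (lengthens the consonant)
        "method_b",  # raise the volume of the consonant
        "method_c",  # lower the pitch of the consonant
    ],
    # "soft", "hard", "clear vowel", ... would be stored the same way
}

def candidate_edits(singing_mode: str) -> list[str]:
    # Step SA150: read all processing-content entries for the designated mode
    return RETAKE_SUPPORT_TABLE[singing_mode]
```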

  Next, the processing executed by the control unit 110 in accordance with the singing synthesis program 144a will be described. The control unit 110 reads the singing synthesis program 144a into the volatile storage unit 142 and starts executing it. FIG. 5 is a flowchart showing the flow of the processing executed by the control unit 110 in accordance with the singing synthesis program 144a. As shown in FIG. 5, this processing is divided into singing synthesis processing (steps SA100 to SA120) and retake support processing (steps SA130 to SA170).

  The control unit 110, having started execution of the singing synthesis program 144a, first displays the input screen shown in FIG. 2 on the display unit of the user I/F unit 120 (step SA100) and prompts input of music information and lyric information. The user viewing the input screen shown in FIG. 2 operates the operation unit of the user I/F unit 120, inputs the music information and lyric information of the song whose singing voice is to be synthesized, and instructs the start of synthesis. When the start of synthesis is instructed via the user I/F unit 120, the control unit 110 generates singing synthesis sequence data from the music information and lyric information received via the user I/F unit 120 (step SA110).

  FIG. 6(a) is a diagram showing a singing synthesis score, which is an example of singing synthesis sequence data. As shown in FIG. 6(a), the singing synthesis score includes a pitch data track and a phoneme data track, which are time-series data sharing the same time axis. Various parameters representing the pitch and volume of each note constituting the song are mapped to the pitch data track, and the sequence of phonemes constituting the lyrics to be pronounced in accordance with each note is mapped to the phoneme data track. That is, in the singing synthesis score shown in FIG. 6(a), because the time axes of the pitch data track and the phoneme data track coincide, the information on the notes constituting the melody of the song to be synthesized and the phonemes of the lyrics sung in accordance with those notes are associated with each other.
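
  A minimal sketch of such a two-track score follows; the field names are assumptions, the point being only that both tracks index the same tick axis so that notes and phonemes stay associated.

```python
# Minimal sketch (structure assumed) of the singing synthesis score of
# FIG. 6(a): two time-series tracks sharing one time axis.
from dataclasses import dataclass

@dataclass
class PitchEvent:
    time: int        # ticks on the shared time axis
    pitch: int       # note number
    volume: int

@dataclass
class PhonemeEvent:
    time: int        # same tick axis as the pitch data track
    phonemes: str    # e.g. "a" or "s a"

@dataclass
class SingingSynthesisScore:
    pitch_track: list[PitchEvent]
    phoneme_track: list[PhonemeEvent]
```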

  FIG. 6(b) is a diagram showing another specific example of singing synthesis sequence data. The singing synthesis sequence data shown in FIG. 6(b) is data in XML format; for each note constituting the song, it associates information on the sound represented by the note (sound generation time, note length, pitch, volume, and velocity) with data on the lyric pronounced in accordance with the note (the phonetic characters representing the lyric and its phonemes). For example, in the XML-format singing synthesis sequence data shown in FIG. 6(b), the data delimited by the tags <note> and </note> corresponds to one note. More specifically, within the data delimited by <note> and </note>, the data delimited by <posTick> and </posTick> represents the sound generation time of the note, the data delimited by <durTick> and </durTick> represents the length of the note, and the data delimited by <noteNum> and </noteNum> represents the pitch of the note. Furthermore, the data delimited by <lyric> and </lyric> represents the lyric pronounced in accordance with the note, and the data delimited by <phnms> and </phnms> represents the phonemes corresponding to the lyric.
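
  The XML layout described above can be produced with a few lines of code; the following sketch uses only the tag names quoted in the text, and the surrounding document structure is an assumption.

```python
# A sketch that serializes one note in the XML layout described above
# (tag names taken from the text; surrounding document structure assumed).
import xml.etree.ElementTree as ET

def note_element(pos_tick, dur_tick, note_num, lyric, phnms):
    note = ET.Element("note")
    ET.SubElement(note, "posTick").text = str(pos_tick)   # sound generation time
    ET.SubElement(note, "durTick").text = str(dur_tick)   # note length
    ET.SubElement(note, "noteNum").text = str(note_num)   # pitch
    ET.SubElement(note, "lyric").text = lyric             # phonetic characters
    ET.SubElement(note, "phnms").text = phnms             # phonemes
    return note

print(ET.tostring(note_element(1920, 480, 60, "さ", "s a"), encoding="unicode"))
```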

  Various modes are conceivable as to the unit in which singing synthesis sequence data is generated. For example, one set of singing synthesis sequence data may be generated for the entire song to be synthesized, or singing synthesis sequence data may be generated for each block, such as the first verse, the second verse, the A melody, the B melody, or the chorus. Needless to say, the latter mode is preferable when retakes are taken into account.

In step SA120, following step SA110, the control unit 110 first generates waveform data of the synthesized singing voice based on the singing synthesis sequence data generated in step SA110. The generation of the waveform data of the synthesized singing voice does not differ in particular from that in a conventional singing synthesis apparatus, so a detailed description is omitted. Next, the control unit 110 supplies the waveform data generated based on the singing synthesis sequence data to the sound system connected to the external device I/F unit 130, which outputs it as sound.
The above is the singing synthesis processing.

Next, the retake support processing will be described.
The user can listen to the synthesized singing sound output from the sound system and confirm whether the intended singing voice has been synthesized. The user can then operate the operation unit of the user I/F unit 120 to give a synthesis completion instruction or a retake instruction (specifically, information indicating the time section to be retaken): completion of synthesis is instructed if the intended singing voice has been synthesized, and a retake is instructed otherwise. The control unit 110 determines whether the instruction given via the user I/F unit 120 is synthesis completion or retake (step SA130). When the given instruction is synthesis completion, the control unit 110 writes the singing synthesis sequence data generated in step SA110 (or the waveform data generated in step SA120) into a predetermined area of the nonvolatile storage unit 144 and ends execution of the singing synthesis program 144a. When a retake is instructed, the processing from step SA140 onward is executed.

  In step SA140, executed when a retake is instructed, the control unit 110 displays the retake support screen shown in FIG. 3 on the display unit of the user I/F unit 120. The user viewing the retake support screen can operate the operation unit of the user I/F unit 120 to designate a desired singing mode. When the singing mode is designated in this way, the control unit 110 first reads the plurality of processing content data stored in the retake support table 144c in association with that singing mode (step SA150).

  Next, the control unit 110 executes retake processing (step SA160), which applies to the singing synthesis sequence data belonging to the section designated in step SA140 the parameter editing indicated by each of the plurality of types of processing content data read in step SA150. In this retake processing, editing is not only performed according to each of the plurality of types of processing content data individually; several of the editing processes may also be executed in combination.

  For example, when the singing mode designated by the user is "clear consonant", conceivable editing is not only (Method A), (Method B), and (Method C) shown in FIG. 4 individually, but also their combinations: (Method A) and (Method B); (Method A) and (Method C); (Method B) and (Method C); and (Method A), (Method B), and (Method C). When the tempo of the synthesized singing voice to be retaken is slow, executing any one of (Method A), (Method B), and (Method C) yields the effect of clarifying the consonant pronunciation; however, when the tempo is fast or the notes included in the retake section are short, a sufficient effect may not be obtained unless several methods are used in combination.
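
  Enumerating the candidate edits, each single method plus every combination, is straightforward; the sketch below yields the seven candidates named above for "clear consonant".

```python
# Sketch of the candidate set described above: every non-empty combination
# of the methods read from the retake support table.
from itertools import combinations

methods = ["method_a", "method_b", "method_c"]
candidates = [
    combo
    for r in range(1, len(methods) + 1)
    for combo in combinations(methods, r)
]
# -> (A,), (B,), (C,), (A,B), (A,C), (B,C), (A,B,C): seven retake candidates
```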

  The phrase structure or musical structure in the retake section may also be used in the retake processing. For example, when "stronger" is instructed as the singing mode, the user may be presented with options such as strengthening the entire retake section measure by measure, strengthening only the first beat, strengthening only the second beat by 10%, or strengthening the first beat by 20%, and the content of the retake processing may be varied according to the user's selection. In addition, referring to a dictionary storing information indicating the accent position of each word, the accented part of a word included in the lyrics of the retake section may be emphasized, and an option allowing the user to designate this may be presented.

  In the editing by (Method A) of the present embodiment, the control unit 110 calculates the post-editing velocity V1 by multiplying the pre-editing velocity V0 by 1/10. In the editing by (Method B), the control unit 110 calculates a parameter D1[t] representing the post-editing volume by multiplying the parameter D0[t] representing the pre-editing volume by a function k[t] (see FIG. 7(a)) representing a curve that peaks at the note-on time (t = 0 in this operation example) and takes a constant value (1 in the present embodiment) in other time sections. This raises the volume only near the note-on time. In the editing by (Method C), the control unit 110 calculates a parameter P1[t] representing the post-editing pitch by subtracting from the parameter P0[t] representing the pre-editing pitch a function k[t] (see FIG. 7(b)) representing a curve with a steep valley at the note-on time (t = 0 in this operation example), and uses as the parameter B1[t] representing the pitch bend sensitivity the value of the function n[t] shown in FIG. 7(b).
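
  The three edits can be stated compactly as follows. Since FIG. 7 shows the curves k[t] and n[t] only schematically, the concrete curve shapes and constants in this sketch are assumptions.

```python
# Hedged sketch of the three edits; curve shapes and constants are assumed.
import numpy as np

def method_a(velocity: float) -> float:
    return velocity * 0.1                     # V1 = V0 * 1/10

def method_b(volume: np.ndarray, t: np.ndarray) -> np.ndarray:
    k = 1.0 + 0.5 * np.exp(-(t / 0.05) ** 2)  # peak at note-on, 1 elsewhere
    return volume * k                         # D1[t] = D0[t] * k[t]

def method_c(pitch: np.ndarray, t: np.ndarray):
    k = 2.0 * np.exp(-np.abs(t) / 0.01)       # steep valley at note-on (t = 0)
    n = np.full_like(t, 2.0)                  # pitch bend sensitivity B1[t] = n[t]
    return pitch - k, n                       # P1[t] = P0[t] - k[t]
```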

  When the retake processing is completed, the control unit 110 executes selection support processing (step SA170). In this selection support processing, the control unit 110 presents to the user the singing voice represented by each set of singing synthesis sequence data generated by the retake processing and prompts the user to select one of them. The user listens to the singing voices presented by the singing synthesis apparatus 10A and selects the one judged to best realize the singing mode designated on the retake support screen, thereby instructing the singing synthesis apparatus 10A that the retake is complete. The control unit 110 saves the singing synthesis sequence data according to the instruction given by the user, whereby the retake of the synthesized singing voice is completed.

  For example, when the lyric in the retake section is "asa" and the sound waveform before the retake is as shown in FIG. 8(a), the edited sound waveforms are as shown in FIG. 8(b) and FIG. 8(c). Likewise, when the lyric in the retake section is "ada" and the sound waveform before the retake is as shown in FIG. 8(d), the edited sound waveform is as shown in FIG. 8(e). The user perceives the difference between the sound waveform shown in FIG. 8(a) and that shown in FIG. 8(b) (or FIG. 8(c)), or between the sound waveform shown in FIG. 8(d) and that shown in FIG. 8(e), as an audible difference such that the consonant is heard more clearly.

  As described above, according to the present embodiment, a retake of the synthesized singing voice in the desired singing mode can be realized without directly editing parameters such as pitch, velocity, and volume. In the present embodiment, the case has been described in which the singing synthesis sequence data is edited using each of the processing content data acquired in step SA150, singing synthesis sequence data corresponding to each processing content data is generated, and the results are then presented; however, retake processing and presentation of the retake result may instead be repeated as many times as there are processing content data. Specifically, a cycle of editing the singing synthesis sequence data according to one processing content data, generating waveform data based on the edited singing synthesis sequence data, and outputting that waveform data as sound (that is, presenting the editing result) may be repeated for the number of processing content data.

  When the screen size available for displaying the singing mode designation menu M1 is too small for the number of singing modes that can be designated, the singing modes may be grouped in advance (for example, into those relating to per-note singing modes and those relating to singing modes spanning several notes), and the processing from step SA140 to step SA170 may be repeated for the number of groups: designation of a per-note singing mode, editing of the singing synthesis sequence data, generation of waveform data based on the edited sequence data, output of the waveform data as sound, then designation of a singing mode spanning several notes, editing of the singing synthesis sequence data, and so on. (Alternatively, upon completion of the processing of steps SA140 to SA170 for one group, the processing of step SA130 may be executed to prompt the user to input either synthesis completion or a retake instruction; when a retake is instructed (that is, when retake re-execution is instructed), the processing for another group is started, and when synthesis completion is instructed, the processing for the remaining groups is omitted.) When retake re-execution is instructed, the retake section may be re-designated, or re-designation may be omitted (that is, the same retake section as before may be used). Such a mode copes with the case where the singing mode designation menu M1 cannot be displayed at a sufficient screen size, and also has the effect of avoiding user confusion caused by presenting many singing modes at once.

  Further, in the mode in which singing modes are grouped into per-note modes, modes spanning several notes, modes spanning several measures, and so on, the singing modes may be presented to the user in order starting from the per-note group. This makes it possible to check retake results systematically, from individual notes outward to wider editing ranges, so that even a novice user unfamiliar with singing synthesis can retake the singing voice easily and systematically. Of course, as a result of grouping, only one type of singing mode may belong to a group; in that case, when the singing mode designation menu M1 for that group is displayed, a menu that simply reads "retake" may be displayed instead of the singing mode identifier (for example, "clear consonant"). This is because presenting detailed information to a novice user may cause hesitation or anxiety, and a simple display may be preferable.

(B: Second Embodiment)
FIG. 9 is a diagram illustrating a configuration example of the singing voice synthesizing apparatus 10B according to the second embodiment of the present invention.
In FIG. 9, the same components as those in FIG. 1 are denoted by the same reference numerals. As is apparent from a comparison of FIG. 9 with FIG. 1, the configuration of the singing synthesis apparatus 10B differs from that of the singing synthesis apparatus 10A in that a singing synthesis program 144d is stored in the nonvolatile storage unit 144 instead of the singing synthesis program 144a. The singing synthesis program 144d, which is the difference from the first embodiment, is mainly described below.

  FIG. 10 is a flowchart showing the flow of the processing executed by the control unit 110 in accordance with the singing synthesis program 144d. As is apparent from a comparison of FIG. 10 with FIG. 5, the singing synthesis program 144d of the present embodiment differs from the singing synthesis program 144a of the first embodiment in that it causes the control unit 110 to execute pre-evaluation processing (step SA165) following the retake processing (step SA160), and to execute the selection support processing (step SA170) after the pre-evaluation processing. The pre-evaluation processing (step SA165), which is the difference from the first embodiment, is mainly described below.

  In the pre-evaluation processing (step SA165), the control unit 110 generates, for each set of singing synthesis sequence data generated by the retake processing, waveform data according to that sequence data, determines whether it differs from the waveform data generated according to the original singing synthesis sequence data, and excludes from the objects presented to the user in the selection support processing (step SA170) any singing synthesis sequence data determined not to differ. As a specific method of determining whether the waveform data generated according to the retake-processed singing synthesis sequence data differs from that generated according to the original sequence data, conceivable modes include: obtaining, for the sample sequence representing the former waveform data and the sample sequence representing the latter, the difference (for example, the amplitude difference) between samples at the same time and judging that there is a difference when the sum of the absolute values of the differences exceeds a predetermined threshold; and obtaining the correlation coefficient between the two sample sequences and judging according to how far its value falls below 1. The reason for providing such pre-evaluation processing is as follows.
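
  Both determination modes described above reduce to a few lines; in the following sketch the threshold values are assumptions, as the embodiment leaves them as design parameters.

```python
# Sketch of the two difference tests described above; threshold values
# are assumptions (design parameters in the embodiment).
import numpy as np

def differs(original: np.ndarray, retaken: np.ndarray,
            abs_threshold: float = 1e-3, corr_threshold: float = 0.999) -> bool:
    # Mode 1: sum of absolute sample-wise (amplitude) differences vs. threshold
    if np.sum(np.abs(retaken - original)) > abs_threshold * len(original):
        return True
    # Mode 2: correlation coefficient between the two sample sequences;
    # the further it falls below 1, the more the waveforms differ
    corr = np.corrcoef(original, retaken)[0, 1]
    return corr < corr_threshold

# Candidates for which differs(...) is False are excluded from presentation.
```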

  Each editing process represented by the plurality of types of processing content data associated with a singing mode identifier can realize the singing mode represented by that identifier, but, as described above, a sufficient effect may not be obtained depending on the phonemes included in the retake section or on the tempo and note lengths. The fact that there is no difference between the waveform data generated according to singing synthesis sequence data edited as indicated by certain processing content data and the waveform data generated according to the original sequence data means that the editing indicated by that processing content data did not produce an effect sufficient to realize the singing mode. That is, the pre-evaluation processing of the present embodiment is provided to exclude from the objects to be checked by the user those retake results that could not sufficiently realize the singing mode designated by the user, thereby making the user's checking work efficient.

  According to the present embodiment as well, as in the first embodiment, a retake of the synthesized singing voice in the desired singing mode can be realized without directly editing parameters such as pitch, velocity, and volume. In addition, according to the present embodiment, ineffective retake results can be excluded from presentation to the user, so the user can check and select retake results efficiently.

(C: Modifications)
Although the first and second embodiments of the present invention have been described above, it goes without saying that the following modifications may be made to these embodiments.
(1) In the above embodiments, application examples to a singing synthesis apparatus that electrically synthesizes a singing voice based on music information and lyric information have been described. However, the application target of the present invention is not limited to a singing synthesis apparatus; the present invention may of course be applied to a speech synthesizer that electrically synthesizes a voice, such as a voice reading a literary work aloud or a guidance voice, based on information indicating the prosodic change of the voice to be synthesized (information corresponding to the music information in singing synthesis) and the phoneme sequence of the voice (information corresponding to the lyric information in singing synthesis). Moreover, the present invention may of course be applied not only to apparatuses dedicated to speech synthesis but also to apparatuses that execute speech synthesis processing in parallel with other processing (or as part of other processing), such as a game machine running a role-playing game that outputs character speech, or a toy with a voice playback function.

(2) In the above embodiments, the retake support table 144c is stored in the nonvolatile storage unit 144 as data separate from the singing synthesis program. However, the retake support table 144c may be integrated with the singing synthesis program (that is, incorporated into the singing synthesis program) and stored in the nonvolatile storage unit 144.

(3) In each of the above embodiments, processing content data representing mutually different editing processes are stored in the retake support table 144c in association with the singing mode identifiers indicating the singing modes. However, a plurality of processing content data that represent the same editing content but with different editing strengths may also be stored in the retake support table 144c as representing different editing content. For example, in place of the processing content data representing (Method A) in the retake support table 144c shown in FIG. 4, processing content data indicating that the velocity is halved may be stored as (Method A1), processing content data indicating that the velocity is reduced to 1/3 as (Method A2), and processing content data indicating that the velocity is reduced to 1/10 as (Method A3). In this case, the combination of (Method A1) and (Method A2) may be handled as an editing process that reduces the velocity to 1/6, or alternatively combinations of editing processes that represent the same editing content with different strengths may be disallowed.

(4) In each of the above embodiments, the retake support table 144c stores, in association with each singing mode identifier that can be designated on the retake support screen, processing content data representing a plurality of types of editing processing capable of realizing that singing mode. However, the retake support table 144c may instead store only processing content data representing mutually different processing contents; editing processing according to each of the processing content data may be applied to the singing synthesis sequence data, the user may be asked to check the editing results and select the desired retake result, and further the user may be asked to check what kind of effect each editing process produced and to classify the processing content data by effect.

(5) Each of a plurality of types of editing processing realizing the same singing mode may be given a priority according to the user's preference, and retake results may be presented to the user in descending order of the priority of the editing process used. Specifically, priority data representing the priority of the editing process indicated by each processing content data (all having the same value in the initial state, such as at factory shipment) is stored in the retake support table 144c in association with that processing content data. In the selection support processing, the user inputs an evaluation value for each retake result (for example, 0 if it is judged to have no effect, and a larger value the greater the effect is judged to be), and the control unit 110 is caused to execute evaluation processing that updates the priority of each processing content data according to the evaluation values. Then, in the selection support processing, retake results are presented to the user in order starting from those generated by the processing content represented by processing content data with higher priority. According to such an aspect, the user's preference as to which editing process to use in realizing a given singing mode can be reflected, and retake results can be presented in an order according to that preference. Moreover, priority data may be stored for each phoneme included in the retake section, and the editing process may be selected according to the singing mode designated by the user and the phonemes included in the retake section.
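
A sketch of such a priority mechanism follows; the update rule (simple accumulation of evaluation values) is an assumption, the embodiment only requiring that priorities move with the user's evaluations.

```python
# Sketch of modification (5): per-method priorities, updated from the
# user's evaluation values and used to order the presented retake results.
priority = {"method_a": 1.0, "method_b": 1.0, "method_c": 1.0}  # factory state

def record_evaluation(method: str, evaluation: float) -> None:
    priority[method] += evaluation          # 0 = no effect, larger = better

def presentation_order(results: dict) -> list:
    # results: method -> retake result; present in descending priority
    return [results[m] for m in sorted(results, key=priority.get, reverse=True)]
```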

  Retake processing, presentation of the retake result, and evaluation input (processing prompting input of either synthesis completion or a retake instruction) may be performed for each processing content data in descending order of priority, with the priorities updated each time a retake is instructed. In such a mode, the order in which editing processes are adopted may switch dynamically, which is expected to further enhance the efficiency with which the user checks and selects retake results.

(6) In each of the above embodiments, the case has been described in which the input of music information and lyric information and the designation of the retake section and singing mode are performed via the user I/F unit 120 provided in the singing synthesis apparatus. However, a communication I/F unit for exchanging data with a communication partner via a telecommunication line such as the Internet may be provided instead of the user I/F unit 120; the music information and lyric information may be input, and the retake section and singing mode designated, via the telecommunication line, and each set of singing synthesis sequence data generated by the retake processing (or the waveform data generated according to it) may be returned via the telecommunication line. Such a mode makes it possible to provide a so-called cloud singing synthesis service.

(7) In each of the above embodiments, the program causing the control unit 110 to execute the processing that prominently exhibits the features of the present invention (the singing synthesis program 144a in the first embodiment, the singing synthesis program 144d in the second embodiment) is stored in advance in the nonvolatile storage unit of the singing synthesis apparatus. However, the program may be distributed written on a computer-readable recording medium such as a CD-ROM, or distributed by download via a telecommunication line such as the Internet, because a general computer can be made to function as the singing synthesis apparatus of each of the above embodiments according to the distributed program.

  Further, in each of the above embodiments, the processing that prominently exhibits the features of the present invention (the retake processing and selection support processing in the first embodiment, and the pre-evaluation processing in addition to these two in the second embodiment) is realized by software. However, the singing synthesis apparatus 10A of the first embodiment may instead be constructed by configuring the retake means that executes the retake processing and the selection support means that executes the selection support processing as electronic circuits and incorporating these electronic circuits into a general singing synthesis apparatus, and the singing synthesis apparatus 10B of the second embodiment may be constructed by additionally incorporating an electronic circuit that executes the pre-evaluation processing as the pre-evaluation means.

DESCRIPTION OF SYMBOLS: 10A, 10B ... singing synthesis apparatus; 110 ... control unit; 120 ... user I/F unit; 130 ... external device I/F unit; 140 ... storage unit; 142 ... volatile storage unit; 144 ... nonvolatile storage unit; 150 ... bus.

Claims (3)

  1. A speech synthesizer that synthesizes speech according to sequence data including a plurality of types of parameters representing speech utterance modes, comprising:
    retake means for allowing a user to designate a retake section in which speech is to be re-synthesized, editing the parameters in the retake section, among the parameters included in the sequence data, by each of a plurality of predetermined types of editing processing, and generating sequence data representing retake results;
    selection support means for presenting the sounds represented by the sequence data generated by the retake means and allowing the user to select retake re-execution or retake completion; and
    pre-evaluation means for excluding, from the objects to be presented by the selection support means, any voice synthesized according to sequence data edited by one of the plurality of types of editing processing whose difference from the voice synthesized according to the sequence data before editing is small.
  2. The speech synthesizer according to claim 1, wherein the plurality of types of editing processing are grouped by the voice utterance mode realized by performing each editing processing, and
    the retake means causes the user to designate, together with the retake section, the voice utterance mode in the retake section, and generates sequence data representing the retake result by the editing processing corresponding to the voice utterance mode designated by the user.
  3. The speech synthesizer according to claim 1, further comprising:
    a table storing, in association with each other, processing content data representing the processing content of each of the plurality of types of editing processing and priority data representing a priority with which that editing processing is used; and
    evaluation means for receiving the user's evaluation value for the sound represented by each set of sequence data generated by the retake means, and updating, according to the evaluation value, the priority data associated with the processing content data representing the processing content of the editing processing used to generate that sequence data,
    wherein the selection support means presents the sounds represented by the sequence data generated by the retake means in descending order of priority.
JP2013052758A 2013-03-15 2013-03-15 Speech synthesizer Active JP5949607B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013052758A JP5949607B2 (en) 2013-03-15 2013-03-15 Speech synthesizer

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2013052758A JP5949607B2 (en) 2013-03-15 2013-03-15 Speech synthesizer
EP14157748.6A EP2779159A1 (en) 2013-03-15 2014-03-05 Voice synthesis device, voice synthesis method, and recording medium having a voice synthesis program stored thereon
US14/198,464 US9355634B2 (en) 2013-03-15 2014-03-05 Voice synthesis device, voice synthesis method, and recording medium having a voice synthesis program stored thereon
CN201410098488.6A CN104050961A (en) 2013-03-15 2014-03-17 Voice synthesis device, voice synthesis method, and recording medium having a voice synthesis program stored thereon

Publications (2)

Publication Number Publication Date
JP2014178512A JP2014178512A (en) 2014-09-25
JP5949607B2 true JP5949607B2 (en) 2016-07-13

Family

ID=50190344

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013052758A Active JP5949607B2 (en) 2013-03-15 2013-03-15 Speech synthesizer

Country Status (4)

Country Link
US (1) US9355634B2 (en)
EP (1) EP2779159A1 (en)
JP (1) JP5949607B2 (en)
CN (1) CN104050961A (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9159310B2 2012-10-19 2015-10-13 The TC Group A/S Musical modification effects
EP2930714B1 * 2012-12-04 2018-09-05 National Institute of Advanced Industrial Science and Technology Singing voice synthesizing system and singing voice synthesizing method
US9123315B1 * 2014-06-30 2015-09-01 William R Bachand Systems and methods for transcoding music notation
US9384728B2 * 2014-09-30 2016-07-05 International Business Machines Corporation Synthesizing an aggregate voice
JP2016177276A 2015-03-20 2016-10-06 Yamaha Corporation Pronunciation device, pronunciation method, and pronunciation program
JP6004358B1 * 2015-11-25 2016-10-05 Techno-Speech, Inc. Speech synthesis apparatus and speech synthesis method
JP2019066648A * 2017-09-29 2019-04-25 Yamaha Corporation Method for assisting in editing singing voice and device for assisting in editing singing voice
JP2019066650A * 2017-09-29 2019-04-25 Yamaha Corporation Method for assisting in editing singing voice and device for assisting in editing singing voice
JP2019066649A * 2017-09-29 2019-04-25 Yamaha Corporation Method for assisting in editing singing voice and device for assisting in editing singing voice
JP2019101094A * 2017-11-29 2019-06-24 Yamaha Corporation Voice synthesis method and program

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4731847A * 1982-04-26 1988-03-15 Texas Instruments Incorporated Electronic apparatus for simulating singing of song
JP3333022B2 * 1993-11-26 2002-10-07 Fujitsu Limited Singing voice synthesizing apparatus
US5703311A * 1995-08-03 1997-12-30 Yamaha Corporation Electronic musical apparatus for synthesizing vocal sounds using formant sound synthesis techniques
US5895449A * 1996-07-24 1999-04-20 Yamaha Corporation Singing sound-synthesizing apparatus and method
JPH117296A * 1997-06-18 1999-01-12 Oputoromu KK Storage medium having electronic circuit and speech synthesizer having the storage medium
JP2000105595A * 1998-09-30 2000-04-11 Victor Company of Japan, Ltd. Singing device and recording medium
JP2002268664A 2001-03-09 2002-09-20 Ricoh Co., Ltd. Voice converter and program
JP3815347B2 * 2002-02-27 2006-08-30 Yamaha Corporation Singing voice synthesizing method and apparatus, and recording medium
JP3823930B2 * 2003-03-03 2006-09-20 Yamaha Corporation Singing synthesizing apparatus and singing synthesis program
US20040193429A1 * 2003-03-24 2004-09-30 Suns-K Co., Ltd. Music file generating apparatus, music file generating method, and recorded medium
JP4409279B2 2003-12-22 2010-02-03 Hitachi, Ltd. Speech synthesis apparatus and speech synthesis program
CN101223571B * 2005-07-20 2011-05-18 Matsushita Electric Industrial Co., Ltd. Voice tone variation portion locating device and method
US8244546B2 * 2008-05-28 2012-08-14 National Institute of Advanced Industrial Science and Technology Singing synthesis parameter data estimation system
JP5269668B2 * 2009-03-25 2013-08-21 Toshiba Corporation Speech synthesis apparatus, program, and method
WO2012011475A1 * 2010-07-20 2012-01-26 National Institute of Advanced Industrial Science and Technology Singing voice synthesis system accounting for tone alteration and singing voice synthesis method accounting for tone alteration
JP5743625B2 * 2011-03-17 2015-07-01 Toshiba Corporation Speech synthesis editing apparatus and speech synthesis editing method
KR101274961B1 * 2011-04-28 2013-06-13 (주)티젠스 Music contents production system using client device
US20120304057A1 * 2011-05-23 2012-11-29 Nuance Communications, Inc. Methods and apparatus for correcting recognition errors
JP5712818B2 * 2011-06-30 2015-05-07 Fujitsu Limited Speech synthesis apparatus, sound quality correction method and program
US8729374B2 * 2011-07-22 2014-05-20 Howling Technology Method and apparatus for converting a spoken voice to a singing voice sung in the manner of a target singer

Also Published As

Publication number Publication date
US20140278433A1 (en) 2014-09-18
US9355634B2 (en) 2016-05-31
CN104050961A (en) 2014-09-17
EP2779159A1 (en) 2014-09-17
JP2014178512A (en) 2014-09-25

Similar Documents

Publication Title
JP5007563B2 (en) Music editing apparatus and method, and program
JP4114888B2 (en) Voice change identifying apparatus
JP4130190B2 (en) Speech synthesis system
US20060230909A1 (en) Operating method of a music composing device
US7383186B2 (en) Singing voice synthesizing apparatus with selective use of templates for attack and non-attack notes
JP2012532340A (en) Music education system
JP5024711B2 (en) Singing voice synthesis parameter data estimation system
US8027631B2 (en) Song practice support device
US8338687B2 (en) Apparatus and method for creating singing synthesizing database, and pitch curve generation apparatus and method
JPH10153998A Speech synthesizing method utilizing auxiliary information, recording medium recording a procedure for performing the method, and device for performing the method
US20100250257A1 (en) Voice quality edit device and voice quality edit method
CN1372246A Text-to-speech system with matched prosody templates
JP2007206317A (en) Authoring method and apparatus, and program
EP1872361A4 (en) Hybrid speech synthesizer, method and use
JPH0944171A (en) Karaoke device
US7094962B2 (en) Score data display/editing apparatus and program
JP2007249212A Method, computer program, and processor for text-to-speech synthesis
EP1302927B1 (en) Chord presenting apparatus and method
JP2007527555A Prosodic speech text codes and their use in computerized speech systems
US8447610B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
US10002604B2 (en) Voice synthesizing method and voice synthesizing apparatus
US8907195B1 (en) Method and apparatus for musical training
US9595256B2 (en) System and method for singing synthesis
US20120031257A1 (en) Tone synthesizing data generation apparatus and method
JP6070010B2 (en) Music data display device and music data display method

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20141023

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20150219

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20150324

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20150519

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20151110

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20160108

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20160510

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20160523