WO2005050624A1

WO2005050624A1 - Voice changer

Info

Publication number: WO2005050624A1
Application number: PCT/JP2004/017139
Authority: WO
Inventors: Natsuki Saito; Takahiro Kamai
Original assignee: Matsushita Electric Industrial Co., Ltd.
Priority date: 2003-11-21
Filing date: 2004-11-18
Publication date: 2005-06-02
Also published as: JP2007041012A

Abstract

There is provided a voice changer with improved usability from the viewpoint of the user interface. The voice changer includes: a voice quality adjustment unit (103) for indicating a range in which the voice quality can be changed and receiving a voice quality specified by a user within the indicated range; an adjustment control unit (104) for acquiring a feature parameter string (p1) and, according to the acquired feature parameter string (p1) and the voice quality received by the voice quality adjustment unit (103), changing the range indicated by the voice quality adjustment unit (103) to an appropriate range in which the voice quality indicated by a changed feature parameter string (p2) does not break down; and a conversion unit (101) for acquiring the feature parameter string (p1) and converting it to the changed feature parameter string (p2) indicating a voice of the quality received by the voice quality adjustment unit (103).

Description

Technical field

The present invention relates to a voice quality conversion device that converts voice quality of voice.

Background art

[0002] Some voice synthesizers that artificially generate voice include a voice quality conversion device that converts the voice quality of a synthesized voice (see, for example, Patent Documents 1 and 2).

[0003] The voice quality conversion device of Patent Document 1 described above is provided with a database in which synthesis unit data generated from the voices of a plurality of speakers is stored in advance. When the synthesis unit and the voice quality to be used for speech synthesis are specified, the voice quality conversion device first selects from the database the synthesis unit data closest to the specified synthesis unit. Next, the device checks how far the voice quality of the speaker of the selected synthesis unit data differs from the specified voice quality and, if the difference exceeds a predetermined level, performs voice quality conversion on the synthesis unit data so that it approaches the specified voice quality. Specifically, the device performs codebook mapping from the codebook (information representing characteristics of voice quality) of the selected synthesis unit data to a codebook whose voice quality matches the specified voice quality. In this way, the voice quality of the synthesis unit data is converted to the specified voice quality.

[0004] Further, the voice quality conversion device of Patent Document 2 converts the voice quality of synthesized voice by changing the sampling frequency used when converting digital voice data into an analog voice signal. In addition, this voice quality conversion device appropriately sets so-called prosodic information (spectral parameters), such as the fundamental frequency and the phoneme duration, in accordance with the change in the sampling frequency so that the output voice is appropriate.

[0005] In the voice quality conversion devices of Patent Document 1 and Patent Document 2, the converted voice quality may break down. Therefore, a voice quality conversion device that corrects a parameter indicating voice quality so that the converted voice quality does not break down has been proposed (for example, see Patent Document 3).

Patent Document 1: Japanese Patent Application Laid-Open No. 07-319495

Patent Document 2: JP 08-152900A

Patent Document 3: Japanese Patent Application Laid-Open No. 2000-187491

Disclosure of the invention

Problems to be solved by the invention

[0006] However, the voice quality conversion device of Patent Document 3 described above has the problem that it is not easy to use from the viewpoint of the user interface.

[0007] That is, although the voice quality conversion device of Patent Document 3 can prevent the voice quality from breaking down, the user cannot grasp how far the voice quality can be converted without breaking down. Therefore, the user may instruct the voice quality conversion device to convert to a desired voice quality even though that voice quality would break down. As a result, in order to prevent the breakdown, the voice quality conversion device converts the voice to a voice quality different from the one specified by the user.

[0008] Therefore, the present invention has been made in view of such a problem, and an object of the present invention is to provide a voice quality conversion device with improved usability from the viewpoint of the user interface.

Means for solving the problem

[0009] To achieve the above object, a voice quality conversion device according to the present invention is a voice quality conversion device for converting feature data indicating a feature of a voice into converted feature data indicating a voice having a voice quality different from that voice, and is characterized by comprising: acquisition means for acquiring the feature data; presentation means for presenting a range in which voice quality can be converted; reception means for receiving a voice quality specified by a user within the range presented by the presentation means; range changing means for changing, in accordance with the feature data acquired by the acquisition means and the voice quality received by the reception means, the range presented by the presentation means to an appropriate range in which the voice quality indicated by the converted feature data does not break down; and conversion means for converting the feature data acquired by the acquisition means into converted feature data indicating a voice of the voice quality received by the reception means.

[0010] With this, the convertible range of voice quality presented by the presentation means is changed to an appropriate range according to the feature data and the voice quality specified by the user. Therefore, when specifying a further voice quality, the user can generate converted feature data indicating the expected voice quality simply by specifying the voice quality within that appropriate range, without being aware of whether or not the voice quality of the converted feature data will break down. As a result, usability can be improved from the viewpoint of the user interface.

[0011] Further, the presentation means may present, for each of a plurality of types of voice quality, a range in which that voice quality can be converted; the reception means may receive, as a parameter, the degree of each voice quality specified by the user within the range presented by the presentation means; the range changing means may change the range of another voice quality presented by the presentation means in accordance with the voice quality parameter received by the reception means; and the conversion means may convert the feature data into the converted feature data according to the parameters of the respective voice qualities received by the reception means. For example, the presentation means presents, for each of the plurality of voice qualities, a figure and a pointer that moves on the figure according to a user operation, thereby presenting the range in which the voice quality can be converted. The reception means then identifies the parameter specified by the user based on the position of the pointer on the figure, and accepts that parameter.

[0012] With this, when the user causes the reception means to accept a parameter that increases, for example, the brightness within the convertible range presented for the voice quality indicating brightness, the convertible range presented by the presentation means for, for example, the voice quality indicating fast-talking is reduced. When the user then tries to specify a voice quality with an increased degree of fast-talking, the user can generate converted feature data indicating the expected voice quality simply by specifying the parameter within the reduced range for fast-talking, without being aware of whether or not the voice quality of the converted feature data will break down.

[0013] Further, the range changing means may change the convertible range by moving the pointer. For example, the presentation means displays the figure in a bar shape, and the range changing means changes the convertible range by moving the pointer along the longitudinal direction of the figure.

[0014] Thereby, the user can easily and visually understand that the span from the position of the pointer to one end of the figure is the convertible range.

[0015] Further, the presentation means may arrange the figures and pointers for the respective voice qualities in parallel so that the more similar the changes produced by two voice qualities are, the narrower the gap between them. Alternatively, the presentation means may arrange the figure and pointer for each voice quality along the same circumference so that the more similar the changes produced by two voice qualities are, the smaller the angle between them.

For example, the change produced by the voice quality indicating brightness and the change produced by the voice quality indicating fast-talking are similar. As a result, when the user moves the pointer corresponding to brightness toward one end of the figure so as to increase the degree of brightness, the range changing means also moves the pointer corresponding to fast-talking toward one end of its figure, so that the range by which fast-talking can be increased is reduced. Therefore, by displaying the figures and pointers corresponding to these voice qualities near each other, the user can easily recognize a change in the convertible range of the voice qualities.

[0017] Further, the range changing means may change the convertible range by deforming the figure. For example, the presentation means displays the figure in a bar shape, and the range changing means changes the convertible range by extending or shortening the figure in its longitudinal direction.

This allows the user to easily and visually understand that the span from the position of the pointer to one end of the figure is the convertible range.

Here, the speech synthesizer according to the present invention is a speech synthesizer that converts text indicated by text data into synthesized speech, and is characterized by comprising: feature data generating means for acquiring the text data and generating feature data indicating features of a voice corresponding to the text of the text data; acquisition means for acquiring the feature data generated by the feature data generating means; presentation means for presenting a range in which voice quality can be converted; reception means for receiving a voice quality specified by a user within the range presented by the presentation means; range changing means for changing, in accordance with the feature data acquired by the acquisition means and the voice quality received by the reception means, the range presented by the presentation means to an appropriate range in which the voice quality of the synthesized speech does not break down; conversion means for converting the feature data acquired by the acquisition means into converted feature data indicating a voice of the voice quality received by the reception means; and voice output means for generating and outputting the synthesized speech based on the converted feature data converted by the conversion means.

[0020] Thereby, the convertible range of voice quality presented by the presentation means is changed to an appropriate range according to the feature data and the voice quality specified by the user. Therefore, when trying to specify another voice quality, the user need not be conscious of whether or not the voice quality of the converted feature data will break down; as long as the voice quality is specified within the appropriate range, the text indicated by the text data can be converted into synthesized speech with the voice quality expected by the user. As a result, usability can be improved from the viewpoint of the user interface.

[0021] The present invention can be realized not only as such a voice quality conversion device or speech synthesis device, but also as a method and a program for the operations performed by the device, and as a storage medium storing the program.

The invention's effect

[0022] The voice quality conversion device of the present invention has the effect that usability can be improved from the viewpoint of the user interface.

Brief Description of Drawings

FIG. 1 is a configuration diagram of a voice quality conversion device according to Embodiment 1 of the present invention.

FIG. 2 is an explanatory diagram illustrating an example of an operation of the voice quality conversion device according to the first embodiment.

FIG. 3 is an explanatory diagram for explaining another example of the operation of the voice quality conversion device of the above.

FIG. 4 is an explanatory diagram for explaining still another example of the operation of the voice quality conversion device of the above.

FIG. 5 is an explanatory diagram for explaining still another example of the operation of the above voice quality conversion device.

FIG. 6 is an explanatory diagram for explaining still another example of the operation of the voice quality conversion device of the above.

FIG. 7 is an explanatory diagram for explaining a constraint condition by the Indigo algorithm of the above.

FIG. 8 is a configuration diagram of a voice quality conversion device according to Embodiment 2 of the present invention.

FIG. 9 is an explanatory diagram for explaining the content presented by the voice quality adjustment unit according to the embodiment.

FIG. 10 is a flowchart showing an operation of an adjustment control unit according to the embodiment.

FIG. 11 is an explanatory diagram for explaining the content presented by the voice quality adjustment unit according to the first modification of the above.

FIG. 12 is an explanatory diagram for describing contents presented by a voice quality adjustment unit according to the second modification of the above.

FIG. 13A is an explanatory diagram for describing a distance between voice qualities according to Modification 3 of the above.

FIG. 13B is a diagram showing a display content of a voice quality adjustment unit according to the third modification of the above.

FIG. 14A is a diagram showing a display content of a voice quality adjustment unit according to Modification 4 of the above.

FIG. 14B is an explanatory diagram for explaining how the voice quality adjusting unit according to Modification 4 of the above changes the display content.

FIG. 15 is a configuration diagram of a speech synthesis device according to Embodiment 3 of the present invention.

FIG. 16 is a configuration diagram of a speech synthesis device according to a first modification of the above.

FIG. 17 is a configuration diagram of a speech synthesis device according to Modification 2 of the above.

FIG. 18 is a configuration diagram of a speech synthesizer according to a third modification of the above.

FIG. 19 is a configuration diagram of a speech synthesizer according to a fourth modification of the above.

FIG. 20 is a configuration diagram of a speech synthesizer according to a fifth modification of the above embodiment.

Explanation of symbols

101 converter

103, 103a Voice quality adjustment unit

104, 104a to 104d Adjustment control unit

105 Transform coefficient storage

105a coefficient data

106 Limit value storage

201 Voice synthesis unit

202 Speech Synthesis Database

203 Waveform generator

204 Speaker

205 Feature table storage

206 Voice Analysis Unit

B range bar

P pointer

p1 feature parameter sequence

p2 Deformation feature parameter sequence

s1 waveform signal

td1 text data

td2 Waveform feature table

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(Embodiment 1)

[0026] The voice quality conversion apparatus according to the present embodiment converts voice quality while preventing voice quality breakdown, and includes a conversion unit 101, a voice quality adjustment unit 103, an adjustment control unit 104, a conversion coefficient storage unit 105, and a limit value storage unit 106.

[0027] The conversion unit 101 acquires a feature parameter sequence p1 indicating acoustic features of a voice. The feature parameter sequence p1 is data that indicates, as parameters, the acoustic features of the voice obtained by analyzing the voice frame by frame, and the original voice can be obtained by performing resynthesis based on it. The conversion unit 101 generates a deformed feature parameter sequence p2 by converting the acoustic feature parameters indicated by the feature parameter sequence p1 according to the instruction from the voice quality adjustment unit 103. Like the feature parameter sequence p1, the deformed feature parameter sequence p2 indicates the acoustic features of a voice as parameters, and is used to generate a synthesized voice. The voice quality of the synthesized voice (voice waveform) generated using the deformed feature parameter sequence p2 differs from the voice quality of the synthesized voice (voice waveform) generated using the feature parameter sequence p1, depending on that instruction.

[0028] The conversion coefficient storage unit 105 holds coefficient data serving as a template when the conversion unit 101 performs the conversion process.

When operated by the user, the voice quality adjustment unit 103 receives the converted voice quality expected by the user, and also receives instructions to change that voice quality from the adjustment control unit 104. Further, using the coefficient data stored in the conversion coefficient storage unit 105, the voice quality adjustment unit 103 specifies the conversion content according to the user's operation result and the instruction from the adjustment control unit 104, and instructs the conversion unit 101 to perform that conversion.

[0030] Specifically, the voice quality adjustment unit 103 displays, for each type of voice quality (for example, brightness and fast-talking), a range bar B indicating the convertible range of that voice quality and a pointer P that is movable on the range bar B and indicates the degree of voice quality conversion. The user operates the pointer P and moves it along the range bar B to set a desired voice quality.

[0031] The limit value storage unit 106 stores limit conditions (such as a limit value of a parameter indicating each acoustic feature) for obtaining a synthesized speech that maintains naturalness for the deformed feature parameter sequence p2.

[0032] The adjustment control unit 104 acquires the feature parameter sequence p1, and also acquires the result of the user's operation on the voice quality adjustment unit 103.

[0033] The adjustment control unit 104 estimates the deformed feature parameter sequence p2 based on the feature parameter sequence p1 and the operation result. Then, the adjustment control unit 104 compares the estimated deformed feature parameter sequence p2 with the limit condition in the limit value storage unit 106. If the deformed feature parameter sequence p2 does not satisfy the limit condition, the adjustment control unit 104 instructs the voice quality adjustment unit 103 to change the user's operation result so that the limit condition is satisfied. That is, the adjustment control unit 104 determines whether or not the conversion content set in the voice quality adjustment unit 103 would cause sound quality deterioration (breakdown of voice quality) in the deformed feature parameter sequence p2, and adjusts the conversion content so that no deterioration occurs.

[0034] Hereinafter, a process when the voice conversion device of the present embodiment performs voice conversion will be specifically described.

FIG. 2 is an explanatory diagram illustrating an example of an operation of the voice quality conversion device according to the present embodiment.

[0035] The voice quality adjustment unit 103 accepts the voice quality expected by the user when the user operates the plurality of pointers P. For example, as shown in FIG. 2, the voice quality adjustment unit 103 presents four voice qualities that can be converted: brightness, darkness, masculinity, and fast-talking. The conversion range of each of these voice qualities is indicated by a range bar B scaled from 0 to 10. The user designates the voice quality to be converted and the conversion amount by moving the pointer P corresponding to each voice quality within the range of 0 to 10 on the range bar B. When the scale value (indicated value) at which the pointer P is located is 0, the voice quality adjustment unit 103 determines that no conversion is required for that voice quality; the closer the indicated value is to 10, the larger the conversion the voice quality adjustment unit 103 determines to be required.

In FIG. 2, brightness is indicated by “bright”, darkness by “dark”, masculinity by “male”, and fast-talking by “early”. Further, the voice quality adjustment unit 103 may be constituted by a volume switch or the like.

[0038] The feature parameter sequence p1 indicates, for each analysis frame, five parameters of adjustable acoustic features: the fundamental frequency F0, the first formant frequency F1, the second formant frequency F2, the frame duration FR, and the sound source power PW.

[0039] The coefficient data 105a held in the conversion coefficient storage unit 105 indicates, for each voice quality of the voice quality adjustment unit 103, the value (coefficient) to be added to each of the above five acoustic feature parameters of the feature parameter sequence p1 when the indicated value is increased by 1.

That is, as shown in FIG. 2, when all the indicated values are set to 0 in the voice quality adjustment unit 103, the conversion unit 101 of the voice quality conversion device acquires the feature parameter sequence p1 and outputs a deformed feature parameter sequence p2 identical to the feature parameter sequence p1.

FIG. 3 is an explanatory diagram for describing another example of the operation of the voice quality conversion device according to the present embodiment.

When the user sets the indicated value of brightness in the voice quality adjustment unit 103 to 5 and the indicated value of fast-talking to 3, the conversion unit 101 multiplies the coefficient of each acoustic feature for brightness in the coefficient data 105a by the indicated value of brightness (5). Likewise, the conversion unit 101 multiplies the coefficient of each acoustic feature for fast-talking in the coefficient data 105a by the indicated value of fast-talking (3). The conversion unit 101 then sums these products for each acoustic feature and adds the result to the corresponding value of the feature parameter sequence p1. As a result, the conversion unit 101 generates the deformed feature parameter sequence p2.

[0043] For example, when the value of the fundamental frequency F0 of one analysis frame of the feature parameter sequence p1 is 300, the conversion unit 101 calculates the fundamental frequency F0 of the corresponding analysis frame of the deformed feature parameter sequence p2 as 300 + 5 × (+5) + 3 × (+1) = 328. The same calculation is performed for the parameters of the other acoustic features.
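The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the per-unit F0 coefficients are the ones that appear in the constraint listing for FIG. 7 (brightness +5, darkness -5, masculinity -3, fast-talking +1), while the dictionary layout and the names `F0_COEFF` and `convert_f0` are assumptions made for this example.

```python
# Per-unit F0 coefficients for each voice quality. The values follow the
# FIG. 7 constraint listing; the names and layout are hypothetical.
F0_COEFF = {"brightness": 5, "darkness": -5, "masculinity": -3, "fast": 1}


def convert_f0(f0, indicated):
    """Add (indicated value x coefficient) for every voice quality to the input F0."""
    return f0 + sum(F0_COEFF[quality] * value for quality, value in indicated.items())


# Brightness 5 and fast-talking 3 applied to an input F0 of 300.
print(convert_f0(300, {"brightness": 5, "fast": 3}))  # 328
```

With all indicated values at 0 the function returns the input unchanged, which matches the FIG. 2 case.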

FIG. 4 is an explanatory diagram for explaining still another example of the operation of the voice quality conversion device according to the present embodiment.

After the state of the voice quality adjustment unit 103 shown in FIG. 3, suppose the user sets the indicated value of darkness to 7 in the voice quality adjustment unit 103. When the value is set in this way, the adjustment control unit 104 changes the indicated values set in the voice quality adjustment unit 103 to indicated values that are easy for the user to operate.

That is, as shown in the coefficient data 105a of the conversion coefficient storage unit 105, the coefficient of each acoustic feature for darkness is the inverse of the coefficient of that acoustic feature for brightness. Therefore, reducing the indicated value of brightness has the same effect as increasing the indicated value of darkness. The adjustment control unit 104 identifies this relationship between brightness and darkness from the coefficient data 105a and, when the indicated value of darkness is set to 7, first instructs the voice quality adjustment unit 103 to reduce the indicated value of brightness from 5 to 0, and further instructs the voice quality adjustment unit 103 to reduce the indicated value of darkness from 7 to 2.

[0048] Specifically, in the case shown in FIG. 4, the indicated value of darkness is first set to 7 while the indicated value of brightness is 5. Here, the effect of changing the indicated value of brightness from 5 to 0 is the same as the effect of increasing the indicated value of darkness by 5. Therefore, instead of setting the indicated value of darkness to 7, the adjustment control unit 104 determines that the indicated value of brightness should be set to 0 and the indicated value of darkness should be set to 2. The adjustment control unit 104 then notifies the voice quality adjustment unit 103 of this determination result, and the indicated values set by the user are changed accordingly.
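The brightness/darkness adjustment can be illustrated with a short Python sketch. It assumes that the two voice qualities have exactly opposite coefficients, so that a common amount can be removed from both without changing the converted result. The function name `cancel_opposites` and the hard-coded list of opposing pairs are hypothetical; the description says only that the adjustment control unit 104 identifies the relationship from the coefficient data 105a.

```python
def cancel_opposites(indicated, pairs=(("brightness", "darkness"),)):
    """Return a copy of the indicated values with the amount common to each
    opposing pair removed, so that at most one value of the pair is non-zero."""
    result = dict(indicated)
    for first, second in pairs:
        common = min(result.get(first, 0), result.get(second, 0))
        result[first] = result.get(first, 0) - common
        result[second] = result.get(second, 0) - common
    return result


# FIG. 4 case: brightness is 5 and the user then sets darkness to 7.
before = {"brightness": 5, "darkness": 7, "masculinity": 0, "fast": 3}
print(cancel_opposites(before))
# {'brightness': 0, 'darkness': 2, 'masculinity': 0, 'fast': 3}
```

Because the net contribution of the pair is unchanged (5 - 7 = 0 - 2), the converted parameters are the same before and after the adjustment.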

As described above, the adjustment control unit 104 adjusts the indicated values set in the voice quality adjustment unit 103 by the user so that each indicated value becomes as small as possible, which makes it possible to build an interface that is easy for the user to operate.

FIG. 5 is an explanatory diagram for explaining still another example of the operation of the voice conversion device according to the present embodiment.

The user sets, for example, the indicated value of brightness to 10. In addition, the feature parameter sequence p1 indicates, in one analysis frame, the fundamental frequency F0 = 300, the first formant frequency F1 = 500, the second formant frequency F2 = 1600, the frame duration FR = 50, and the sound source power PW = 30. In such a case, as described above, the conversion unit 101 outputs a deformed feature parameter sequence p2 that indicates, for the corresponding analysis frame, the fundamental frequency F0 = 300 + 10 × (+5) = 350, the first formant frequency F1 = 500 + 10 × (+2) = 520, the second formant frequency F2 = 1600 + 10 × (+1) = 1610, the frame duration FR = 50 + 10 × (−1) = 40, and the sound source power PW = 30 + 10 × (+1) = 40.

Here, limit value storage section 106 stores limit conditions indicating that the maximum value of fundamental frequency F0 is 350. That is, the limit condition indicates that when the value of the fundamental frequency F0 of the modified feature parameter sequence p2 exceeds 350, the sound quality of the synthesized sound generated based on the modified feature parameter sequence p2 is significantly deteriorated.

FIG. 6 is an explanatory diagram for explaining still another example of the operation of the voice quality conversion device according to the present embodiment.

From the state of the voice quality adjustment unit 103 shown in FIG. 5, that is, the state in which the indicated value of brightness is 10 and all other indicated values are 0, the user further sets the indicated value of fast-talking to 5.

[0055] The adjustment control unit 104 estimates the deformed feature parameter sequence p2 that would result if conversion processing according to the indicated values set by the user in the voice quality adjustment unit 103 were performed on the feature parameter sequence p1. The adjustment control unit 104 determines whether or not the parameter of each acoustic feature of the estimated deformed feature parameter sequence p2 satisfies the limit condition of the limit value storage unit 106. When at least one parameter of the estimated deformed feature parameter sequence p2 does not satisfy the limit condition, the adjustment control unit 104 instructs the voice quality adjustment unit 103 to change the indicated values so that the parameter satisfies the limit condition. At this time, for example, the adjustment control unit 104 gives an instruction that prioritizes the indicated value most recently set by the user, or an instruction that prioritizes the largest indicated value.

Specifically, as shown in FIG. 6, when the indicated value of fast-talking is increased to 5, the adjustment control unit 104 estimates the deformed feature parameter sequence p2 for this case and determines that the fundamental frequency F0 (355) of the deformed feature parameter sequence p2 does not satisfy the limit condition (350 or less). As a result, the adjustment control unit 104 instructs the voice quality adjustment unit 103 to reduce the indicated value of brightness by 1 so that the fundamental frequency F0 of the deformed feature parameter sequence p2 becomes 350 or less.

As a result, the voice quality adjustment unit 103 changes the indicated value of brightness from 10 to 9. In this way, the adjustment control unit 104 adjusts the indicated values according to the limit condition, so that the user can perform the voice quality conversion operation without the voice quality breaking down and without being aware of the limit value of each acoustic feature parameter.
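The FIG. 6 adjustment can be sketched as a simple loop. This is a minimal illustration of the "prioritize the most recently set value" policy mentioned above, not the device's actual procedure: the step size of 1, the rule of lowering the largest other F0-raising value first, and the names `estimate_f0` and `adjust` are all assumptions. Only the F0 limit (350) and the F0 coefficients come from the description.

```python
F0_COEFF = {"brightness": 5, "darkness": -5, "masculinity": -3, "fast": 1}
F0_MAX = 350  # limit condition held in the limit value storage unit 106


def estimate_f0(f0, indicated):
    """Estimate the F0 of p2 for the given indicated values."""
    return f0 + sum(F0_COEFF[q] * v for q, v in indicated.items())


def adjust(f0, indicated, latest):
    """Keep the most recently set value (`latest`) and lower other F0-raising
    values one step at a time until the estimated F0 satisfies the limit."""
    result = dict(indicated)
    while estimate_f0(f0, result) > F0_MAX:
        candidates = [q for q in result
                      if q != latest and F0_COEFF[q] > 0 and result[q] > 0]
        if not candidates:
            break  # nothing left to lower; another policy would be needed here
        largest = max(candidates, key=lambda name: result[name])
        result[largest] -= 1
    return result


# FIG. 6 case: brightness is 10 and the user then sets fast-talking to 5.
state = {"brightness": 10, "darkness": 0, "masculinity": 0, "fast": 5}
print(adjust(300, state, latest="fast"))
# {'brightness': 9, 'darkness': 0, 'masculinity': 0, 'fast': 5}
```

The loop stops as soon as the estimate reaches 300 + 9 × 5 + 5 × 1 = 350, which reproduces the change of brightness from 10 to 9.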

Note that the adjustment control unit 104 may refer to the limit conditions stored in the limit value storage unit 106 as needed. In addition, the limit condition may indicate a limit value for an individual acoustic feature parameter, such as that the value of the fundamental frequency F0 must not exceed 350, or it may indicate a condition on a combination of parameters, such as that the sum of the value of the fundamental frequency F0 and the value of the second formant frequency F2 must not exceed 2000.
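As a rough illustration, both kinds of limit condition can be expressed as predicates over one analysis frame. The sketch below is an assumption about representation only: the names `LIMITS`, `apply_brightness`, and `violated` are hypothetical, and treating the two example conditions as being in force at the same time is a choice made for this example. The brightness coefficients and the frame values are those of the FIG. 5 example.

```python
# Coefficients added per unit of brightness, as used in the FIG. 5 example.
BRIGHTNESS_COEFF = {"F0": 5, "F1": 2, "F2": 1, "FR": -1, "PW": 1}

# Limit conditions as named predicates over one analysis frame (hypothetical layout).
LIMITS = {
    "F0 <= 350": lambda frame: frame["F0"] <= 350,
    "F0 + F2 <= 2000": lambda frame: frame["F0"] + frame["F2"] <= 2000,
}


def apply_brightness(frame, value):
    """Return the frame of p2 obtained by applying a brightness value to a frame of p1."""
    return {name: frame[name] + value * BRIGHTNESS_COEFF[name] for name in frame}


def violated(frame):
    """Return the names of all limit conditions the frame does not satisfy."""
    return [name for name, holds in LIMITS.items() if not holds(frame)]


p1_frame = {"F0": 300, "F1": 500, "F2": 1600, "FR": 50, "PW": 30}
p2_frame = apply_brightness(p1_frame, 10)
print(p2_frame)            # {'F0': 350, 'F1': 520, 'F2': 1610, 'FR': 40, 'PW': 40}
print(violated(p2_frame))  # []
```

Keeping each condition as a separate named predicate makes it straightforward to report which condition failed, which is the information an adjustment step needs.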

Note that the conversion applied to the feature parameter sequence p1 by the conversion unit 101 need not be uniform across all analysis frames, and the coefficient data 105a of the conversion coefficient storage unit 105 may differ for each analysis frame.

Note that the adjustment of the indicated values by the adjustment control unit 104 may be performed automatically using a constraint satisfaction algorithm. An example of a constraint satisfaction algorithm is the Indigo algorithm (A. Borning, R. Anderson, B. Freeman-Benson: The Indigo Algorithm, TR 96-05-01, Department of Computer Science and Engineering, University of Washington, July 1996).

FIG. 7 is an explanatory diagram for describing a constraint condition by the Indigo algorithm.

The constraint condition shown in FIG. 7 is for adjusting the indicated value shown in FIG. 6 with respect to the fundamental frequency F0, and is described as follows in the constraint hierarchy of the Indigo algorithm.

[0062] REQUIRED constraint C1: output F0 ≤ 350

REQUIRED constraint C2: input F0 = 300

REQUIRED constraint C3: brightness × 5 = t1

REQUIRED constraint C4: darkness × (−5) = t2

REQUIRED constraint C5: masculinity × (−3) = t3

REQUIRED constraint C6: fast-talking × 1 = t4

REQUIRED constraint C7: t1 + t2 = t5

REQUIRED constraint C8: t3 + t4 = t6

REQUIRED constraint C9: t5 + t6 = t7

REQUIRED constraint C10: input F0 + t7 = t8

REQUIRED constraint C11: t8 = output F0

STRONG constraint C12: fast-talking = 5

WEAK constraint C13: masculinity = 0

WEAK constraint C14: darkness = 0

WEAK constraint C15: brightness = 10

Note that the variables t1 to t8 hold intermediate results of the calculation. Although omitted from FIG. 7 for simplicity, it is desirable, in order to obtain better results, to also provide REQUIRED constraints that bind each indicated value to a value between 0 and 10.
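The way such a constraint hierarchy narrows the indicated values can be sketched with simple interval (bounds) propagation. The code below is a simplified illustration in the spirit of the Indigo algorithm, not an implementation of it: REQUIRED constraints are applied first, and each STRONG or WEAK equality is then satisfied as closely as the current range allows. The class and function names are hypothetical.

```python
import math

INF = math.inf


class Intervals:
    """Interval store with bounds propagation (a simplified, Indigo-like sketch)."""

    def __init__(self, names):
        self.rng = {n: (-INF, INF) for n in names}
        self.relations = []  # linear relations between variables

    def narrow(self, name, lo, hi):
        """Intersect the range of `name` with [lo, hi]; report whether it changed."""
        old = self.rng[name]
        new = (max(old[0], lo), min(old[1], hi))
        self.rng[name] = new
        return new != old

    def propagate(self):
        """Re-apply all relations until no range changes any more."""
        changed = True
        while changed:
            changed = False
            for rel in self.relations:
                changed |= rel(self)

    def prefer(self, name, value):
        """Satisfy `name = value` as closely as the current range allows."""
        lo, hi = self.rng[name]
        v = min(max(value, lo), hi)
        self.rng[name] = (v, v)
        self.propagate()


def scale(x, k, y):
    """Relation y = k * x, with k nonzero."""
    def apply(s):
        (xl, xh), (yl, yh) = s.rng[x], s.rng[y]
        a, b = sorted((k * xl, k * xh))
        c, d = sorted((yl / k, yh / k))
        return s.narrow(y, a, b) | s.narrow(x, c, d)
    return apply


def add(a, b, c):
    """Relation a + b = c."""
    def apply(s):
        (al, ah), (bl, bh), (cl, ch) = s.rng[a], s.rng[b], s.rng[c]
        return (s.narrow(c, al + bl, ah + bh)
                | s.narrow(a, cl - bh, ch - bl)
                | s.narrow(b, cl - ah, ch - al))
    return apply


s = Intervals(["in_f0", "out_f0", "brightness", "darkness", "masculinity", "fast",
               "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"])
s.narrow("out_f0", -INF, 350)         # C1
s.narrow("in_f0", 300, 300)           # C2
s.relations += [
    scale("brightness", 5, "t1"),     # C3
    scale("darkness", -5, "t2"),      # C4
    scale("masculinity", -3, "t3"),   # C5
    scale("fast", 1, "t4"),           # C6
    add("t1", "t2", "t5"),            # C7
    add("t3", "t4", "t6"),            # C8
    add("t5", "t6", "t7"),            # C9
    add("in_f0", "t7", "t8"),         # C10
    scale("t8", 1, "out_f0"),         # C11
]
s.propagate()
s.prefer("fast", 5)           # STRONG C12
s.prefer("masculinity", 0)    # WEAK C13
s.prefer("darkness", 0)       # WEAK C14
s.prefer("brightness", 10)    # WEAK C15: only 9 is still reachable
print(s.rng["brightness"])
```

Because the STRONG constraint on fast-talking is applied before the WEAK constraint on brightness, brightness is the value that gives way and ends up fixed at 9.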

An outline of the processing when the above constraints are solved by the Indigo algorithm is shown below.

Initial state: the range of every variable is [−∞, +∞]

C1 added: the range of output F0 becomes [−∞, 350]

C2 added: the range of input F0 becomes [300, 300]

C3 to C10 added: no change in the range of any variable

C11 added: the range of t8 becomes [−∞, 350]

Propagating C10: the range of t7 becomes [−∞, 50]

C12 added: the range of fast-talking becomes [5, 5]

Propagating C6: the range of t4 becomes [5, 5]

C13 added: the range of masculinity becomes [0, 0]

Propagating C5: the range of t3 becomes [0, 0]

Propagating C8: the range of t6 becomes [5, 5]

Propagating C9: the range of t5 becomes [−∞, 45]

C14 added: the range of darkness becomes [0, 0]

Propagating C4: the range of t2 becomes [0, 0]

Propagating C7: the range of t1 becomes [−∞, 45]

Propagating C3: the range of brightness becomes [−∞, 9]

C15 added: the range of brightness becomes [9, 9]

(Embodiment 2)

The voice quality conversion device according to the present embodiment improves usability from the viewpoint of the user interface, and includes a conversion unit 101, a voice quality adjustment unit 103a, an adjustment control unit 104a, a conversion coefficient storage unit 105, and a limit value storage unit 106. Note that, in the present embodiment, the components denoted by the same reference numerals as those of the first embodiment are the same as those components of the first embodiment, and their detailed description is omitted.

[0066] The voice quality adjustment unit 103a, when operated by the user, receives the converted voice quality that the user expects. That is, the voice quality adjustment unit 103a functions as a receiving means that receives a voice quality specified by the user. Further, the voice quality adjustment unit 103a specifies the conversion content corresponding to the user's operation by using the coefficient data 105a stored in the conversion coefficient storage unit 105, and instructs the conversion unit 101 to perform that conversion. Specifically, like the voice quality adjustment unit 103 of the first embodiment, the voice quality adjustment unit 103a displays, for each type of voice quality such as brightness or fast-talking, a range bar B indicating the convertible range (absolute range) of the voice quality and a pointer P that is movable on the range bar B and indicates the degree of the voice quality. The user moves the pointer P along the range bar B to set a desired voice quality. The voice quality adjustment unit 103a also functions as a presentation means that, by presenting the range bar B and the pointer P, presents the range over which further conversion is possible from the current degree of voice quality conversion.

[0067] Further, the voice quality adjustment unit 103a in the present embodiment receives an instruction of a conversion range for each voice quality from the adjustment control unit 104a, and presents only the instructed conversion range to the user. That is, the voice quality adjustment unit 103a changes the length of the range bar B to a length corresponding to the conversion range instructed by the adjustment control unit 104a, and prohibits the pointer P from being moved to a position off the range bar B.

The adjustment control unit 104a acquires the feature parameter sequence p1, the result of the user's operation on the voice quality adjustment unit 103a, and the limit conditions of the limit value storage unit 106. The adjustment control unit 104a then derives an appropriate conversion range of each voice quality in the voice quality adjustment unit 103a based on the feature parameter sequence p1, the operation result, and the limit conditions, and instructs the voice quality adjustment unit 103a of the derived conversion range. That is, the adjustment control unit 104a functions as a range changing means that changes the range presented by the voice quality adjustment unit 103a, according to the feature parameter sequence p1 and the voice quality received from the user by the voice quality adjustment unit 103a, to an appropriate range in which the voice quality indicated by the deformed feature parameter sequence p2 does not break down.

FIG. 9 is an explanatory diagram for describing the content presented by voice quality adjusting section 103a of the present embodiment.

[0070] For example, as shown in (a) of FIG. 9, the user first sets the pointer P of each voice quality of the voice quality adjustment unit 103a so that all indicated values are 0. Next, the user sets the pointer P of the voice quality indicating brightness so that its indicated value becomes 10.

Here, as described with reference to FIG. 5, the adjustment control unit 104a estimates that, with the setting shown in (a) of FIG. 9, a deformed feature parameter sequence p2 whose fundamental frequency F0 is 350 will be generated. Further, the adjustment control unit 104a derives, for the voice qualities other than brightness, appropriate conversion ranges that satisfy the limit conditions of the limit value storage unit 106. For example, suppose the estimated deformed feature parameter sequence p2 indicates a fundamental frequency F0 = 350, a first formant frequency F1 = 520, a second formant frequency F2 = 1610, a frame duration FR = 40, and a sound source power PW = 40, and the limit conditions require that, in the deformed feature parameter sequence p2, the fundamental frequency F0 be 350 or less, the first formant frequency F1 be 600 or less, the second formant frequency F2 be 1700 or less, the frame duration FR be 100 or less, and the sound source power PW be 50 or less. In this case, the adjustment control unit 104a compares each parameter of the deformed feature parameter sequence p2 with the limit conditions and determines that the fundamental frequency F0 cannot be increased any further. That is, the adjustment control unit 104a determines that the conversion range of the fast-talking voice quality is limited to scale 0 only, and instructs the voice quality adjustment unit 103a of the determination result.

As a result, as shown in (b) of FIG. 9, the voice quality adjustment unit 103a shortens the range bar B corresponding to the fast-talking voice quality from a length of 10 scale divisions to a length of 0 scale divisions, and displays it. Because the fast-talking range bar B is thus reduced to 0 divisions in conjunction with the brightness indicated value being set to 10, the user cannot move the fast-talking pointer P. This prevents the voice degradation, that is, the breakdown of voice quality, that would be caused by increasing the indicated value of fast-talking. Further, when the user operates the pointer P corresponding to brightness and changes its indicated value from 10 to 9, the adjustment control unit 104a again derives, based on that setting and in the same way as above, appropriate conversion ranges for the voice qualities other than brightness that satisfy the limit conditions of the limit value storage unit 106 (for example, that the fundamental frequency F0 of the deformed feature parameter sequence p2 be 350 or less). That is, the adjustment control unit 104a determines that the conversion range of the fast-talking voice quality is limited to five scale divisions, and instructs the voice quality adjustment unit 103a of the determination result.

As a result, as shown in (c) of FIG. 9, the voice quality adjustment unit 103a lengthens the range bar B corresponding to the fast-talking voice quality to a length of five scale divisions, that is, a length corresponding to scale values 0 to 5, and displays it.

FIG. 10 is a flowchart showing the operation of adjustment control section 104a in the present embodiment.

First, the adjustment control unit 104a acquires the feature parameter sequence p1 (step S100), and identifies the settings made by the user on the voice quality adjustment unit 103a (step S102).

Next, the adjustment control unit 104a estimates the deformed feature parameter sequence p2 based on the feature parameter sequence p1 and the settings of the voice quality adjustment unit 103a (step S104). The adjustment control unit 104a then derives an appropriate conversion range for each voice quality of the voice quality adjustment unit 103a based on the estimated deformed feature parameter sequence p2 and the limit conditions of the limit value storage unit 106 (step S106).

[0077] Then, the adjustment control unit 104a instructs the voice quality adjustment unit 103a of the derived appropriate conversion range, and causes it to display a range bar B having a length corresponding to that conversion range (step S108).
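Steps S100 to S106 can be sketched as follows for the fast-talking example of FIG. 9. This is a minimal illustration under stated assumptions: only the F0 limit is checked, the deformed F0 is assumed to be the input F0 plus a weighted sum of the indicated values with the weights of the FIG. 7 example, and the function names are hypothetical.

```python
# Assumed linear F0 model; the weights follow the FIG. 7 example.
F0_COEFFS = {"brightness": 5, "darkness": -5, "masculinity": -3, "fast_talking": 1}


def estimate_f0(input_f0, indicated):
    """S104: estimate the F0 of the deformed feature parameter sequence p2."""
    return input_f0 + sum(F0_COEFFS[q] * v for q, v in indicated.items())


def derive_range(quality, input_f0, indicated, f0_limit=350, scale_max=10):
    """S106: list the scale values of `quality` that keep the estimated F0 within the limit."""
    allowed = []
    for value in range(scale_max + 1):
        trial = dict(indicated)
        trial[quality] = value
        if estimate_f0(input_f0, trial) <= f0_limit:
            allowed.append(value)
    return allowed


input_f0 = 300  # S100: F0 taken from the feature parameter sequence p1
settings = {"brightness": 10, "darkness": 0, "masculinity": 0, "fast_talking": 0}  # S102
print(derive_range("fast_talking", input_f0, settings))  # [0]
settings["brightness"] = 9
print(derive_range("fast_talking", input_f0, settings))  # [0, 1, 2, 3, 4, 5]
```

The resulting list corresponds to the length of the range bar B that would be displayed in step S108: scale 0 only when brightness is 10, and scale values 0 to 5 when brightness is 9.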

As described above, in the present embodiment, the convertible range of each voice quality presented by the voice quality adjustment unit 103a is changed to an appropriate range according to the feature parameter sequence p1 and the voice quality specified by the user. Therefore, when the user wants to specify another voice quality, the user can specify it within an appropriate range without having to consider whether the voice quality of the deformed feature parameter sequence p2 will break down, and a deformed feature parameter sequence indicating the voice quality expected by the user can be generated. As a result, usability can be improved from the viewpoint of the user interface.

(Modification 1)

Here, a first modified example regarding the display method of the voice quality adjustment unit 103a in the present embodiment will be described.

[0080] The voice quality adjustment unit 103a according to the present modification does not change the length of the range bar B; instead, it changes the position of the pointer P so that the pointer P can be moved only by the conversion range instructed by the adjustment control unit 104a.

FIG. 11 is an explanatory diagram for describing the content presented by the voice quality adjusting unit 103a according to the present modification.

For example, as shown in (a) of FIG. 11, the user first sets the pointer P of each voice quality of the voice quality adjustment unit 103a so that all indicated values are 0. Next, the user sets the pointer P of the voice quality indicating brightness so that its indicated value becomes 10.

Here, as in the above description, the adjustment control unit 104a determines that the conversion range of the fast-talking voice quality is limited to scale 0 only, and instructs the voice quality adjustment unit 103a of the determination result.

Upon receiving such an instruction, the voice quality adjustment unit 103a moves the pointer P corresponding to the fast-talking voice quality to the position of scale 10 and displays it, as shown in (b) of FIG. 11. That is, the instruction from the adjustment control unit 104a indicates that the conversion range of the fast-talking voice quality is limited to scale 0 only, in other words that the pointer P must not be movable in the direction of increasing scale. The voice quality adjustment unit 103a according to the present modification therefore moves the pointer P to a position from which it cannot be moved in the direction of increasing scale, that is, the position of scale 10, and displays it. In this case, however, the voice quality adjustment unit 103a merely moves the pointer P corresponding to the fast-talking voice quality; it does not instruct the conversion unit 101 to convert the acoustic feature parameters with the indicated value of fast-talking set to 10. Because the fast-talking pointer P is thus displayed at scale 10 (the maximum value) in conjunction with the brightness indicated value being set to 10, the voice degradation, that is, the breakdown of voice quality, that would be caused by increasing the indicated value of fast-talking can be prevented.

Further, when the user operates the pointer P corresponding to brightness and changes its indicated value from 10 to 9, the adjustment control unit 104a again determines, based on that setting and in the same way as above, that the conversion range of the fast-talking voice quality is limited to five scale divisions, and instructs the voice quality adjustment unit 103a of the determination result.

[0086] Upon receiving such an instruction, the voice quality adjustment unit 103a moves the pointer P corresponding to the fast-talking voice quality to the position of scale 5 and displays it, as shown in (c) of FIG. 11. That is, the instruction from the adjustment control unit 104a indicates that the conversion range of the fast-talking voice quality is limited to five scale divisions, in other words that the pointer P may be moved by five divisions in the direction of increasing scale. The voice quality adjustment unit 103a according to the present modification therefore moves the pointer P to a position from which it can be moved by five divisions in the increasing direction, that is, the position of scale 5, and displays it. In this case as well, however, the voice quality adjustment unit 103a merely moves the pointer P corresponding to the fast-talking voice quality; it does not instruct the conversion unit 101 to convert the acoustic feature parameters with the indicated value of fast-talking set to 5.

(Modification 2)

Here, a second modified example regarding the display method of voice quality adjusting section 103a in the present embodiment will be described.

[0088] The voice quality adjustment unit 103a according to the present modification displays the movable range of the pointer P in characters, without changing the length of the range bar B.

FIG. 12 is an explanatory diagram for describing the content presented by voice quality adjusting section 103a according to the present modification.

For example, as shown in (a) of FIG. 12, first, the user sets the pointers P of each voice quality of the voice quality adjustment unit 103a such that the indicated values are all zero. Next, the user sets the pointer P of the voice quality indicating the brightness of the voice quality adjusting unit 103a so that the indicated value becomes 10.

Here, as described above, the adjustment control unit 104a determines that the conversion range of the fast-talking voice quality is limited to scale 0 only, and instructs the voice quality adjustment unit 103a of the determination result.

[0092] Upon receiving such an instruction, the voice quality adjustment unit 103a displays the words "up to here" at the position of scale 0 on the range bar B corresponding to the fast-talking voice quality, as shown in (b) of FIG. 12. While these words are displayed, even if the user tries to move the pointer P corresponding to the fast-talking voice quality, the voice quality adjustment unit 103a does not accept the operation and keeps the position of the pointer P fixed.

[0093] Further, when the user operates the pointer P corresponding to brightness and changes its indicated value from 10 to 9, the adjustment control unit 104a again determines, based on that setting and in the same way as above, that the conversion range of the fast-talking voice quality is limited to five scale divisions, and instructs the voice quality adjustment unit 103a of the determination result.

[0094] Upon receiving such an instruction, the voice quality adjustment unit 103a displays the words "up to here" at the position of scale 5 on the range bar B corresponding to the fast-talking voice quality, as shown in (c) of FIG. 12. While these words are displayed, even if the user tries to move the pointer P corresponding to the fast-talking voice quality beyond scale 5, the voice quality adjustment unit 103a does not accept the operation and keeps the position of the pointer P at scale 5 or less.
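The behavior of a range bar with an "up to here" marker can be sketched as a small class that rejects pointer moves beyond the current limit. The class and method names are hypothetical, and how a pointer that already lies beyond a newly lowered limit should be treated is not specified in the text, so it is left out here.

```python
class RangeBar:
    """Range bar of fixed length with a movable-range limit ("up to here" marker)."""

    def __init__(self, scale_max=10):
        self.scale_max = scale_max
        self.limit = scale_max  # scale position of the "up to here" marker
        self.pointer = 0        # current indicated value

    def set_limit(self, limit):
        """Apply the conversion range instructed by the adjustment control unit."""
        self.limit = limit

    def move_pointer(self, target):
        """Accept the move only if the target lies within 0..limit."""
        if 0 <= target <= self.limit:
            self.pointer = target
            return True
        return False  # operation not accepted; the pointer stays where it was


fast_talking = RangeBar()
fast_talking.set_limit(5)
print(fast_talking.move_pointer(3))  # True
print(fast_talking.move_pointer(7))  # False
print(fast_talking.pointer)          # 3
```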

[0095] In this modification, the words "up to here" are displayed, but other characters or figures may be displayed instead as long as they indicate the movable range of the pointer P.

[0096] (Modification 3)

A modification example regarding the arrangement of the range bar B and the pointer P of the voice quality adjustment unit 103a according to the present embodiment will be described.

[0097] The voice quality adjustment unit 103a according to the present modification arranges the range bars B and pointers P corresponding to the voice qualities so that voice qualities whose conversion contents are more similar are placed closer to each other, and presents them to the user.

[0098] The voice quality adjustment unit 103a obtains the coefficient data 105a stored in the conversion coefficient storage unit 105 and, based on the coefficient data 105a, determines the degree of similarity of the conversion contents between voice qualities such as brightness and darkness. For example, the voice quality adjustment unit 103a derives, between two voice qualities indicated by the coefficient data 105a, the difference in the coefficient of each acoustic feature, and obtains the Euclidean distance (hereinafter simply referred to as the distance) between the voice qualities from these differences. Based on this distance, the voice quality adjustment unit 103a determines the similarity between the voice qualities.

FIG. 13A is an explanatory diagram for describing the distance between voice qualities.

The voice quality adjustment unit 103a calculates the distances between the voice qualities as shown in FIG. 13A. For example, the distance between the voice quality indicating masculinity and the voice quality indicating darkness is 5.4, and the distance between the voice quality indicating brightness and the voice quality indicating darkness is 11.3.

[0100] The voice quality adjustment unit 103a determines that two voice qualities are more similar the smaller the calculated distance between them, and presents the range bars B and pointers P of more similar voice qualities closer to each other.

FIG. 13B is a diagram showing display contents of voice quality adjustment section 103a.

For example, taking the voice quality indicating masculinity as the reference, the distance between the voice quality indicating masculinity and the voice quality indicating darkness is 5.4, the distance between the voice quality indicating masculinity and the voice quality indicating brightness is 10.2, and the distance between the voice quality indicating masculinity and the voice quality indicating fast-talking is 10.8. Therefore, as shown in FIG. 13B, the voice quality adjustment unit 103a presents the range bar B and pointer P of each voice quality in the order of masculinity, darkness, brightness, and fast-talking.
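The distance calculation and the resulting display order can be sketched as follows. The ordering uses the distances stated above for FIG. 13A; the two-element coefficient vectors passed to `euclidean` are hypothetical values used only to exercise the function, since the actual coefficient data 105a is not reproduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two coefficient vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def display_order(reference, distances):
    """Order voice qualities by increasing distance from the reference quality."""
    return [reference] + sorted(distances, key=distances.get)


# Distances from the masculinity voice quality, as stated for FIG. 13A.
from_masculinity = {"brightness": 10.2, "fast_talking": 10.8, "darkness": 5.4}
print(display_order("masculinity", from_masculinity))
# ['masculinity', 'darkness', 'brightness', 'fast_talking']

# Hypothetical coefficient vectors, only to show the distance computation.
print(euclidean((5, 0), (2, 4)))  # 5.0
```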

[0102] When combined with the first modification, this modification can provide a voice quality conversion operation that is intuitive and easy for the user to understand.

That is, because the range bars B and pointers P of voice qualities with similar conversion contents are arranged close to each other, when the user operates the pointer P of a certain voice quality, the pointers P of the voice qualities arranged nearby move in the same direction, and the pointers P of the voice qualities arranged farther away move in the opposite direction. The user can therefore intuitively understand how the voice quality is converted by operating the pointer P.

[0104] (Modification 4)

Another modified example of the arrangement of the range bar B and the pointer P of the voice quality adjusting unit 103a according to the present embodiment will be described.

[0105] The voice quality adjustment unit 103a of the third modification arranges the range bars B and pointers P corresponding to the voice qualities in a line so that voice qualities with more similar conversion contents are closer to each other. In contrast, the voice quality adjustment unit 103a according to the present modification arranges the range bar B and pointer P of each voice quality along the same circumference so that the more similar the conversion contents of two voice qualities, the smaller the angle between them.

FIG. 14A is a diagram showing the display content of voice quality adjusting section 103a.

As shown in FIG. 14A, the voice quality adjustment unit 103a gathers the lower ends of the range bars B of the voice qualities at one point and displays each range bar B along the same circle.

Further, as shown in FIG. 13A, the distance between the voice quality indicating masculinity and the voice quality indicating darkness is 5.4, the distance between the voice quality indicating masculinity and the voice quality indicating brightness is 10.2, and the distance between the voice quality indicating masculinity and the voice quality indicating fast-talking is 10.8. Therefore, as shown in FIG. 14A, the voice quality adjustment unit 103a presents the range bars B so that the angle between the masculinity range bar B and the darkness range bar B is the smallest and the angle between the masculinity range bar B and the fast-talking range bar B is the largest.
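One simple way to turn these distances into angles is sketched below. The text only requires that more similar voice qualities have a smaller angle between them; the proportional mapping, the 180-degree maximum, and the function name are assumptions made for this illustration.

```python
def bar_angles(distances, max_angle=180.0):
    """Map each distance from the reference bar to an angle; the farthest bar gets max_angle."""
    farthest = max(distances.values())
    return {name: d / farthest * max_angle for name, d in distances.items()}


# Distances from the masculinity voice quality, as stated for FIG. 13A.
from_masculinity = {"darkness": 5.4, "brightness": 10.2, "fast_talking": 10.8}
angles = bar_angles(from_masculinity)
for name in sorted(angles, key=angles.get):
    print(name, round(angles[name], 1))
```

Under this mapping the darkness bar lies closest to the masculinity bar and the fast-talking bar lies opposite it, which matches the ordering described for FIG. 14A.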

[0108] The voice quality adjustment unit 103a according to the present modification may also use the display method described in the first modification, that is, the function of changing the position of the pointer P based on an instruction from the adjustment control unit 104a.

FIG. 14B is an explanatory diagram for explaining how the voice quality adjusting unit 103a changes the display content.

[0110] For example, when the user moves the pointer P of the voice quality indicating darkness in the direction of increasing scale, the voice quality adjustment unit 103a, based on an instruction from the adjustment control unit 104a, moves the pointer P of the voice quality indicating masculinity in the direction of increasing scale, and moves the pointers P of the voice qualities indicating brightness and fast-talking in the direction of decreasing scale.

[0111] Thus, when one pointer P is moved, the other pointers P appear to move in the same direction. As a result, it is possible to provide a voice quality conversion interface that is intuitive and easy for the user to understand.

(Embodiment 3)

This speech synthesis device acquires text data and performs speech synthesis with various voice qualities, and includes the voice quality conversion device according to the second embodiment, a speech synthesis unit 201, a speech synthesis database 202, a waveform generation unit 203, and a speaker 204.

[0114] The speech synthesis database 202 stores segment data indicating a plurality of speech segments. Upon acquiring text data td1 based on the user's operation, the speech synthesis unit 201 selects the segment data corresponding to the text indicated by the text data td1 from the speech synthesis database 202. The speech synthesis unit 201 then generates a feature parameter sequence p1 using the selected segment data, and outputs the feature parameter sequence p1 to the voice quality conversion device.

[0115] As described above, upon acquiring the feature parameter sequence p1, the voice quality conversion device converts the voice quality of the voice represented by the feature parameter sequence p1, and generates and outputs a deformed feature parameter sequence p2 indicating the result of the conversion.

[0116] Upon acquiring the deformed feature parameter sequence p2 from the voice quality conversion device, the waveform generation unit 203 generates a waveform signal s1 that represents the deformed feature parameter sequence p2 as a speech waveform, and outputs the waveform signal s1 to the speaker 204. The speaker 204 outputs synthesized speech corresponding to the waveform signal s1.

[0117] As described above, the speech synthesis device according to the present embodiment includes the voice quality conversion device according to the second embodiment, so it can output the content of the text data td1 as speech with the voice quality desired by the user without breakdown, further improving usability.

[0118] Note that the speech synthesis device of the present embodiment may include the voice quality conversion device of the first embodiment instead of the voice quality conversion device of the second embodiment.

[0119] (Modification 1)

Here, a modified example regarding the operation of the adjustment control unit of the voice quality conversion device according to the present embodiment will be described.

FIG. 16 is a configuration diagram of a speech synthesizer according to the present modification.

The adjustment control unit 104b of the voice quality conversion device according to the present modification has the same function as the adjustment control unit 104a of the second embodiment, but instead of acquiring the feature parameter sequence p1, it acquires the segment data stored in the speech synthesis database 202.

That is, the adjustment control unit 104b according to the present modification estimates the quality degradation of the synthesized speech based on the segment data of the speech synthesis database 202 instead of the feature parameter sequence p1, and accordingly changes the position of the pointer P or the length of the range bar B of the voice quality adjustment unit 103a. In other words, the adjustment control unit 104b predicts the tendency of the acoustic feature parameters indicated by the feature parameter sequence p1 by using part or all of the segment data stored in the speech synthesis database 202, and changes the position of the pointer P or the length of the range bar B based on the prediction result. For example, the adjustment control unit 104b selects all the segment data one by one from the speech synthesis database 202, determines whether the quality of the synthesized speech would be degraded if that segment data were converted according to the settings of the voice quality adjustment unit 103a, and changes the position of the pointer P based on the result.

[0122] In the speech synthesis device according to the present modification, the processing of the adjustment control unit 104b is the same regardless of what text data td1 is input, as long as the speech synthesis database 202 is not replaced, so the processing can be simplified. However, if the content of the feature parameter sequence p1 differs greatly depending on the content of the text data td1, the quality of the synthesized speech may be degraded for some text data td1.

[0123] Note that the feature parameter sequence p1 in the present modification need not be one generated from the segment data of the speech synthesis database 202 through the speech synthesis processing of the speech synthesis unit 201. In other words, the feature parameter sequence p1 used in the present modification may be one generated by some other method, provided that the voice quality it indicates is sufficiently similar to that of a feature parameter sequence p1 generated in this way.

[0124] (Modification 2)

Here, another modified example of the present embodiment will be described.

FIG. 17 is a configuration diagram of a speech synthesizer according to the present modification.

The speech synthesis device according to the present modification includes a feature table storage unit 205 that holds, as a feature table, only the data necessary for estimating quality degradation of the synthesized speech among the plurality of segment data stored in the speech synthesis database 202.

[0126] Specifically, the feature table held in the feature table storage unit 205 is obtained by extracting, for example, only the upper limit value, the lower limit value, and the average value of the parameter of each acoustic feature from all the segment data stored in the speech synthesis database 202.
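Building such a feature table can be sketched as follows. The representation of segment data as dictionaries of acoustic feature values, the sample numbers, and the function name are all assumptions made for this example; the actual format of the speech synthesis database 202 is not described here.

```python
def build_feature_table(segments):
    """Extract the lower limit, upper limit, and average of each acoustic feature.

    `segments` is a non-empty list of dicts mapping a feature name to its value.
    """
    table = {}
    for name in segments[0]:
        values = [seg[name] for seg in segments]
        table[name] = {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    return table


# Hypothetical segment data with two acoustic features each.
segments = [
    {"F0": 280, "F1": 500},
    {"F0": 300, "F1": 520},
    {"F0": 320, "F1": 510},
]
print(build_feature_table(segments)["F0"])  # {'min': 280, 'max': 320, 'mean': 300.0}
```

Only three numbers per acoustic feature are kept, regardless of how many segments the database holds.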

The adjustment control unit 104c according to the present modification has the same function as the adjustment control unit 104a of the second embodiment, but instead of acquiring the feature parameter sequence p1, it acquires the feature table stored in the feature table storage unit 205.

That is, the adjustment control unit 104c according to the present modification estimates the quality degradation of the synthesized speech based on the feature table of the feature table storage unit 205 instead of the feature parameter sequence p1, and changes the position of the pointer P or the length of the range bar B of the voice quality adjustment unit 103a.

As a result, unlike the adjustment control unit 104b according to the first modification, which uses the large amount of segment data in the speech synthesis database 202, the adjustment control unit 104c according to the present modification uses a feature table containing only a small amount of information, so the position of the pointer P and the length of the range bar B can be changed quickly.

[0130] Note that, as in the first modification, the feature parameter sequence p1 in this modification need not be one generated by the speech synthesis processing of the speech synthesis unit 201. In other words, the feature parameter sequence p1 used in this modification may be one generated by some other method, provided that the voice quality it indicates is sufficiently similar to that of a feature parameter sequence p1 generated in this way.

[0131] (Modification 3)

Here, another modified example of the present embodiment will be described.

FIG. 18 is a configuration diagram of a speech synthesizer according to the present modification.

The speech synthesis device according to the present modification includes a speech synthesis unit 201a instead of speech synthesis unit 201 in the present embodiment. Further, the voice quality conversion device according to the present modification includes a conversion unit 101a and an adjustment control unit 104b instead of the conversion unit 101 and the adjustment control unit 104a.

As described in the first modification, the adjustment control unit 104b changes the position of the pointer P or the length of the range bar B of the voice quality adjustment unit 103a based on the segment data of the speech synthesis database 202.

[0134] The conversion unit 101a converts the acoustic features indicated by the segment data stored in the speech synthesis database 202 in response to an instruction from the voice quality adjustment unit 103a.

[0135] Upon acquiring the text data td1, the speech synthesis unit 201a obtains from the conversion unit 101a the segment data that corresponds to the text indicated by the text data td1 and whose voice quality (acoustic features) has been converted. The speech synthesis unit 201a then generates a deformed feature parameter sequence p2 using the obtained converted segment data, and outputs the deformed feature parameter sequence p2 to the waveform generating unit 203.

The voice synthesizing apparatus according to the present modification may include the feature table storage unit 205 of the second modification, and may include the adjustment control unit 104c of the second modification instead of the adjustment control unit 104b of the voice quality conversion device.

[0137] (Modification 4)

Here, another modified example of the present embodiment will be described.

FIG. 19 is a configuration diagram of a speech synthesis device according to the present modification.

The speech synthesis apparatus according to the present modification includes a speech analysis unit 206 instead of the speech synthesis unit 201 and the speech synthesis database 202.

[0138] The voice analysis unit 206 acquires voice waveform data d1 representing the waveform of a real voice, and generates a feature parameter sequence p1 based on the voice waveform data d1.

The conversion unit 101 and the adjustment control unit 104a of the voice quality conversion device obtain the feature parameter sequence p1 generated as described above from the voice analysis unit 206.

[0140] The voice synthesizer of the present modification converts the voice quality of the voice spoken by the user in real time and outputs it as synthesized voice. With this configuration, voice quality conversion can also be applied to synthesized voice generated from the voice waveform data d1 of a real voice, while preventing quality deterioration, through an interface that is intuitive and easy to operate.

[0141] The voice quality conversion device may include the voice analysis unit 206.

[0142] (Modification 5)

Here, another modified example of the present embodiment will be described.

FIG. 20 is a configuration diagram of a speech synthesizer according to the present modification.

The speech synthesis apparatus according to the present modification includes a speech analysis unit 206 instead of the speech synthesis unit 201 and the speech synthesis database 202, similarly to the speech synthesis apparatus of the fourth modification. Further, the voice quality conversion device according to the present modification includes an adjustment control unit 104d instead of the adjustment control unit 104a.

The adjustment control unit 104d has the same function as the adjustment control unit 104a, but acquires the waveform feature table td2 instead of acquiring the feature parameter sequence p1 as the adjustment control unit 104a does. That is, the adjustment control unit 104d according to the present modification estimates the quality deterioration of the synthesized speech based on the waveform feature table td2 instead of the feature parameter sequence p1, and changes the position of the pointer P of the voice quality adjustment unit 103a or the length of the range bar B.

[0145] The waveform feature table td2 is, for example, a table in which only the data necessary for estimating the quality degradation of the synthesized speech has been extracted from the result of analyzing sample speech previously uttered by the same speaker who uttered the speech waveform data d1. For example, the waveform feature table td2 is obtained by extracting only the upper limit value, the lower limit value, and the average value of the parameters of each acoustic feature in the analysis result of the sample speech.
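As a minimal sketch of how such a table might be built, the following Python code reduces analysis results to the lower limit, upper limit, and average of each acoustic feature. The function name `build_waveform_feature_table`, the feature names `f0_hz` and `speech_rate`, and the dictionary layout are illustrative assumptions and are not taken from the disclosure.

```python
from statistics import mean


def build_waveform_feature_table(analysis):
    """Reduce analysis results to lower/upper/average values per feature.

    `analysis` maps a (hypothetical) acoustic feature name to the list of
    parameter values obtained by analyzing the sample speech.
    """
    table = {}
    for name, values in analysis.items():
        if not values:
            continue  # skip features for which nothing was analyzed
        table[name] = {
            "lower": min(values),
            "upper": max(values),
            "average": mean(values),
        }
    return table


# Hypothetical analysis result of a sample utterance.
sample = {"f0_hz": [110.0, 130.0, 120.0], "speech_rate": [4.0, 6.0]}
td2 = build_waveform_feature_table(sample)
```

Only three numbers per feature are retained, which illustrates why such a table carries far less information than a full feature parameter sequence.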

[0146] The adjustment control unit 104d may acquire a plurality of waveform feature tables td2 and select any one of them. For example, the adjustment control unit 104d selects and uses the waveform feature table td2 that best represents the features of the speech waveform data d1 and the feature parameter sequence p1, based on attributes such as the age and gender of the speaker.
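The selection among several tables could be sketched as follows. The matching rule used here (a large penalty for a gender mismatch, then the closest age) is purely an assumption for illustration; the disclosure states only that attributes such as age and gender are used. The function name `select_feature_table` and the record keys are likewise hypothetical.

```python
def select_feature_table(tables, age, gender):
    """Pick the table whose speaker attributes best match the given speaker.

    `tables` is a list of dicts with hypothetical keys "age", "gender",
    and "table". A gender mismatch is penalized heavily (assumed rule);
    otherwise the entry with the closest age wins.
    """
    def distance(entry):
        gender_penalty = 0 if entry["gender"] == gender else 100
        return gender_penalty + abs(entry["age"] - age)

    return min(tables, key=distance)["table"]


# Hypothetical candidate tables, identified here only by a label.
candidates = [
    {"age": 30, "gender": "female", "table": "td2_a"},
    {"age": 35, "gender": "male", "table": "td2_b"},
    {"age": 60, "gender": "male", "table": "td2_c"},
]
chosen = select_feature_table(candidates, age=40, gender="male")
```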

As described above, in the speech synthesizer of the present modification, using the waveform feature table td2 makes it possible to change the position of the pointer P of the voice quality adjustment unit 103 and the length of the range bar B before the conversion unit 101 acquires the feature parameter sequence p1. In addition, because the adjustment control unit 104d of the present modification uses the waveform feature table td2, which contains a small amount of information, instead of the feature parameter sequence p1, which contains a large amount of information, the position of the pointer P and the length of the range bar B can be changed quickly.
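The following Python sketch illustrates one way a range could be derived from such a table and a pointer kept inside it. It assumes, purely for illustration, that the voice quality parameter acts as a multiplicative scale on a single positive-valued acoustic feature and that breakdown limits are given as a lower and an upper bound; the names `allowed_scale_range`, `clamp_pointer`, `limit_low`, and `limit_high` are hypothetical.

```python
def allowed_scale_range(entry, limit_low, limit_high):
    """Return (min_scale, max_scale) that keeps the scaled feature in limits.

    `entry` holds the "lower" and "upper" values from the feature table.
    `limit_low` and `limit_high` are assumed breakdown limits of the
    acoustic feature. All values are assumed to be positive.
    """
    min_scale = limit_low / entry["lower"]
    max_scale = limit_high / entry["upper"]
    if min_scale > max_scale:
        raise ValueError("no scale keeps the feature within the limits")
    return min_scale, max_scale


def clamp_pointer(requested, scale_range):
    """Move the pointer back inside the allowed range if it lies outside."""
    low, high = scale_range
    return max(low, min(high, requested))


# Hypothetical table entry for fundamental frequency, in Hz.
f0_entry = {"lower": 100.0, "upper": 200.0, "average": 150.0}
bar = allowed_scale_range(f0_entry, limit_low=50.0, limit_high=400.0)
pointer = clamp_pointer(2.5, bar)
```

In this sketch the returned pair plays the role of the range bar B and the clamped value plays the role of the pointer P. Only the table's upper and lower values are needed, so the range can be computed without the full feature parameter sequence.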

Industrial applicability

The voice quality conversion device of the present invention has the effect of improving usability from the viewpoint of the user interface. It is useful, for example, in agent applications and text-to-speech applications that use synthesized speech, in communication devices that use a voice conversion function, and as a voice quality editor device.

Claims

[1] A voice quality conversion device for converting feature data indicating a feature of a voice into converted feature data indicating a voice having a voice quality different from that of the voice, the device comprising:

acquisition means for acquiring the feature data;

presenting means for presenting a range in which voice quality can be converted;

receiving means for receiving a voice quality specified by a user within the range presented by the presenting means;

range changing means for changing, in accordance with the feature data acquired by the acquisition means and the voice quality received by the receiving means, the range presented by the presenting means to an appropriate range in which the voice quality indicated by the converted feature data does not break down; and

converting means for converting the feature data acquired by the acquisition means into converted feature data indicating a voice of the voice quality received by the receiving means.

[2] The voice quality conversion device according to claim 1, wherein:

the presenting means presents, for each of a plurality of types of voice quality, a range in which that voice quality can be converted;

the receiving means receives, as a parameter, a degree of voice quality specified by the user within the range presented by the presenting means for each voice quality;

the range changing means changes the range of another voice quality presented by the presenting means in accordance with the parameter of the voice quality received by the receiving means for conversion; and

the converting means converts the feature data into the converted feature data in accordance with the parameters of the respective voice qualities received by the receiving means.

[3] The voice quality conversion device according to claim 2, wherein:

the presenting means presents the range in which each voice quality can be converted by displaying, for each of the plurality of voice qualities, a figure and a pointer that moves on the figure in response to a user operation; and

the receiving means identifies and receives the parameter specified by the user based on the position of the pointer on the figure.

[4] The voice quality conversion device according to claim 3, wherein the range changing means changes the convertible range by moving the pointer.

[5] The voice quality conversion device according to claim 4, wherein:

the presenting means displays the figure in a bar shape; and

the range changing means changes the convertible range by moving the pointer along the longitudinal direction of the figure.

[6] The voice quality conversion device according to claim 5, wherein the presenting means arranges the figures and pointers for the respective voice qualities in parallel such that the more similar the changes made by two voice qualities are, the narrower the spacing between them.

[7] The voice quality conversion device according to claim 5, wherein the presenting means arranges the figures and pointers for the respective voice qualities along the same circumference such that the more similar the changes made by two voice qualities are, the smaller the angle between them.

[8] The voice quality conversion device according to claim 3, wherein the range changing means changes the convertible range by deforming the figure.

[9] The voice quality conversion device according to claim 8, wherein:

the presenting means displays the figure in a bar shape; and

the range changing means changes the convertible range by expanding or contracting the length of the figure in its longitudinal direction.

[10] The voice quality conversion device according to claim 3, further comprising:

limit storing means for storing limit data indicating a limit of an acoustic feature within which the voice quality does not break down,

wherein the range changing means identifies the appropriate range based on the feature data, the parameter received by the receiving means, and the limit indicated by the limit data, and changes the range presented by the presenting means to the appropriate range.

[11] The voice quality conversion device according to claim 3, wherein the plurality of types of voice quality presented by the presenting means are at least two of a voice quality indicating brightness, a voice quality indicating pitch, a voice quality indicating masculinity, and a voice quality indicating fast talking.

[12] The voice quality conversion device according to claim 11, further comprising:

data generating means for acquiring a voice and generating the feature data indicating the voice.

[13] A voice quality conversion method for converting feature data indicating a feature of a voice into converted feature data indicating a voice having a voice quality different from that of the voice, the method comprising:

an acquisition step of acquiring the feature data;

a presenting step of presenting a convertible range of voice quality;

a receiving step of receiving a voice quality specified by a user within the range presented in the presenting step;

a range changing step of changing, in accordance with the feature data acquired in the acquisition step and the voice quality received in the receiving step, the range presented in the presenting step to an appropriate range in which the voice quality indicated by the converted feature data does not break down, and presenting the changed range; and

a conversion step of converting the feature data acquired in the acquisition step into converted feature data indicating a voice of the voice quality received in the receiving step.

[14] The voice quality conversion method according to claim 13, wherein:

in the presenting step, a range in which the voice quality can be converted is presented for each of a plurality of types of voice quality;

in the receiving step, a degree of voice quality specified by the user within the range presented in the presenting step for each voice quality is received as a parameter;

in the range changing step, the range of another voice quality presented in the presenting step is changed and presented in accordance with the parameter of the voice quality received in the receiving step for conversion; and

in the conversion step, the feature data is converted into the converted feature data in accordance with the parameters of the respective voice qualities received in the receiving step.

[15] The voice quality conversion method according to claim 14, wherein:

in the presenting step, the range in which each voice quality can be converted is presented by displaying, for each of the plurality of voice qualities, a figure and a pointer that moves on the figure in response to a user operation; and

in the receiving step, the parameter specified by the user is identified based on the position of the pointer on the figure and is received.

[16] The voice quality conversion method according to claim 15, wherein, in the range changing step, the convertible range is changed and presented by moving the pointer.

[17] The voice quality conversion method according to claim 15, wherein, in the range changing step, the convertible range is changed and presented by deforming the figure.

[18] A program for converting feature data indicating a feature of a voice into conversion feature data indicating a voice having a voice quality different from the voice,

An acquisition step of acquiring the feature data;

A presentation step of presenting a convertible range of voice quality;

the program causing a computer to execute these steps.

[19] A speech synthesis device for converting a text indicated by text data into synthesized speech, the device comprising:

feature data generating means for acquiring the text data and generating feature data indicating a feature of speech corresponding to the text of the text data;

acquisition means for acquiring the feature data generated by the feature data generating means;

presenting means for presenting a convertible range of voice quality;

range changing means for changing, in accordance with the feature data acquired by the acquisition means and the voice quality received by the receiving means, the range presented by the presenting means to an appropriate range in which the voice quality of the synthesized speech does not break down;

voice output means for generating and outputting the synthesized speech based on the converted feature data converted by the converting means.

[20] A speech synthesis method for converting a text indicated by text data into synthesized speech, the method comprising:

a feature data generating step of acquiring the text data and generating feature data indicating a feature of speech corresponding to the text of the text data;

an acquisition step of acquiring the feature data generated in the feature data generating step;

a presenting step of presenting a convertible range of voice quality;

a range changing step of changing, in accordance with the feature data acquired in the acquisition step and the voice quality received in the receiving step, the range presented in the presenting step to an appropriate range in which the voice quality of the synthesized speech does not break down, and presenting the changed range;

a conversion step of converting the feature data acquired in the acquisition step into converted feature data indicating a voice of the voice quality received in the receiving step;

a voice output step of generating and outputting the synthesized speech based on the converted feature data converted in the conversion step.