WO2014142200A1 - Voice processing device - Google Patents

Voice processing device

Info

Publication number
WO2014142200A1
Authority
WO
WIPO (PCT)
Prior art keywords: singing, expression, data, song, voice
Application number
PCT/JP2014/056570
Other languages: French (fr), Japanese (ja)
Inventors: 隆一 成山, 克己 石川, 松本 秀一
Original Assignee: ヤマハ株式会社 (Yamaha Corporation)
Application filed by ヤマハ株式会社 (Yamaha Corporation)
Priority to CN201480014605.4A (published as CN105051811A)
Priority to KR1020157024316A (published as KR20150118974A)
Publication of WO2014142200A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0091 - Means for obtaining special acoustic effects
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/155 - Musical effects
    • G10H2210/195 - Modulation effects, i.e. smooth non-discontinuous variations over a time interval, e.g. within a note, melody or musical transition, of any sound parameter, e.g. amplitude, pitch, spectral response, playback speed
    • G10H2210/201 - Vibrato, i.e. rapid, repetitive and smooth variation of amplitude, pitch or timbre within a note or chord
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 - Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315 - Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455 - Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis

Definitions

  • The present invention relates to a technique for controlling the singing expression of a singing voice.
  • Patent Document 1 discloses a technique for collecting segment data used for segment-concatenation singing synthesis. A singing voice of arbitrary lyrics can be synthesized by appropriately selecting and interconnecting the segment data collected by the technique of Patent Document 1.
  • In view of the above, an object of the present invention is to generate singing voices with various singing expressions.
  • To achieve this object, the speech processing apparatus includes an expression selection unit that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing unit that applies the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice.
  • The expression selection unit may select first singing expression data and second singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expression indicated by the first singing expression data to a first section of the singing voice and the singing expression indicated by the second singing expression data to a second section of the singing voice different from the first section.
  • The expression selection unit may also select two or more singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expressions indicated by each of the selected singing expression data to the same specific section of the singing voice in an overlapping manner.
  • Since a plurality of singing expressions (typically different types of singing expressions) can thus be applied to the singing voice in an overlapping manner, the effect of generating singing voices with various singing expressions is particularly pronounced.
  • The apparatus may include a storage unit that stores attribute data related to each singing expression in association with the singing expression data of that expression, and the expression selection unit may select singing expression data from the storage unit by referring to the attribute data of each singing expression data. Since attribute data is associated with each singing expression data, the singing expression data of the expression to be applied to the singing voice can be selected (searched for) by referring to the attribute data.
  • The expression selection unit may select singing expression data in accordance with an instruction from the user. Since the singing expression data is then chosen according to the user's instruction, various singing voices reflecting the user's intention and preference can be generated.
  • The expression providing unit may apply the singing expression indicated by the selected singing expression data to a specific section of the singing voice designated by an instruction from the user. Since the expression is applied to a user-designated section, this likewise allows diverse singing voices reflecting the user's intention and preference.
  • Conventionally, a singing voice is evaluated by comparing the transitions of its pitch and volume with those of a reference (exemplary) singing voice prepared in advance. However, the evaluation of an actual singing performance depends not only on the accuracy of pitch and volume but also on the skill of the singing expression.
  • Accordingly, the speech processing apparatus of the present invention may include a singing evaluation unit that evaluates the singing voice according to an evaluation value that indicates the evaluation of a singing expression and that corresponds to the singing expression data, among the plurality of singing expression data, whose singing expression is similar to the singing voice. Since the singing voice is evaluated according to the evaluation value corresponding to similar singing expression data, the singing voice can be evaluated appropriately from the viewpoint of skill in singing expression.
  • The singing evaluation unit may select, for each of a plurality of target sections of the singing voice, the singing expression data whose singing expression is similar to that of the target section, and evaluate the singing voice according to the evaluation values corresponding to the selected singing expression data. In this way, specific target sections of the singing voice can be evaluated with priority; the target section can also be the entire section of the audio signal (the entire piece of music).
  • The speech processing apparatus may include a storage unit that stores, for a plurality of different singing expressions, singing expression data indicating the expression and an evaluation value indicating its evaluation, and the singing evaluation unit may evaluate the singing voice according to the evaluation value, stored in the storage unit, that corresponds to the singing expression data similar to the singing voice. The singing voice is thus evaluated from the viewpoint of whether it resembles the singing expressions registered in the storage unit.
  • The present invention is also implemented as a speech processing method that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions and applies the singing expression indicated by the selected singing expression data to a specific section of the singing voice.
  • The audio processing device may be realized by dedicated hardware (an electronic circuit such as a DSP (Digital Signal Processor) dedicated to singing voice processing), or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program.
  • The program according to the first aspect of the present invention causes a computer to execute an expression selection process for selecting singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing process for applying the singing expression indicated by the selected singing expression data to a specific section of the singing voice.
  • The program according to the second aspect of the present invention causes a computer, comprising a storage unit that stores singing expression data and evaluation values for a plurality of different singing expressions, to execute a singing evaluation process that evaluates the singing voice according to the evaluation value corresponding to the singing expression data whose singing expression is similar to the singing voice.
  • The program according to each of the above aspects can be provided in a form stored on a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known recording medium such as a semiconductor recording medium or a magnetic recording medium may also be used. The program of the present invention can also be distributed via a communication network and installed in a computer.
  • FIG. 1 is a block diagram of a speech processing apparatus according to a first embodiment of the present invention. FIG. 2 is a functional block diagram of the elements related to the expression registration process. FIG. 3 is a block diagram of the singing division unit. FIG. 4 is a flowchart of the expression registration process. FIG. 5 is a functional block diagram of the elements related to the expression providing process. FIG. 6 is a flowchart of the expression providing process. FIG. 7 is an explanatory diagram of a specific example (application of vibrato) of the expression providing process. FIGS. 8 and 9 are explanatory diagrams of the expression providing process. FIG. 10 is a functional block diagram of the elements related to the singing evaluation process of the second embodiment. FIG. 11 is a flowchart of the singing evaluation process. FIG. 12 is a block diagram of an audio processing apparatus according to a modification.
  • FIG. 1 is a block diagram of a speech processing apparatus 100 according to the first embodiment of the present invention.
  • the sound processing device 100 is realized by a computer system including an arithmetic processing device 10, a storage device 12, a sound collecting device 14, an input device 16, and a sound emitting device 18.
  • the arithmetic processing device 10 controls each element of the speech processing device 100 by executing a program stored in the storage device 12.
  • the storage device 12 stores a program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10.
  • a known recording medium such as a semiconductor recording medium or a magnetic recording medium or a combination of a plurality of types of recording media is arbitrarily employed as the storage device 12.
  • A configuration may also be adopted in which the storage device 12 is installed in an external device (for example, an external server device) separate from the speech processing device 100, and the speech processing device 100 writes and reads information to and from the storage device 12 via a communication network such as the Internet. That is, the storage device 12 is not an essential element of the voice processing device 100.
  • The storage device 12 of the first embodiment stores a plurality of audio signals X indicating the time waveforms of different singing voices (for example, singing voices of different singers). Each audio signal X is prepared in advance by recording a singing voice singing a piece of music. The storage device 12 also stores a plurality of singing expression data DS indicating different singing expressions and a plurality of attribute data DA related to the singing expression indicated by each singing expression data DS.
  • A singing expression is a characteristic manner of singing (a way of singing peculiar to the singer). Singing expression data DS are stored in the storage device 12 for a plurality of types of singing expressions extracted from the singing voices of different singers, and attribute data DA is associated with each of the plurality of singing expression data DS.
  • The singing expression data DS specifies various feature quantities related to the musical expression of the singing voice, for example the pitch or volume (or their distribution range), features of the frequency spectrum (for example, the spectrum within a specific band), the frequency or intensity of a formant of a specific order, the intensity ratio between the harmonic components and the fundamental component, the intensity ratio between harmonic and non-harmonic components, or MFCCs (Mel-Frequency Cepstrum Coefficients).
  • The singing expressions exemplified above are tendencies of the singing voice over relatively short times, but a configuration in which the singing expression data DS specifies longer-term tendencies of the singing voice, such as the temporal variation of pitch or volume and various singing techniques (for example, vibrato, fall, and long tone), is also suitable.
  • The attribute data DA of each singing expression is information (metadata) related to the singer and the music of the singing voice, and is used for searching the singing expression data DS. Specifically, the attribute data DA specifies information on the singer of each singing expression (for example, name, age, birthplace, gender, race, native language, and vocal range) and information on the music sung with each singing expression (for example, song title, composer, lyricist, genre, tempo, key, chords, range, and language). The attribute data DA can also specify words expressing the impression or atmosphere of the singing voice (for example, words such as “rhythmic” and “sweet”).
  • The attribute data DA of the first embodiment further includes an evaluation value Q (an index of the skill of the singing expression of the singing expression data DS) corresponding to the result of evaluating the singing voice sung with each singing expression. The evaluation value Q may be calculated by a known singing evaluation process, or may reflect evaluations by users other than the singer.
  • The items specified by the attribute data DA are not limited to the above examples. For example, the attribute data DA can specify in which section of the musical structure into which the music is divided (for example, phrases such as the A melody, chorus, and B melody) each singing expression was sung.
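  • For concreteness, the stored data model described above can be sketched as follows. This is a minimal illustration in Python; the record layout and field names (pitch_curve, singer_gender, evaluation_q, and so on) are assumptions made for the sketch, not names taken from this publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SingingExpressionData:
    """DS: feature quantities characterizing one singing expression."""
    pitch_curve: List[float]    # pitch trajectory within the unit section
    volume_curve: List[float]   # volume trajectory within the unit section
    mfcc: List[List[float]]     # MFCC frames (timbre-related features)
    formant_freqs: List[float]  # frequencies of low-order formants

@dataclass
class AttributeData:
    """DA: metadata used to search for DS, including the evaluation value Q."""
    singer_name: Optional[str] = None
    singer_gender: Optional[str] = None
    song_title: Optional[str] = None
    genre: Optional[str] = None
    impression_words: List[str] = field(default_factory=list)  # e.g. "sweet"
    section_label: Optional[str] = None  # e.g. "chorus", "A melody"
    evaluation_q: float = 0.0            # skill evaluation value Q

@dataclass
class RegisteredExpression:
    """One stored entry: DS and DA associated for a common unit section."""
    ds: SingingExpressionData
    da: AttributeData
```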
  • the sound collection device 14 is a device (microphone) that collects ambient sounds.
  • The sound collection device 14 of the first embodiment generates an audio signal R by collecting the singing voice of a singer singing a piece of music.
  • An A / D converter that converts the audio signal R from analog to digital is not shown for convenience.
  • A configuration in which the audio signal R is stored in the storage device 12 (in which case the sound collection device 14 can be omitted) is also suitable.
  • the input device 16 is an operation device that receives an instruction from the user to the voice processing device 100, and includes, for example, a plurality of operators that can be operated by the user.
  • an operation panel installed in the casing of the voice processing device 100 or a remote control device separate from the voice processing device 100 is employed as the input device 16.
  • The arithmetic processing device 10 executes various control and arithmetic processes by executing the programs stored in the storage device 12. Specifically, the arithmetic processing device 10 executes a process of extracting singing expression data DS by analyzing the audio signal R supplied from the sound collection device 14 and storing it in the storage device 12 (hereinafter, "expression registration process"), and a process of generating an audio signal Y by applying the singing expression indicated by each singing expression data DS accumulated in the expression registration process to an audio signal X in the storage device 12 (hereinafter, "expression providing process").
  • the audio signal Y is an acoustic signal in which the singing expression of the audio signal X matches or resembles the singing expression of the singing expression data DS while maintaining the pronunciation content (lyrics) of the audio signal X.
  • one of the expression registration process and the expression providing process is selectively executed in accordance with an instruction from the user to the input device 16.
  • the sound emitting device 18 (for example, a speaker or a headphone) in FIG. 1 reproduces sound corresponding to the audio signal Y generated by the arithmetic processing device 10 in the expression providing process.
  • a D / A converter that converts the audio signal Y from digital to analog and an amplifier that amplifies the audio signal Y are omitted for the sake of convenience.
  • FIG. 2 is a functional configuration diagram of elements related to the expression registration process in the voice processing apparatus 100.
  • By executing a program (expression registration program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression registration process shown in FIG. 2 (an analysis processing unit 20, a singing division unit 22, a singing evaluation unit 24, a singing analysis unit 26, and an attribute acquisition unit 28).
  • a configuration in which each function of FIG. 2 is distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, DSP) realizes a part of the functions illustrated in FIG. 2 may be employed.
  • the analysis processing unit 20 in FIG. 2 analyzes the audio signal R supplied from the sound collection device 14.
  • the analysis processing unit 20 includes a music structure analysis unit 20A, a singing technique analysis unit 20B, and a voice quality analysis unit 20C.
  • the music structure analysis unit 20A analyzes a section (for example, each phrase such as A melody, chorus, and B melody) on the music structure of the music corresponding to the audio signal R.
  • The singing technique analysis unit 20B detects various singing techniques such as vibrato (finely oscillating the pitch), scooping (approaching the target pitch from a lower pitch), and fall (reaching the target pitch from a pitch exceeding it).
  • the voice quality analysis unit 20C analyzes the voice quality of the singing voice (for example, the intensity ratio between the harmonic component and the fundamental component and the intensity ratio between the harmonic component and the non-harmonic component).
  • The singing division unit 22 of the first embodiment defines each unit section of the audio signal R according to the music structure, the singing techniques, and the voice quality. Specifically, the singing division unit 22 divides the audio signal R into unit sections delimited by the end points of the sections of the music structure analyzed by the music structure analysis unit 20A, the end points of the sections in which the singing technique analysis unit 20B detects the various singing techniques, and the time points at which the voice quality analyzed by the voice quality analysis unit 20C fluctuates. Note that the method of dividing the audio signal R into unit sections is not limited to these examples.
  • For example, the audio signal R can be divided using sections designated by the user through operation of the input device 16 as unit sections. A configuration in which the audio signal R is divided into unit sections set at random on the time axis, or divided according to the evaluation value Q calculated by the singing evaluation unit 24 (for example, a configuration in which unit-section boundaries are defined at time points where the evaluation value Q fluctuates), may also be adopted. A sketch of pooling such boundary candidates into unit sections follows.
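  • A minimal sketch of the division step described above, assuming each analysis simply reports boundary time points in seconds (the function name and argument layout are illustrative):

```python
def divide_into_unit_sections(duration: float,
                              structure_bounds: list,
                              technique_bounds: list,
                              voice_quality_bounds: list) -> list:
    """Return (start, end) unit sections of an audio signal R of the given
    duration, cut at every boundary reported by the three analyses."""
    bounds = {0.0, duration}
    bounds.update(structure_bounds)      # ends of A melody, chorus, ...
    bounds.update(technique_bounds)      # ends of detected vibrato, scoop, fall
    bounds.update(voice_quality_bounds)  # points where voice quality changes
    ordered = sorted(b for b in bounds if 0.0 <= b <= duration)
    return list(zip(ordered[:-1], ordered[1:]))

# e.g. divide_into_unit_sections(10.0, [4.0], [2.5], [7.0])
#  -> [(0.0, 2.5), (2.5, 4.0), (4.0, 7.0), (7.0, 10.0)]
```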
  • The singing evaluation unit 24 evaluates the skill of the singing indicated by the audio signal R supplied from the sound collection device 14. Specifically, the singing evaluation unit 24 sequentially calculates an evaluation value Q assessing the skill of the singing of the audio signal R for each unit section defined by the singing division unit 22. Any known singing evaluation process may be employed for calculating the evaluation value Q. Note that the singing techniques analyzed by the singing technique analysis unit 20B and the voice quality analyzed by the voice quality analysis unit 20C can also be used in the evaluation by the singing evaluation unit 24.
  • The singing analysis unit 26 of FIG. 2 generates the singing expression data DS for each unit section by analyzing the audio signal R. Specifically, the singing analysis unit 26 extracts acoustic feature quantities that affect the singing expression, such as pitch and volume, from the audio signal R, and generates singing expression data DS indicating the short-term or long-term tendency of each feature quantity (that is, the singing expression). A known acoustic analysis technique (for example, the techniques disclosed in Japanese Patent Application Laid-Open Nos. 2011-013454 and 2011-028230) may be arbitrarily employed for extracting the singing expression.
  • Typically, one singing expression data DS is generated for each unit section, but one singing expression data DS can also be generated from the feature quantities of a plurality of different unit sections. For example, a configuration in which the singing expression data DS is generated by averaging the feature quantities of a plurality of unit sections whose attribute data DA match or are similar, or a configuration in which the singing expression data DS is generated by weighted addition of the feature quantities over a plurality of unit sections using weight values corresponding to the evaluation value Q assigned to each unit section by the singing evaluation unit 24, may be adopted.
  • The attribute acquisition unit 28 generates attribute data DA for each unit section defined by the singing division unit 22. Specifically, the attribute acquisition unit 28 registers in the attribute data DA the various types of information that the user designates by operating the input device 16, and includes in the attribute data DA of each unit section the evaluation value Q calculated for that section by the singing evaluation unit 24 (for example, the average of the evaluation values within the unit section). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 for a common unit section are stored in the storage device 12 in association with each other.
  • The expression registration process exemplified above is repeated for the audio signals R of a plurality of different singing voices, so that singing expression data DS and attribute data DA are accumulated in the storage device 12 for each of the many types of singing expressions extracted from the singing voices of the various singers. That is, a database of diverse singing expressions (expressions of different singers and of different types) is constructed in the storage device 12.
  • FIG. 4 is a flowchart of the expression registration process.
  • When the expression registration process starts, the analysis processing unit 20 analyzes the audio signal R supplied from the sound collection device 14 (SA2). The singing division unit 22 divides the audio signal R into unit sections according to the result of the analysis by the analysis processing unit 20 (SA3), and the singing analysis unit 26 analyzes the audio signal R to generate singing expression data DS for each unit section (SA4). The singing evaluation unit 24 calculates, for each unit section, an evaluation value Q corresponding to the skill of the singing indicated by the audio signal R (SA5), and the attribute acquisition unit 28 generates, for each unit section, attribute data DA including the evaluation value Q calculated by the singing evaluation unit 24 (SA6). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 are stored in the storage device 12 for each unit section (SA7).
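  • Read as pseudocode, steps SA2 to SA7 amount to the loop below. This is only a sketch: the analysis, division, feature-extraction, and evaluation steps are injected as callables because the publication does not fix their implementations, and all names are illustrative.

```python
from typing import Callable, List, Tuple

Section = Tuple[float, float]

def expression_registration(
    signal_r,
    analyze: Callable,                               # SA2: structure / technique / voice quality
    divide: Callable[[object], List[Section]],       # SA3: unit sections from the analysis
    extract_ds: Callable[[object, Section], dict],   # SA4: singing expression data DS
    evaluate_q: Callable[[object, Section], float],  # SA5: evaluation value Q
    user_info: dict,                                 # user-designated metadata
    storage: list,                                   # stands in for the storage device 12
) -> None:
    """Steps SA2-SA7 of the expression registration process."""
    analysis = analyze(signal_r)                                           # SA2
    for section in divide(analysis):                                       # SA3
        ds = extract_ds(signal_r, section)                                 # SA4
        da = {**user_info, "evaluation_q": evaluate_q(signal_r, section)}  # SA5-SA6
        storage.append({"section": section, "ds": ds, "da": da})           # SA7
```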
  • The singing expressions specified by the singing expression data DS accumulated in the storage device 12 by the expression registration process described above are applied to the audio signal X by the expression providing process described below.
  • FIG. 5 is a functional configuration diagram of elements related to the expression providing process in the audio processing apparatus 100.
  • By executing a program (expression providing program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression providing process shown in FIG. 5 (a singing selection unit 32, a section designation unit 34, an expression selection unit 36, and an expression providing unit 38).
  • a configuration in which each function of FIG. 5 is distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, DSP) executes a part of the functions illustrated in FIG. 5 may be employed.
  • the singing selection unit 32 selects one of the plurality of audio signals X stored in the storage device 12 (hereinafter referred to as “selected audio signal X”). For example, the singing selection unit 32 selects the selected sound signal X from the plurality of sound signals X in the storage device 12 in accordance with an instruction (selection instruction for the sound signal X) from the user to the input device 16.
  • The section designation unit 34 designates one or more sections (hereinafter, "target sections") of the selected audio signal X chosen by the singing selection unit 32 to which the singing expression of singing expression data DS is to be applied.
  • the section specifying unit 34 specifies each target section in accordance with an instruction from the user to the input device 16.
  • the section specifying unit 34 defines a section between two points specified by the user on the time axis (for example, on the waveform of the selected audio signal X) by operating the input device 16 as a target section.
  • a plurality of target sections specified by the section specifying unit 34 may overlap each other on the time axis. Note that it is also possible to specify the entire section (the entire music piece) of the selected audio signal X as the target section.
  • The expression selection unit 36 of FIG. 5 selects the singing expression data DS actually applied in the expression providing process (hereinafter, "target expression data DS") from among the plurality of singing expression data DS stored in the storage device 12, sequentially for each of the target sections designated by the section designation unit 34. The expression selection unit 36 of the first embodiment selects the target expression data DS from the plurality of singing expression data DS by a search process using the attribute data DA stored in the storage device 12 in association with each singing expression data DS. The user can designate a search condition (for example, a search term) for the target expression data DS for each target section by operating the input device 16 as appropriate.
  • The expression selection unit 36 selects, as the target expression data DS for each target section, the singing expression data DS whose attribute data DA match the search condition designated by the user from among the plurality of singing expression data DS in the storage device 12. For example, when a search condition on the singer (for example, age or gender) is designated, the target expression data DS corresponding to the attribute data DA of a matching singer (that is, the singing expression of a singer matching the condition) is selected; when a search condition on the music is designated, the target expression data DS corresponding to matching music attribute data DA (that is, a singing expression of a matching song) is selected; and when a search condition on the evaluation value Q (for example, a numerical range) is designated, the target expression data DS corresponding to attribute data DA whose evaluation value Q matches the condition (that is, a singing expression of the skill level intended by the user) is selected.
  • As understood from the above, the expression selection unit 36 of the first embodiment is comprehensively expressed as an element that selects the singing expression data DS (target expression data DS) in accordance with an instruction from the user.
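  • The search process can be pictured as a simple filter over the stored attribute data DA. A minimal sketch, assuming DA records are dictionaries (as in the earlier sketch) and the search condition is a mapping of required attribute values plus an optional numeric range for the evaluation value Q; the names are illustrative:

```python
def select_target_expression(entries, condition, q_range=None):
    """Return the stored entries whose attribute data DA match the user's
    search condition; `entries` is a list of {"ds": ..., "da": ...}."""
    matches = []
    for entry in entries:
        da = entry["da"]
        if any(da.get(key) != value for key, value in condition.items()):
            continue                                 # some attribute differs
        if q_range is not None:
            q = da.get("evaluation_q", 0.0)
            if not (q_range[0] <= q <= q_range[1]):
                continue                             # evaluation value Q out of range
        matches.append(entry)
    return matches

# e.g. select_target_expression(db, {"singer_gender": "female", "genre": "pop"},
#                               q_range=(80, 100))
```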
  • The expression providing unit 38 generates the audio signal Y by applying, to each of the target sections designated by the section designation unit 34 in the selected audio signal X, the singing expression of the target expression data DS selected by the expression selection unit 36 for that section. That is, a singing expression corresponding to an instruction from the user (the designation of a search condition) is applied to each target section. A known technique may be arbitrarily employed for applying the singing expression of the target expression data DS to the selected audio signal X. In addition to a configuration that replaces the singing expression of the selected audio signal X with the singing expression of the target expression data DS (a configuration in which the original expression of the selected audio signal X does not remain in the audio signal Y), a configuration in which the singing expression of the target expression data DS is applied cumulatively to the singing expression of the selected audio signal X (for example, a configuration in which both expressions are reflected in the audio signal Y) may also be adopted.
  • FIG. 6 is a flowchart of the expression providing process.
  • The singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SB2), and the section designation unit 34 designates one or more target sections of the selected audio signal X (SB3). The expression selection unit 36 selects the target expression data DS from the plurality of singing expression data DS stored in the storage device 12 (SB4), and the expression providing unit 38 generates the audio signal Y by applying the singing expression of the target expression data DS to each target section of the selected audio signal X (SB5). The audio signal Y generated by the expression providing unit 38 is reproduced by the sound emitting device 18 (SB6).
  • FIG. 7 is an explanatory diagram of a specific example of the expression providing process to which the singing expression data DS indicating vibrato is applied.
  • FIG. 7 illustrates the temporal change of the pitch of the selected audio signal X and of a plurality of singing expression data DS (DS[1] to DS[4]). Each singing expression data DS was generated by the expression registration process from an audio signal R containing the singing voice of a different singer, so the vibrato represented by each singing expression data DS (DS[1] to DS[4]) differs in characteristics such as the period (speed) and width (depth) of the pitch fluctuation. When a target section of the selected audio signal X is designated according to an instruction from the user (SB3) and, for example, the target expression data DS[3] is selected from the plurality of singing expression data DS according to an instruction from the user (SB4), the expression providing process generates the audio signal Y in which the vibrato indicated by the target expression data DS[3] is imparted to the target section of the selected audio signal X (SB5). In this way, even for the audio signal X of a singing voice sung without vibrato (for example, the singing voice of a singer who is not good at vibrato), the vibrato of the desired target expression data DS can be imparted to the desired target section.
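  • As a numerical illustration of imparting the vibrato of target expression data DS to a target section, the sketch below superimposes a sinusoidal modulation, whose rate and depth would be taken from the DS, onto a frame-based pitch curve. The publication leaves the actual signal-processing technique open ("a known technique is arbitrarily employed"), so this is just one plausible realization and all names are assumptions.

```python
import math

def apply_vibrato(pitch_curve, start, end, rate_hz, depth_cents, frame_rate=100.0):
    """Impart vibrato to pitch_curve (in cents) over start..end seconds by
    adding a sinusoid with the DS's fluctuation rate and depth."""
    out = list(pitch_curve)
    i0, i1 = int(start * frame_rate), int(end * frame_rate)
    for i in range(i0, min(i1, len(out))):
        t = (i - i0) / frame_rate
        out[i] += depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
    return out

# e.g. impart a 5.5 Hz, +/-50 cent vibrato (as DS[3] might specify) to 1.0-2.0 s:
# y_pitch = apply_vibrato(x_pitch, 1.0, 2.0, rate_hz=5.5, depth_cents=50.0)
```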
  • The manner in which the user selects the target expression data DS from the plurality of singing expression data DS is arbitrary. For example, a configuration is preferable in which a predetermined singing voice to which the singing expression of each singing expression data DS has been applied is reproduced from the sound emitting device 18 and auditioned by the user, who then selects the target expression data DS by operating the input device 16 (for example, a button or a touch panel).
  • Suppose the expression selection unit 36 selects target expression data DS1 for a target section S1 of the selected audio signal X and target expression data DS2 for a target section S2 different from S1. The expression providing unit 38 applies the singing expression E1 indicated by the target expression data DS1 to the target section S1 and the singing expression E2 indicated by the target expression data DS2 to the target section S2. When the target section S1 and the target section S2 overlap (for example, when S2 is included in S1), the overlapping section (that is, the target section S2) receives both the singing expression E1 of the target expression data DS1 and the singing expression E2 of the target expression data DS2. That is, a plurality of (typically different types of) singing expressions are applied to a specific section of the selected audio signal X in an overlapping manner; for example, both a singing expression E1 related to pitch fluctuation and a singing expression E2 related to volume fluctuation are applied to the target section S2, as sketched below.
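  • The overlapping application can be sketched as two independent modulations: a pitch expression E1 over section S1 and a volume expression E2 over section S2, with frames in the overlap receiving both. A constant offset and gain stand in for the actual expression curves; all names are illustrative.

```python
def apply_expressions(pitch, volume, s1, s2, e1_pitch_offset, e2_gain, frame_rate=100.0):
    """Apply pitch expression E1 over section s1 and volume expression E2
    over section s2; frames in the overlap of s1 and s2 receive both."""
    pitch, volume = list(pitch), list(volume)
    a1, b1 = int(s1[0] * frame_rate), int(s1[1] * frame_rate)
    a2, b2 = int(s2[0] * frame_rate), int(s2[1] * frame_rate)
    for i in range(a1, min(b1, len(pitch))):
        pitch[i] += e1_pitch_offset   # E1: pitch-fluctuation expression
    for i in range(a2, min(b2, len(volume))):
        volume[i] *= e2_gain          # E2: volume-fluctuation expression
    return pitch, volume
```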
  • the sound signal Y generated by the above processing is supplied to the sound emitting device 18 and reproduced as sound.
  • As described above, each singing expression of the plurality of singing expression data DS indicating different singing expressions is selectively applied to the target sections of the selected audio signal X, so singing voices (the audio signal Y) with more varied singing expressions can be generated than with the technique of Patent Document 1. In the first embodiment, since the target section to which a singing expression is applied can be any of a plurality of sections of the selected audio signal X, this effect is particularly pronounced. Moreover, since a plurality of (plural types of) singing expressions can be applied redundantly to a target section of the selected audio signal X (FIG. 9), the effect is also particularly pronounced compared with a configuration in which the singing expression applied to a target section is limited to one type. However, a configuration in which the target section is limited to one section of the selected audio signal X and a configuration in which the singing expression applied to a target section is limited to one type are also included within the scope of the present invention. Furthermore, since the target section of the selected audio signal X and the search condition for the attribute data DA are both designated according to instructions from the user, various singing voices reflecting the user's intention and preference can be generated.
  • Second Embodiment: A second embodiment of the present invention will now be described. In the first embodiment, the plurality of singing expression data DS stored in the storage device 12 are used to adjust the singing expression of the audio signal X; in the second embodiment, they are used to evaluate the audio signal X. Elements whose operation and function are the same as in the first embodiment are denoted by the reference numerals used in the description of the first embodiment, and their detailed description is omitted as appropriate.
  • FIG. 10 is a functional configuration diagram of elements related to a process of evaluating the audio signal X (hereinafter referred to as “singing evaluation process”) in the audio processing apparatus 100 of the second embodiment.
  • the storage device 12 of the second embodiment stores a plurality of sets of song expression data DS and attribute data DA generated by the same expression registration process as that of the first embodiment.
  • As described for the first embodiment, the attribute data DA corresponding to each singing expression data DS includes the evaluation value Q (an index of the skill of the singing expression of the singing expression data DS) calculated by the singing evaluation unit 24 of FIG. 2.
  • By executing a program (singing evaluation program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the singing evaluation process shown in FIG. 10 (a singing selection unit 42, a section designation unit 44, and a singing evaluation unit 46). In the second embodiment, the expression providing process of the first embodiment and the singing evaluation process detailed below are executed selectively; the expression providing process may also be omitted. Note that a configuration in which the functions of FIG. 10 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 10, may also be adopted.
  • The singing selection unit 42 of FIG. 10 selects the selected audio signal X to be evaluated from among the plurality of audio signals X stored in the storage device 12. Like the singing selection unit 32 of the first embodiment, the singing selection unit 42 selects the selected audio signal X from the storage device 12 in accordance with an instruction from the user to the input device 16. The section designation unit 44 designates one or more target sections to be evaluated within the selected audio signal X chosen by the singing selection unit 42. Like the section designation unit 34 of the first embodiment, the section designation unit 44 designates each target section in accordance with an instruction from the user to the input device 16. It is also possible to designate the entire section of the selected audio signal X as the target section.
  • The singing evaluation unit 46 of FIG. 10 evaluates the skill of the singing of the selected audio signal X chosen by the singing selection unit 42, using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12. That is, the singing evaluation unit 46 calculates an evaluation value Z of the selected audio signal X according to the evaluation values Q in the attribute data DA corresponding to the singing expression data DS, among the plurality in the storage device 12, whose singing expressions are similar to the target sections of the selected audio signal X. The specific operation of the singing evaluation unit 46 is as follows.
  • First, for each target section of the selected audio signal X, the singing evaluation unit 46 calculates, for each of the plurality of singing expression data DS in the storage device 12, the similarity (correlation or distance) between the singing expression indicated by that singing expression data DS and the singing expression of the target section, and sequentially selects, for each target section, the singing expression data DS with the maximum similarity. A known technique for comparing feature quantities may be arbitrarily employed for calculating the similarity of singing expressions. The singing evaluation unit 46 then calculates the evaluation value Z of the selected audio signal X by weighted addition (or averaging) over the target sections of the evaluation values Q of the attribute data DA corresponding to the singing expression data DS selected for each target section. As understood from the above, the evaluation value Z of the selected audio signal X becomes larger the more the signal contains target sections sung with singing expressions similar to highly evaluated (high-Q) singing expressions.
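  • Put concretely, the evaluation described above is a nearest-neighbour lookup per target section followed by a weighted average. A minimal sketch, assuming a similarity function over feature vectors (for example a correlation measure) and optional per-section weights; none of these names come from the publication.

```python
def evaluate_singing(target_features, db, similarity, weights=None):
    """For each target section's feature vector, find the stored singing
    expression data DS with maximum similarity and take its evaluation
    value Q; the overall score Z is the weighted average of those Q."""
    weights = weights or [1.0] * len(target_features)
    total, norm = 0.0, 0.0
    for feats, w in zip(target_features, weights):
        best = max(db, key=lambda entry: similarity(feats, entry["ds"]))
        total += w * best["da"]["evaluation_q"]
        norm += w
    return total / norm if norm else 0.0
```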
  • the evaluation value Z calculated by the singing evaluation unit 46 is notified to the user by, for example, image display by a display device (not shown) or sound reproduction by the sound emitting device 18.
  • FIG. 11 is a flowchart of the song evaluation process.
  • The singing selection unit 42 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SC2), and the section designation unit 44 designates one or more target sections of the selected audio signal X (SC3). The singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X using the singing expression data DS and attribute data DA stored in the storage device 12 (SC4).
  • the evaluation value Z calculated by the singing evaluation unit 46 is notified to the user (SC5).
  • As described above, in the second embodiment the evaluation value Z of the selected audio signal X is calculated according to the evaluation values Q of the singing expression data DS whose singing expressions are similar to those of the selected audio signal X, so the selected audio signal X can be evaluated appropriately from the viewpoint of skill in singing expression (similarity to the singing expressions registered in the expression registration process). Note that in the second embodiment, information other than the evaluation value Q in the attribute data DA can be omitted; that is, the storage device 12 only needs to store the evaluation value Q in association with each singing expression data DS.
  • The targets of the expression providing process of the first embodiment and the singing evaluation process of the second embodiment are not limited to an audio signal X recorded in advance and stored in the storage device 12. For example, an audio signal X generated by the sound collection device 14, an audio signal X reproduced from a portable or built-in recording medium (for example, a CD), or an audio signal X received from another communication terminal via a communication network (for example, a streaming audio signal) can also be used as the target of the expression providing process or the singing evaluation process.
  • A configuration in which the expression providing process or the singing evaluation process is performed on an audio signal X generated by a known singing synthesis process (for example, a segment-concatenation singing synthesis process) may also be adopted.
  • In the embodiments above, the expression providing process and the singing evaluation process are performed on a recorded audio signal X; however, if each target section on the time axis is designated in advance, it is also possible to execute the expression providing process and the singing evaluation process in real time, in parallel with the supply of the audio signal X.
  • In the embodiments above, one of the plurality of audio signals X is selected as the selected audio signal X, but the selection of the audio signal X (the singing selection unit 32 or the singing selection unit 42) may be omitted; likewise, in a configuration where the target section is determined in advance, the section designation unit 34 can be omitted. The speech processing apparatus that executes the expression providing process is therefore comprehensively expressed as an apparatus comprising the expression selection unit 36, which selects the singing expression data DS to be applied from the plurality of singing expression data DS, and the expression providing unit 38, which applies the singing expression indicated by the singing expression data DS selected by the expression selection unit 36 to a specific section of the singing voice (audio signal X).
  • the target of the expression registration process is not limited to the audio signal R generated by the sound collection device 14.
  • an audio signal R reproduced from a portable or built-in recording medium, or an audio signal R received from another communication terminal via a communication network can be used as an expression registration process target. It is also possible to execute the expression registration process in real time in parallel with the supply of the audio signal R.
  • In the embodiments above, the expression providing process of the first embodiment and the singing evaluation process of the second embodiment are performed on the audio signal of a recorded singing voice, but these processes can also be performed on singing voices in other representations.
  • the expression format of the singing voice to be processed is arbitrary.
  • For example, the singing voice can be expressed by synthesis information (for example, a VSQ-format file) in which the pitch and the pronounced characters (lyrics) are specified in time series for each note. In that case, the expression providing unit 38 of the first embodiment applies the singing expression by the same expression providing process as in the first embodiment while sequentially synthesizing the singing voice specified by the synthesis information, for example by a segment-concatenation speech synthesis process, and the singing evaluation unit 46 of the second embodiment executes the same singing evaluation process as in the second embodiment while synthesizing the singing voice specified by the synthesis information.
  • In the embodiments above, one target expression data DS is selected for each target section, but the expression selection unit 36 can also select a plurality of (typically plural types of) target expression data DS for one target section. In that case, the singing expressions of the plurality of target expression data DS selected by the expression selection unit 36 are applied to the one target section of the selected audio signal X in an overlapping manner. It is also possible to apply to the target section the singing expression of one singing expression data DS obtained by integrating the plurality of target expression data DS selected for that section (for example, singing expression data DS obtained by weighted addition of the plurality of target expression data DS).
  • In the embodiments above, the singing expression data DS is selected according to an instruction from the user (the designation of a search condition), but the method by which the expression selection unit 36 selects the singing expression data DS is arbitrary. For example, a configuration is possible in which the singing voice of the singing expression indicated by each singing expression data DS is reproduced from the sound emitting device 18 for the user to audition, and the expression selection unit 36 selects the singing expression data DS designated by the user in consideration of the audition result. A configuration in which the expression selection unit 36 selects singing expression data DS from the storage device 12 at random, or according to a predetermined rule selected in advance, may also be adopted.
  • the audio signal Y generated by the expression providing unit 38 is supplied to the sound emitting device 18 and reproduced, but the method of outputting the audio signal Y is arbitrary.
  • For example, a configuration in which the audio signal Y generated by the expression providing unit 38 is stored on a specific recording medium (for example, the storage device 12 or a portable recording medium), or a configuration in which the audio signal Y is transmitted from a communication device to another communication terminal, may also be adopted.
  • In the embodiments above, the speech processing apparatus 100 executes both the expression registration process and the expression providing process, but the speech processing apparatus that executes the expression registration process and the one that executes the expression providing process can also be configured separately. In that case, the plurality of singing expression data DS generated in the expression registration process of the registration apparatus are transferred to the expression providing apparatus and applied in its expression providing process. Similarly, the voice processing device that executes the expression registration process and the voice processing device that executes the singing evaluation process can be configured separately.
  • The voice processing device 100 can also be realized by a server device that communicates with a terminal device such as a mobile phone. For example, the voice processing device 100 executes an expression registration process that extracts the singing expression data DS by analyzing an audio signal R received from the terminal device and stores it in the storage device 12, or an expression providing process that transmits to the terminal device an audio signal Y in which the singing expression indicated by the singing expression data DS has been applied to the audio signal X.
  • the present invention can be realized as a voice processing system including a voice processing device (server device) and a terminal device that communicate with each other.
  • the speech processing apparatus 100 of each embodiment described above can be realized as a system (speech processing system) in which each function is distributed to a plurality of apparatuses.
  • In the second embodiment, the singing evaluation unit 46 evaluates the singing of the audio signal X using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12, but the singing evaluation unit 46 may also acquire the evaluation values Q from a device different from the storage device 12 to evaluate the skill of the singing of the audio signal X.
  • DESCRIPTION OF SYMBOLS: 100... voice processing device, 10... arithmetic processing device, 12... storage device, 14... sound collection device, 16... input device, 18... sound emitting device.

Abstract

A storage device (12) stores, for a plurality of different singing expressions, singing expression data DS indicating the singing expression and attribute data DA related to the singing expression. A section designation unit (34) designates each target section of a selected audio signal X according to an instruction from a user. An expression selection unit (36) refers to each attribute data DA to select, for each target section, singing expression data DS according to the instruction from the user (a search condition). An expression providing unit (38) imparts, to each target section of the selected audio signal X, the singing expression indicated by the singing expression data DS selected by the expression selection unit (36) for that target section.

Description

Voice processing device
The present invention relates to a technique for controlling the singing expression of a singing voice.
Various techniques for processing singing voices have been proposed. For example, Patent Document 1 discloses a technique for collecting segment data used for segment-concatenation singing synthesis. A singing voice of arbitrary lyrics can be synthesized by appropriately selecting and interconnecting the segment data collected by the technique of Patent Document 1.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2003-108179
An actual singing voice carries singing expressions (manners of singing) peculiar to the singer. However, since the technique of Patent Document 1 does not take the various singing expressions of the singing voice into account, the singing voice synthesized from the segment data tends to leave an audibly monotonous impression. In view of the above circumstances, an object of the present invention is to generate singing voices with various singing expressions.
In order to solve the above problems, the speech processing apparatus of the present invention includes an expression selection unit that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing unit that applies the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice.
In this aspect, since the singing expression indicated by the singing expression data is applied to the singing voice, singing voices with more varied singing expressions can be generated than with the technique of Patent Document 1. In particular, since the singing expressions indicated by the singing expression data are selectively applied to specific sections of the singing voice, the effect of generating singing voices with various singing expressions is particularly pronounced.
The expression selection unit may select first singing expression data and second singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expression indicated by the first singing expression data to a first section of the singing voice and the singing expression indicated by the second singing expression data to a second section of the singing voice different from the first section.
In this aspect, since a separate singing expression is applied to each section of the singing voice, the effect of generating singing voices with various singing expressions is particularly pronounced.
The expression selection unit may select two or more singing expression data indicating mutually different singing expressions, and the expression imparting unit may impart the singing expressions indicated by each of the two or more singing expression data selected by the expression selection unit to a specific section of the singing voice in an overlapping manner.
In the above aspect, since a plurality of singing expressions (typically singing expressions of different types) are imparted to the singing voice in an overlapping manner, the effect of being able to generate singing voices with diverse singing expressions is especially remarkable.
The device may include a storage unit that stores attribute data related to a singing expression in association with the singing expression data of that singing expression, and the expression selection unit may select singing expression data from the storage unit with reference to the attribute data of each singing expression data.
In the above aspect, since attribute data is associated with each singing expression data, the singing expression data of the singing expression to be imparted to the singing voice can be selected (retrieved) by referring to the attribute data.
The expression selection unit may select singing expression data in accordance with an instruction from a user.
In the above aspect, since singing expression data is selected in accordance with an instruction from the user, there is an advantage that diverse singing voices reflecting the user's intentions and preferences can be generated.
The expression imparting unit may impart the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice designated in accordance with an instruction from the user.
In the above aspect, since the singing expression is imparted to a section of the singing voice designated in accordance with an instruction from the user, there is an advantage that diverse singing voices reflecting the user's intentions and preferences can be generated.
Incidentally, various techniques for evaluating the skill of singing have been proposed. For example, a singing voice is evaluated by comparing the pitch and volume transitions of the singing voice with the pitch and volume transitions of a reference (exemplary) singing voice prepared in advance. However, the evaluation of actual singing depends not only on the accuracy of pitch and volume but also on the skill of the singing expression.
In view of the above circumstances, the voice processing device of the present invention may include a singing evaluation unit that evaluates a singing voice according to an evaluation value that corresponds to the singing expression data of a singing expression similar to the singing voice among a plurality of singing expression data and that indicates an evaluation of that singing expression.
In the above aspect, since the singing voice is evaluated according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice, there is an advantage that the singing voice can be appropriately evaluated from the viewpoint of the skill of the singing expression.
The singing evaluation unit may select, for each of a plurality of target sections of the singing voice, singing expression data of a singing expression similar to the singing expression of that target section, and evaluate the singing voice according to the evaluation values corresponding to the selected singing expression data.
In the above aspect, since the singing voice is evaluated according to the evaluation values corresponding to the singing expression data selected for each of the plurality of target sections of the singing voice, there is an advantage that specific target sections of the singing voice can be weighted in the evaluation. However, the target section may also be the entire section of the audio signal (the whole song).
The voice processing device may include a storage unit that stores, for a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression, and the singing evaluation unit may evaluate the singing voice according to the evaluation value stored in the storage unit that corresponds to the singing expression data of a singing expression similar to the singing voice among the plurality of singing expression data.
In the above aspect, since the singing voice is evaluated according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice, there is an advantage that the singing voice can be appropriately evaluated from the viewpoint of its similarity to the singing expressions registered in the storage unit.
The present invention also provides a voice processing method of selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions and imparting the singing expression indicated by the selected singing expression data to a specific section of a singing voice.
The voice processing device according to each of the above aspects is realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to the processing of singing voices, and is also realized by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. Specifically, a program according to a first aspect of the present invention executes an expression selection process of selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions, and an expression imparting process of imparting the singing expression indicated by the singing expression data selected in the expression selection process to a specific section of a singing voice. A program according to a second aspect of the present invention causes a computer including a storage unit that stores, for a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression, to execute a singing evaluation process of evaluating a singing voice according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice among the plurality of singing expression data.
The program according to each of the above aspects can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium such as a semiconductor recording medium or a magnetic recording medium may be included. The program of the present invention may also be provided in the form of distribution via a communication network and installed on a computer.
<Brief Description of Drawings>
FIG. 1 is a block diagram of a voice processing device according to a first embodiment of the present invention.
FIG. 2 is a functional configuration diagram of elements related to the expression registration process.
FIG. 3 is a block diagram of the singing segmentation unit.
FIG. 4 is a flowchart of the expression registration process.
FIG. 5 is a functional configuration diagram of elements related to the expression imparting process.
FIG. 6 is a flowchart of the expression imparting process.
FIG. 7 is an explanatory diagram of a specific example of the expression imparting process (imparting vibrato).
FIG. 8 is an explanatory diagram of the expression imparting process.
FIG. 9 is an explanatory diagram of the expression imparting process.
FIG. 10 is a functional configuration diagram of elements related to the singing evaluation process of a second embodiment.
FIG. 11 is a flowchart of the singing evaluation process.
FIG. 12 is a block diagram of a voice processing device according to a modification.
<First Embodiment>
FIG. 1 is a block diagram of a voice processing device 100 according to the first embodiment of the present invention. As shown in FIG. 1, the voice processing device 100 is realized as a computer system including an arithmetic processing device 10, a storage device 12, a sound collection device 14, an input device 16, and a sound emitting device 18.
The arithmetic processing device 10 executes a program stored in the storage device 12 to centrally control each element of the voice processing device 100. The storage device 12 stores the program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of plural types of recording media, may be employed as the storage device 12. A configuration is also possible in which the storage device 12 is installed in an external device (for example, an external server device) separate from the voice processing device 100, and the voice processing device 100 writes information to and reads information from the storage device 12 via a communication network such as the Internet. That is, the storage device 12 is not an essential element of the voice processing device 100.
The storage device 12 of the first embodiment stores a plurality of audio signals X indicating the time waveforms of mutually different singing voices (for example, singing voices of different singers). Each of the audio signals X is prepared in advance by recording a singing voice singing a song. The storage device 12 also stores a plurality of singing expression data DS indicating mutually different singing expressions and a plurality of attribute data DA related to the singing expressions indicated by the respective singing expression data DS. A singing expression is a characteristic of singing (a style of delivery or singing method peculiar to the singer). The singing expression data DS are stored in the storage device 12 for plural types of singing expressions extracted from singing voices uttered by different singers, and attribute data DA is associated with each of the singing expression data DS.
The singing expression data DS specifies various feature quantities relating to the musical expressiveness of the singing voice, for example: pitch or volume (distribution range); feature quantities of the frequency spectrum (for example, the spectrum within a specific band); the frequency or intensity of formants of specific orders; feature quantities related to voice quality (for example, the intensity ratio of overtone components to the fundamental component, or the intensity ratio of harmonic components to non-harmonic components); or MFCC (Mel-Frequency Cepstrum Coefficients). While the singing expressions exemplified above are tendencies of the singing voice over a comparatively short time, a configuration in which the singing expression data DS specifies long-term tendencies of the singing voice, such as tendencies of temporal change in pitch or volume and tendencies of various singing techniques (for example, vibrato, fall, and long tone), is also suitable.
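As one concrete illustration of such feature quantities, the following is a minimal sketch of extracting a pitch contour, MFCC, and a crude voice-quality measure from a recorded vocal using the open-source librosa library; the choice of features and the function layout are assumptions made for illustration only and are not part of the embodiment.
```python
# A minimal sketch of feature extraction for singing expression data DS,
# assuming the librosa audio-analysis library; the selected features
# (pitch contour, MFCC, harmonic energy ratio) are illustrative only.
import librosa
import numpy as np

def extract_expression_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=True)
    # Fundamental-frequency (pitch) contour over time.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # Mel-frequency cepstrum coefficients summarizing timbre.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Crude voice-quality proxy: harmonic vs. non-harmonic energy ratio.
    harmonic, percussive = librosa.effects.hpss(y)
    hnr = float(np.sum(harmonic**2) / (np.sum(percussive**2) + 1e-12))
    return {
        "f0": f0[voiced_flag],           # pitch contour, voiced frames only
        "mfcc_mean": mfcc.mean(axis=1),  # average timbre over the excerpt
        "harmonic_ratio": hnr,
    }
```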
The attribute data DA of each singing expression is information (metadata) related to the singer of the singing voice and to the song, and is used to search the singing expression data DS. Specifically, the attribute data DA specifies information on the singer who sang with each singing expression (for example, name, age, birthplace, gender, race, native language, and vocal range) and information on the song sung with each singing expression (for example, song title, composer, lyricist, genre, tempo, key, chord, range, and language). The attribute data DA can also specify words expressing the impression or atmosphere of the singing voice (for example, words such as "rhythmical" or "sweet"). Furthermore, the attribute data DA of the first embodiment includes an evaluation value Q (an index evaluating the skill of the singing expression of the singing expression data DS) according to the evaluation result of the singing voice sung with each singing expression. For example, an evaluation value Q calculated by a known singing evaluation process, or an evaluation value Q reflecting evaluations by users other than the singer, is included in the attribute data DA. The items specified by the attribute data DA are not limited to the above examples. For example, the attribute data DA can also specify in which section of the musical structure into which the song is divided (for example, phrases such as the A melody, the chorus, and the B melody) the singing expression was sung.
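To make the DS/DA pairing concrete, the following is a minimal sketch of how one registered singing expression and its metadata could be represented in memory; all field names are hypothetical and chosen only to mirror the items enumerated above.
```python
# A hypothetical in-memory representation of one singing expression (DS)
# paired with its attribute data (DA), mirroring the items listed above.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SingingExpressionDS:
    f0_contour: np.ndarray          # pitch tendency within the unit section
    mfcc_mean: np.ndarray           # timbre summary
    technique: str                  # e.g. "vibrato", "fall", "long tone"

@dataclass
class AttributeDataDA:
    singer: dict = field(default_factory=dict)   # name, age, gender, range, ...
    song: dict = field(default_factory=dict)     # title, genre, tempo, key, ...
    impressions: list[str] = field(default_factory=list)  # "rhythmical", "sweet"
    evaluation_q: float = 0.0       # skill evaluation value Q

# One database record: DS and DA stored in association, as in the storage device 12.
record = (
    SingingExpressionDS(np.zeros(100), np.zeros(13), "vibrato"),
    AttributeDataDA({"gender": "female"}, {"genre": "pop"}, ["sweet"], 82.5),
)
```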
The sound collection device 14 of FIG. 1 is a device (microphone) that picks up ambient sound. The sound collection device 14 of the first embodiment generates an audio signal R by picking up the singing voice of a singer singing a song. An A/D converter that converts the audio signal R from analog to digital is omitted from the figure for convenience. A configuration in which the audio signal R is stored in the storage device 12 (in which case the sound collection device 14 may be omitted) is also suitable.
The input device 16 is an operating device that accepts instructions from the user to the voice processing device 100, and is configured to include, for example, a plurality of controls operable by the user. For example, an operation panel installed on the housing of the voice processing device 100, or a remote control device separate from the voice processing device 100, is employed as the input device 16.
The arithmetic processing device 10 executes various control processes and arithmetic processes by executing the program stored in the storage device 12. Specifically, the arithmetic processing device 10 executes a process of extracting singing expression data DS by analyzing the audio signal R supplied from the sound collection device 14 and storing it in the storage device 12 (hereinafter, the "expression registration process"), and a process of generating an audio signal Y by imparting the singing expression indicated by each singing expression data DS stored in the storage device 12 by the expression registration process to an audio signal X in the storage device 12 (hereinafter, the "expression imparting process"). That is, the audio signal Y is an acoustic signal in which the singing expression of the audio signal X is made to match or resemble the singing expression of the singing expression data DS while the pronounced content (lyrics) of the audio signal X is maintained. For example, one of the expression registration process and the expression imparting process is selectively executed in accordance with an instruction from the user via the input device 16. The sound emitting device 18 of FIG. 1 (for example, a loudspeaker or headphones) reproduces the sound corresponding to the audio signal Y generated by the arithmetic processing device 10 in the expression imparting process. A D/A converter that converts the audio signal Y from digital to analog and an amplifier that amplifies the audio signal Y are omitted from the figure for convenience.
<Expression registration process>
FIG. 2 is a functional configuration diagram of the elements of the voice processing device 100 related to the expression registration process. By executing a program (expression registration program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression registration process (an analysis processing unit 20, a singing segmentation unit 22, a singing evaluation unit 24, a singing analysis unit 26, and an attribute acquisition unit 28), as shown in FIG. 2. A configuration in which the functions of FIG. 2 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 2, may also be employed.
The analysis processing unit 20 of FIG. 2 analyzes the audio signal R supplied from the sound collection device 14. As illustrated in FIG. 3, the analysis processing unit 20 of the first embodiment is configured to include a music structure analysis unit 20A, a singing technique analysis unit 20B, and a voice quality analysis unit 20C. The music structure analysis unit 20A analyzes the sections of the musical structure of the song corresponding to the audio signal R (for example, phrases such as the A melody, the chorus, and the B melody). The singing technique analysis unit 20B detects various singing techniques from the audio signal R, such as vibrato (a singing technique of finely fluctuating the pitch), scooping (a singing technique of rising to the target pitch from a pitch below it), and fall (a singing technique of descending to the target pitch from a pitch above it). The voice quality analysis unit 20C analyzes the voice quality of the singing voice (for example, the intensity ratio of overtone components to the fundamental component, or of harmonic components to non-harmonic components).
The singing segmentation unit 22 of FIG. 2 demarcates, within the audio signal R supplied from the sound collection device 14, the sections to which the generation of singing expression data DS is applied (hereinafter, "unit sections"). The singing segmentation unit 22 of the first embodiment demarcates each unit section of the audio signal R according to the musical structure, the singing techniques, and the voice quality. Specifically, the singing segmentation unit 22 divides the audio signal R into unit sections whose boundaries are the end points of the sections of the musical structure analyzed by the music structure analysis unit 20A, the end points of the sections in which the singing technique analysis unit 20B detected the various singing techniques, and the time points at which the voice quality analyzed by the voice quality analysis unit 20C changes. The method of dividing the audio signal R into a plurality of unit sections is not limited to the above examples. For example, the audio signal R may be divided using sections designated by the user through operation of the input device 16 as the unit sections. Configurations are also possible in which the audio signal R is divided into unit sections at time points set randomly on the time axis, or according to the evaluation value Q calculated by the singing evaluation unit 24 (for example, a configuration in which each unit section is demarcated with the time points at which the evaluation value Q fluctuates as boundaries). It is also possible to take the entire section of the audio signal R (the whole song) as a unit section.
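The demarcation step can be pictured as merging boundary candidates from the three analyses into one sorted list of section edges; the sketch below is a minimal illustration under that assumption, with all function and parameter names hypothetical.
```python
# A minimal sketch of unit-section demarcation: boundary candidates from the
# music-structure, singing-technique, and voice-quality analyses are merged,
# sorted, and turned into adjacent (start, end) unit sections in seconds.
def demarcate_unit_sections(structure_bounds, technique_bounds,
                            quality_change_points, total_duration):
    candidates = {0.0, total_duration}
    candidates.update(structure_bounds)       # phrase end points
    candidates.update(technique_bounds)       # end points of detected techniques
    candidates.update(quality_change_points)  # voice-quality change instants
    times = sorted(t for t in candidates if 0.0 <= t <= total_duration)
    return list(zip(times[:-1], times[1:]))

# e.g. demarcate_unit_sections([12.0, 34.5], [20.1, 22.7], [28.0], 60.0)
# -> [(0.0, 12.0), (12.0, 20.1), (20.1, 22.7), (22.7, 28.0), (28.0, 34.5), (34.5, 60.0)]
```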
The singing evaluation unit 24 evaluates the skill of the singing indicated by the audio signal R supplied from the sound collection device 14. Specifically, the singing evaluation unit 24 sequentially calculates, for each unit section demarcated by the singing segmentation unit 22, an evaluation value Q evaluating the skill of the singing of the audio signal R. Any known singing evaluation process may be employed for the calculation of the evaluation value Q by the singing evaluation unit 24. The singing techniques analyzed by the singing technique analysis unit 20B and the voice quality analyzed by the voice quality analysis unit 20C described above may also be applied to the singing evaluation by the singing evaluation unit 24.
The singing analysis unit 26 of FIG. 2 generates singing expression data DS for each unit section by analyzing the audio signal R. Specifically, the singing analysis unit 26 extracts acoustic feature quantities such as pitch and volume (feature quantities that influence the singing expression) from the audio signal R and generates singing expression data DS indicating the short-term or long-term tendency of each feature quantity (that is, the singing expression). Known acoustic analysis techniques (for example, the techniques disclosed in Japanese Unexamined Patent Application Publication Nos. 2011-013454 and 2011-028230) may be employed for extracting the singing expression. It is also possible to generate, from one unit section, a plurality of singing expression data DS corresponding to different types of singing expressions. In the above example, one singing expression data DS is generated for each unit section, but one singing expression data DS may also be generated from feature quantities of plural different unit sections. For example, a configuration may be employed in which singing expression data DS is generated by averaging the feature quantities of plural unit sections whose attribute data DA approximate or match each other, or in which singing expression data DS is generated by the weighted addition of feature quantities over plural unit sections using weights according to the evaluation value Q of each unit section by the singing evaluation unit 24.
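The weighted-addition variant mentioned above can be sketched as follows, assuming each unit section has already yielded a feature vector and an evaluation value Q; the normalization scheme is an assumption made for illustration.
```python
# A minimal sketch of generating one singing expression data DS from the
# feature vectors of several unit sections, weighted by their evaluation
# values Q (sections evaluated as more skilful contribute more).
import numpy as np

def merge_features_by_q(features: list[np.ndarray], q_values: list[float]) -> np.ndarray:
    weights = np.asarray(q_values, dtype=float)
    weights = weights / weights.sum()     # normalize Q values into weights
    stacked = np.stack(features)          # shape: (num_sections, feature_dim)
    return (weights[:, None] * stacked).sum(axis=0)

# e.g. three vibrato sections with Q = 60, 80, 100: the best-rated section
# dominates the merged expression.
merged = merge_features_by_q([np.array([5.5, 0.3]),
                              np.array([6.0, 0.4]),
                              np.array([6.2, 0.5])], [60, 80, 100])
```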
The attribute acquisition unit 28 generates attribute data DA for each unit section demarcated by the singing segmentation unit 22. Specifically, the attribute acquisition unit 28 registers in the attribute data DA the various items of information designated by the user through operation of the input device 16. The attribute acquisition unit 28 also includes, in the attribute data DA of each unit section, the evaluation value Q calculated by the singing evaluation unit 24 for that unit section (for example, the average of the evaluation values within the unit section).
The singing expression data DS generated by the singing analysis unit 26 for each unit section and the attribute data DA generated by the attribute acquisition unit 28 for each unit section are stored in the storage device 12 in mutual association, pairing those that share a common unit section. By repeating the expression registration process exemplified above for the audio signals R of a plurality of different singing voices, singing expression data DS and attribute data DA are accumulated in the storage device 12 for each of the plural types of singing expressions extracted from the singing voices uttered by the respective singers. That is, a database of a wide variety of singing expressions (singing expressions of different singers and of different types) is built in the storage device 12. It is also possible to integrate a plurality of singing expression data DS into one singing expression data DS. For example, a configuration may be employed in which new singing expression data DS is generated by averaging plural singing expression data DS whose attribute data DA approximate or match each other, or by the weighted addition of plural singing expression data DS using weights according to the evaluation value Q by the singing evaluation unit 24.
FIG. 4 is a flowchart of the expression registration process. As shown in FIG. 4, when the user instructs execution of the expression registration process by operating the input device 16 (SA1), the analysis processing unit 20 analyzes the audio signal R supplied from the sound collection device 14 (SA2). The singing segmentation unit 22 divides the audio signal R into unit sections according to the analysis results of the analysis processing unit 20 (SA3), and the singing analysis unit 26 generates singing expression data DS for each unit section by analyzing the audio signal R (SA4). The singing evaluation unit 24 calculates, for each unit section, an evaluation value Q according to the skill of the singing indicated by the audio signal R (SA5), and the attribute acquisition unit 28 generates, for each unit section, attribute data DA including the evaluation value Q calculated by the singing evaluation unit 24 (SA6). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 are stored in the storage device 12 for each unit section (SA7). The singing expressions designated by the singing expression data DS accumulated in the storage device 12 by the expression registration process described above are imparted to the audio signal X in the expression imparting process described below.
<Expression imparting process>
FIG. 5 is a functional configuration diagram of the elements of the voice processing device 100 related to the expression imparting process. By executing a program (expression imparting program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression imparting process (a singing selection unit 32, a section designation unit 34, an expression selection unit 36, and an expression imparting unit 38), as shown in FIG. 5. A configuration in which the functions of FIG. 5 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) executes some of the functions illustrated in FIG. 5, may also be employed.
The singing selection unit 32 selects one of the plurality of audio signals X stored in the storage device 12 (hereinafter, the "selected audio signal X"). For example, the singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X in the storage device 12 in accordance with an instruction from the user via the input device 16 (an instruction selecting an audio signal X).
The section designation unit 34 designates, within the selected audio signal X selected by the singing selection unit 32, one or more sections to which the singing expression of singing expression data DS is to be imparted (hereinafter, "target sections"). Specifically, the section designation unit 34 designates each target section in accordance with an instruction from the user via the input device 16. For example, the section designation unit 34 demarcates as a target section the section between two points designated by the user on the time axis (for example, on the waveform of the selected audio signal X) through operation of the input device 16. The plurality of target sections designated by the section designation unit 34 may overlap each other on the time axis. It is also possible to designate the entire section of the selected audio signal X (the whole song) as a target section.
The expression selection unit 36 of FIG. 5 sequentially selects, for each target section designated by the section designation unit 34, the singing expression data DS actually applied in the expression imparting process (hereinafter, the "target expression data DS") from among the plurality of singing expression data DS stored in the storage device 12. The expression selection unit 36 of the first embodiment selects the target expression data DS from the plurality of singing expression data DS by a search process using the attribute data DA stored in the storage device 12 in association with each singing expression data DS.
For example, by operating the input device 16 as appropriate, the user can designate a search condition (for example, a search term) for the target expression data DS for each target section. The expression selection unit 36 selects, for each target section, as the target expression data DS, the singing expression data DS corresponding to attribute data DA matching the search condition designated by the user from among the plurality of singing expression data DS in the storage device 12. For example, when the user designates a search condition on the singer (for example, age or gender), the target expression data DS corresponding to the attribute data DA of singers matching the condition (that is, the singing expressions of singers matching the condition) is retrieved. When the user designates a search condition on the song (for example, the genre or range of the song), the target expression data DS corresponding to the attribute data DA of songs matching the condition (that is, the singing expressions of songs matching the condition) is retrieved. When the user designates a search condition on the evaluation value Q of the singing voice (for example, a numerical range), the target expression data DS corresponding to attribute data DA with an evaluation value Q matching the condition (that is, the singing expressions of singers at the level intended by the user) is retrieved. As understood from the above description, the expression selection unit 36 of the first embodiment is expressed as an element that selects singing expression data DS (target expression data DS) in accordance with instructions from the user.
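A minimal sketch of such an attribute-based search is given below, reusing the hypothetical record layout sketched earlier; the condition format and function name are assumptions for illustration.
```python
# A minimal sketch of the attribute-based search: given user-specified
# conditions, return the DS records whose attribute data DA match.
def search_expressions(records, singer_cond=None, song_cond=None, q_range=None):
    hits = []
    for ds, da in records:
        if singer_cond and any(da.singer.get(k) != v for k, v in singer_cond.items()):
            continue
        if song_cond and any(da.song.get(k) != v for k, v in song_cond.items()):
            continue
        if q_range and not (q_range[0] <= da.evaluation_q <= q_range[1]):
            continue
        hits.append((ds, da))
    return hits

# e.g. singing expressions of female singers in pop songs rated 80 <= Q <= 100:
# search_expressions(db, {"gender": "female"}, {"genre": "pop"}, (80, 100))
```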
The expression imparting unit 38 of FIG. 5 generates the audio signal Y by imparting the singing expression of the target expression data DS to the selected audio signal X selected by the singing selection unit 32. Specifically, the expression imparting unit 38 imparts, to each of the plurality of target sections designated by the section designation unit 34 in the selected audio signal X, the singing expression of the target expression data DS selected by the expression selection unit 36 for that target section. That is, singing expressions according to instructions from the user (designation of search conditions) are imparted to the target sections of the selected audio signal X designated according to instructions from the user. Any known technique may be employed for imparting a singing expression to the selected audio signal X. In addition to a configuration in which the singing expression of the selected audio signal X is replaced with the singing expression of the target expression data DS (a configuration in which the singing expression of the selected audio signal X does not remain in the audio signal Y), a configuration in which the singing expression of the target expression data DS is imparted cumulatively on top of the singing expression of the selected audio signal X (for example, a configuration in which both the singing expression of the selected audio signal X and the singing expression of the target expression data DS are reflected in the audio signal Y) may also be employed.
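The two imparting modes (replacement versus cumulative addition) can be contrasted with the following minimal sketch, which models the singing expression simply as a pitch-deviation contour around a smoothed reference; this modeling is an assumption for illustration, not the embodiment's imparting technique.
```python
# A minimal sketch contrasting the two imparting modes on a per-frame
# pitch contour; deviation_ds stands in for the expression of the target
# expression data DS over the target section.
import numpy as np

def impart_expression(pitch_x: np.ndarray, deviation_ds: np.ndarray,
                      cumulative: bool) -> np.ndarray:
    # Smoothed reference contour standing in for "X without its expression".
    kernel = np.ones(9) / 9.0
    base = np.convolve(pitch_x, kernel, mode="same")
    if cumulative:
        return pitch_x + deviation_ds   # X's expression and DS's both remain
    return base + deviation_ds          # DS's expression replaces X's
```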
FIG. 6 is a flowchart of the expression imparting process. As shown in FIG. 6, when the user instructs execution of the expression imparting process by operating the input device 16 (SB1), the singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SB2), and the section designation unit 34 designates one or more target sections of the selected audio signal X (SB3). The expression selection unit 36 selects the target expression data DS from the plurality of singing expression data DS stored in the storage device 12 (SB4), and the expression imparting unit 38 generates the audio signal Y by imparting the singing expression of the target expression data DS to each target section of the selected audio signal X selected by the singing selection unit 32 (SB5). The audio signal Y generated by the expression imparting unit 38 is reproduced from the sound emitting device 18 (SB6).
FIG. 7 is an explanatory diagram of a specific example of the expression imparting process applying singing expression data DS indicating vibrato. FIG. 7 illustrates the temporal change of the pitch of the selected audio signal X and a plurality of singing expression data DS (DS[1] to DS[4]). Each singing expression data DS is generated by the expression registration process on an audio signal R recording the singing voice of a different singer. Accordingly, the vibrato indicated by each singing expression data DS (DS[1] to DS[4]) differs in characteristics such as the pitch fluctuation period (rate) and fluctuation width (depth). As shown in FIG. 7, when a target section of the selected audio signal X is designated, for example according to an instruction from the user (SB3), and target expression data DS[3] is selected from the plurality of singing expression data DS, for example according to an instruction from the user (SB4), the expression imparting process generates an audio signal Y in which the vibrato indicated by the target expression data DS[3] is imparted to the target section of the selected audio signal X (SB5). As understood from the above description, the vibrato of the desired singing expression data DS is imparted to the desired target section of an audio signal X of a singing voice sung without vibrato (for example, the singing voice of a singer who is poor at singing with vibrato). The configuration by which the user selects the target expression data DS from the plurality of singing expression data DS is arbitrary. For example, a configuration is suitable in which a predetermined singing voice to which the singing expression of each singing expression data DS has been imparted is reproduced from the sound emitting device 18 for the user to listen to (that is, to audition), and the user selects the target expression data DS by operating the input device 16 (for example, buttons or a touch panel) based on the result of listening.
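As a worked illustration of the DS[1] to DS[4] example, the sketch below imparts a vibrato characterized by a fluctuation rate and depth to a target section of a pitch contour; the sinusoidal model and parameter values are assumptions for illustration.
```python
# A minimal sketch of imparting a vibrato expression characterized by a
# fluctuation rate (Hz) and depth (cents): a sinusoidal pitch deviation is
# applied over the target section of a pitch contour given in Hz.
import numpy as np

def impart_vibrato(pitch: np.ndarray, frame_rate: float, start: int, end: int,
                   rate_hz: float, depth_cents: float) -> np.ndarray:
    out = pitch.copy()
    t = np.arange(end - start) / frame_rate
    deviation = depth_cents * np.sin(2 * np.pi * rate_hz * t)  # in cents
    out[start:end] *= 2 ** (deviation / 1200)  # cents -> frequency ratio
    return out

# e.g. DS[3] modeled as a 5.5 Hz, 50-cent vibrato applied to frames 200-400
# of a contour sampled at 100 frames per second:
# y_pitch = impart_vibrato(x_pitch, 100.0, 200, 400, 5.5, 50.0)
```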
In FIG. 8, a case is assumed in which the expression selection unit 36 selects target expression data DS1 for a target section S1 of the selected audio signal X and selects target expression data DS2 for a target section S2 that differs from the target section S1. The expression imparting unit 38 imparts the singing expression E1 indicated by the target expression data DS1 to the target section S1, and imparts the singing expression E2 indicated by the target expression data DS2 to the target section S2.
As shown in FIG. 9, when the target section S1 and the target section S2 overlap (when the target section S2 is contained within the target section S1), the singing expression E1 of the target expression data DS1 and the singing expression E2 of the target expression data DS2 are imparted in an overlapping manner to the overlapping interval of the selected audio signal X (that is, to the target section S2). That is, a plurality of (typically plural types of) singing expressions are imparted to a specific section of the selected audio signal X in an overlapping manner. For example, both a singing expression E1 relating to pitch fluctuation and a singing expression E2 relating to volume fluctuation are imparted to the selected audio signal X (the target section S2). The audio signal Y generated by the above processing is supplied to the sound emitting device 18 and reproduced as sound.
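The overlapping case can be sketched as two independent modifications acting on different signal attributes, so that both take effect on the shared interval; the section encoding and function name below are hypothetical.
```python
# A minimal sketch of the overlapping case of FIG. 9: a pitch expression E1
# over target section S1 and a volume expression E2 over target section S2
# are applied independently, so both act on the overlap (S2 inside S1).
import numpy as np

def impart_overlapping(pitch, volume, s1, s2, pitch_dev_e1, gain_env_e2):
    # pitch_dev_e1 spans S1 and gain_env_e2 spans S2 (frame counts match).
    p, v = pitch.copy(), volume.copy()
    p[s1[0]:s1[1]] += pitch_dev_e1   # E1: pitch fluctuation over S1
    v[s2[0]:s2[1]] *= gain_env_e2    # E2: volume fluctuation over S2
    return p, v

# Within S2 both the modified pitch (from E1) and the modified volume
# (from E2) are reflected in the resulting audio signal Y.
```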
As described above, in the first embodiment, the singing expressions of the plurality of singing expression data DS indicating mutually different singing expressions are selectively imparted to the target sections of the selected audio signal X. Therefore, singing voices (the audio signal Y) with more diverse singing expressions can be generated than with the technique of Patent Document 1.
In the first embodiment in particular, a separate singing expression is imparted to each of the plurality of target sections designated in the selected audio signal X (FIGS. 8 and 9); compared with a configuration in which the sections to which singing expressions can be imparted are limited to a single section of the selected audio signal X, the above-described effect of being able to generate singing voices with diverse singing expressions is especially remarkable. Furthermore, in the first embodiment, a plurality of (plural types of) singing expressions can be imparted to a target section of the selected audio signal X in an overlapping manner (FIG. 9); compared with a configuration in which the singing expressions imparted to a target section are limited to one type, the effect of being able to generate singing voices with diverse singing expressions is especially remarkable. However, a configuration in which the section to which a singing expression is imparted is limited to a single section of the selected audio signal X, and a configuration in which the singing expressions imparted to a target section are limited to one type, are also encompassed within the scope of the present invention.
In the first embodiment, since the target sections of the selected audio signal X are designated according to instructions from the user and the search conditions on the attribute data DA are set according to instructions from the user, there is also an advantage that diverse singing voices fully reflecting the user's intentions and preferences can be generated.
<Second Embodiment>
A second embodiment of the present invention will now be described. In the voice processing device 100 of the first embodiment, the plurality of singing expression data DS stored in the storage device 12 were used to adjust the singing expression of the audio signal X. In the voice processing device 100 of the second embodiment, the plurality of singing expression data DS stored in the storage device 12 are used for the evaluation of the audio signal X. In each of the embodiments exemplified below, for elements whose operation and function are the same as in the first embodiment, the reference signs used in the description of the first embodiment are carried over and detailed description of each is omitted as appropriate.
FIG. 10 is a functional configuration diagram of the elements of the voice processing device 100 of the second embodiment related to the process of evaluating the audio signal X (hereinafter, the "singing evaluation process"). The storage device 12 of the second embodiment stores a plurality of pairs of singing expression data DS and attribute data DA generated by the same expression registration process as in the first embodiment. As described above for the first embodiment, the attribute data DA corresponding to each singing expression data DS includes the evaluation value Q (an index evaluating the skill of the singing expression of the singing expression data DS) calculated by the singing evaluation unit 24 of FIG. 2.
By executing a program (singing evaluation program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the singing evaluation process (a singing selection unit 42, a section designation unit 44, and a singing evaluation unit 46), as shown in FIG. 10. For example, the expression imparting process of the first embodiment and the singing evaluation process detailed below are selectively executed in accordance with an instruction from the user via the input device 16. In the second embodiment, however, the expression imparting process may also be omitted. A configuration in which the functions of FIG. 10 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 10, may also be employed.
The singing selection unit 42 of FIG. 10 selects, from the plurality of audio signals X stored in the storage device 12, the selected audio signal X to be evaluated. Specifically, like the singing selection unit 32 of the first embodiment, the singing selection unit 42 selects the selected audio signal X from the storage device 12 in accordance with an instruction from the user via the input device 16. The section designation unit 44 designates one or more target sections to be evaluated within the selected audio signal X selected by the singing selection unit 42. Specifically, like the section designation unit 34 of the first embodiment, the section designation unit 44 designates each target section in accordance with an instruction from the user via the input device 16. It is also possible to designate the entire section of the selected audio signal X as the target section.
The singing evaluation unit 46 of FIG. 10 evaluates the skill of the singing of the selected audio signal X selected by the singing selection unit 42, using the singing expression data DS and the attribute data DA (evaluation values Q) stored in the storage device 12. That is, the singing evaluation unit 46 calculates an evaluation value Z for the selected audio signal X according to the evaluation values Q in the attribute data DA corresponding to the singing expression data DS, among the plurality of singing expression data DS in the storage device 12, whose singing expressions are similar to the respective target sections of the selected audio signal X. The specific operation of the singing evaluation unit 46 is described below.
First, the singing evaluation unit 46 calculates, for each of the plurality of singing expression data DS in the storage device 12 and for each target section, the degree of similarity (correlation or distance) between the singing expression indicated by that singing expression data DS and the singing expression of the target section of the selected audio signal X, and sequentially selects, for each of the plurality of target sections of the selected audio signal X, the singing expression data DS whose similarity to the singing expression of the target section is greatest among the plurality of singing expression data DS. Any known technique for comparing feature quantities may be employed for calculating the similarity of singing expressions.
Then, the singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X by the weighted addition (or averaging), over the plurality of target sections of the selected audio signal X, of the evaluation values Q of the attribute data DA corresponding to the singing expression data DS selected for the respective target sections. As understood from the above description, the more target sections sung with singing expressions similar to highly rated singing expressions the selected audio signal X contains, the larger the value to which its evaluation value Z is set. The evaluation value Z calculated by the singing evaluation unit 46 is reported to the user, for example, by image display on a display device (not shown) or by audio playback through the sound emitting device 18.
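The two steps above can be sketched together as follows, using cosine similarity as one concrete instance of the "correlation or distance" comparison and a plain average as one instance of the weighted addition; both choices are assumptions made for illustration.
```python
# A minimal sketch of the singing evaluation: for each target section, find
# the registered DS whose feature vector is most similar, then average the
# corresponding Q values into the overall evaluation value Z.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def evaluate_singing(section_features, registered):
    # registered: list of (feature_vector, evaluation_q) pairs from storage.
    q_per_section = []
    for feat in section_features:
        best = max(registered, key=lambda r: cosine(feat, r[0]))
        q_per_section.append(best[1])
    return float(np.mean(q_per_section))   # evaluation value Z (plain average)

# Per-section weights (e.g. by section length) could replace np.mean to
# realize the weighted-addition variant mentioned above.
```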
FIG. 11 is a flowchart of the singing evaluation process. As shown in FIG. 11, when the user instructs execution of the singing evaluation process by operating the input device 16 (SC1), the singing selection unit 42 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SC2), and the section designation unit 44 designates one or more target sections of the selected audio signal X (SC3). The singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X using the singing expression data DS and the attribute data DA stored in the storage device 12 (SC4). The evaluation value Z calculated by the singing evaluation unit 46 is reported to the user (SC5).
As described above, in the second embodiment the evaluation value Z of the selected voice signal X is calculated according to the evaluation values Q of the singing expression data DS whose singing expressions are similar to the selected voice signal X. The selected voice signal X can therefore be evaluated appropriately from the viewpoint of the skill of its singing expression (its similarity to the singing expressions registered in the expression registration process). As is also clear from the above description, in the second embodiment the information in the attribute data DA other than the evaluation value Q may be omitted. That is, the storage device 12 of the second embodiment can be expressed generically as an element that stores, for each of plural mutually different singing expressions, singing expression data DS indicating the singing expression and an evaluation value Q indicating the evaluation of that singing expression.
<Modifications>
 Each of the embodiments described above can be modified in various ways. Specific modifications are illustrated below; two or more aspects arbitrarily selected from the following examples may be combined as appropriate.
(1) The targets of the expression imparting process of the first embodiment and the singing evaluation process of the second embodiment are not limited to a voice signal X recorded in advance and stored in the storage device 12. For example, a voice signal X generated by the sound collection device 14, a voice signal X reproduced from a portable or built-in recording medium (for example, a CD), or a voice signal X received from another communication terminal via a communication network (for example, a streaming audio signal) may also be subjected to the expression imparting process or the singing evaluation process. A configuration is also possible in which these processes are applied to a voice signal X generated by a known voice synthesis process (for example, concatenative singing synthesis). In the embodiments described above, the expression imparting process and the singing evaluation process were applied to a recorded voice signal X, but if each target section on the time axis is designated in advance, for example, these processes can also be executed in real time in parallel with the supply of the voice signal X.
In the embodiments described above, one of plural voice signals X was selected as the selected voice signal X, but the selection of the voice signal X (the singing selection unit 32 or 42) may be omitted. In a configuration in which the entire span of the voice signal X (the whole piece of music) is designated as the target section, the section designation unit 34 may also be omitted. The voice processing device that executes the expression imparting process can therefore be expressed generically, as illustrated in FIG. 12, as a device comprising an expression selection unit 36 that selects the singing expression data DS to be applied from plural singing expression data DS, and an expression imparting unit 38 that imparts the singing expression indicated by the selected singing expression data DS to a specific section of the singing voice (voice signal X); a minimal sketch of this two-unit structure appears after this variation.
Likewise, the target of the expression registration process is not limited to the voice signal R generated by the sound collection device 14. For example, a voice signal R reproduced from a portable or built-in recording medium, or a voice signal R received from another communication terminal via a communication network, may also be the target of the expression registration process. The expression registration process can likewise be executed in real time in parallel with the supply of the voice signal R.
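A minimal sketch of the two-unit structure of FIG. 12 follows, with the actual signal transformation left abstract; every class and method name here is invented for illustration.

    # Sketch of the generic device of FIG. 12. All names are illustrative.
    class ExpressionSelectionUnit:
        """Corresponds to the expression selection unit 36."""
        def __init__(self, expression_db):
            self.expression_db = expression_db  # plural singing expression data DS

        def select(self, predicate):
            # Pick the DS to apply, e.g. by matching its attribute data DA.
            return next(ds for ds in self.expression_db if predicate(ds))

    class ExpressionImpartingUnit:
        """Corresponds to the expression imparting unit 38."""
        def impart(self, voice_signal, section, ds):
            # Impose the singing expression of ds onto the given section of
            # the singing voice; the transformation itself is left abstract.
            raise NotImplementedError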
(2) In the embodiments described above, the expression imparting process of the first embodiment and the singing evaluation process of the second embodiment were applied to a voice signal X representing the time waveform of a singing voice, but the representation format of the singing voice subjected to these processes is arbitrary. Specifically, the singing voice may be represented by synthesis information (for example, a VSQ-format file) that designates, in time series, a pitch and a pronounced character (lyric) for each note of a piece of music. In that case, the expression imparting unit 38 of the first embodiment imparts singing expressions by the same expression imparting process as in the first embodiment while sequentially synthesizing the singing voice designated by the synthesis information, for example by concatenative singing synthesis. Similarly, the singing evaluation unit 46 of the second embodiment executes the same singing evaluation process as in the second embodiment while sequentially synthesizing the singing voice designated by the synthesis information by voice synthesis.
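For concreteness, note-wise synthesis information of the kind this variation mentions might be modeled as below; the field names are assumptions and do not reproduce the actual VSQ schema.

    from dataclasses import dataclass

    @dataclass
    class Note:
        start_sec: float      # onset time on the musical time axis
        duration_sec: float   # note length
        pitch: int            # e.g. a MIDI note number (an assumption)
        lyric: str            # pronounced character(s) for the note

    # A toy score designating pitch and lyric in time series per note.
    synthesis_info = [
        Note(0.0, 0.5, 67, "sa"),
        Note(0.5, 0.5, 69, "ku"),
        Note(1.0, 1.0, 71, "ra"),
    ]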
(3) In the first embodiment, one set of target expression data DS was selected for each target section, but the expression selection unit 36 may select plural (typically plural kinds of) target expression data DS for a single target section. In that case, the singing expression of each of the plural target expression data DS selected by the expression selection unit 36 is imparted, in overlapping fashion, to the one target section of the selected voice signal X. It is also possible to impart to the target section the singing expression of a single set of singing expression data DS obtained by integrating the plural target expression data DS selected for that section (for example, singing expression data DS obtained by weighted addition of the plural target expression data DS); a sketch of this integration follows.
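The integration by weighted addition mentioned above could be sketched as follows, under the assumption (not stated in the variation) that the expression data are commensurable parameter vectors of equal length.

    import numpy as np

    def integrate_expressions(ds_list, weights):
        """Combine plural singing expression data DS selected for one target
        section into a single DS by normalized weighted addition."""
        w = np.asarray(weights, dtype=float)
        stacked = np.stack([np.asarray(ds, dtype=float) for ds in ds_list])
        return (w[:, None] * stacked).sum(axis=0) / w.sum()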
(4) In the first embodiment, singing expression data DS matching an instruction from the user was selected by designating search conditions, but the method by which the expression selection unit 36 selects singing expression data DS is arbitrary. For example, the singing voice of the singing expression indicated by each set of singing expression data DS may be played back from the sound emitting device 18 for the user to audition, and the expression selection unit 36 may then select the singing expression data DS that the user designates in view of the audition. Configurations are also possible in which the singing expression data DS stored in the storage device 12 are selected at random, or selected according to a predetermined rule chosen in advance.
(5) In the first embodiment, the voice signal Y generated by the expression imparting unit 38 was supplied to the sound emitting device 18 and played back, but the method of outputting the voice signal Y is arbitrary. For example, configurations may be adopted in which the voice signal Y generated by the expression imparting unit 38 is stored on a specific recording medium (for example, the storage device 12 or a portable recording medium), or in which the voice signal Y is transmitted from a communication device to another communication terminal.
(6) The first embodiment illustrated a voice processing device 100 that executes both the expression registration process and the expression imparting process, but the voice processing device that executes the expression registration process and the voice processing device that executes the expression imparting process may be configured as separate bodies. In that case, the plural singing expression data DS generated by the expression registration process of the registration device are transferred to the expression-imparting device and applied in the expression imparting process. Similarly, in the second embodiment, the voice processing device that executes the expression registration process and the voice processing device that executes the singing evaluation process may be configured separately.
(7) The voice processing device 100 may also be realized as a server device that communicates with a terminal device such as a mobile phone. For example, the voice processing device 100 executes an expression registration process that extracts singing expression data DS by analyzing a voice signal R received from the terminal device and stores it in the storage device 12, and an expression imparting process that transmits to the terminal device a voice signal Y obtained by imparting to a voice signal X the singing expression indicated by the singing expression data DS. That is, the present invention can also be realized as a voice processing system comprising a voice processing device (server device) and a terminal device that communicate with each other. The voice processing device 100 of each embodiment described above can likewise be realized as a system (voice processing system) in which its functions are distributed over plural devices.
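Purely as a structural sketch, the two server-side roles this variation describes might be grouped as below; analyze_singing_expression and apply_singing_expression are hypothetical helpers, not functions defined by the embodiment.

    class VoiceProcessingServer:
        """Sketch of the server-side voice processing device 100."""
        def __init__(self, storage):
            self.storage = storage

        def register_expression(self, voice_signal_r):
            # Expression registration: analyze the voice signal R received
            # from a terminal and store the extracted data DS.
            ds = analyze_singing_expression(voice_signal_r)  # hypothetical
            self.storage.add(ds)
            return ds

        def impart_expression(self, voice_signal_x, ds):
            # Expression imparting: return the voice signal Y, obtained by
            # imparting the singing expression of DS to X, for transmission
            # back to the terminal.
            return apply_singing_expression(voice_signal_x, ds)  # hypothetical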
(8) In the second embodiment, the singing evaluation unit 46 evaluated the skill of the singing in the voice signal X using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12, but the singing evaluation unit 46 may instead obtain the evaluation values Q from a device other than the storage device 12 and evaluate the skill of the singing in the voice signal X accordingly.
This application is based on Japanese Patent Application No. 2013-053983, filed March 15, 2013, the contents of which are incorporated herein by reference.
According to the present invention, singing voices with a variety of singing expressions can be generated.
DESCRIPTION OF SYMBOLS: 100: voice processing device; 10: arithmetic processing device; 12: storage device; 14: sound collection device; 16: input device; 18: sound emitting device; 20: analysis processing unit; 20A: music structure analysis unit; 20B: singing technique analysis unit; 20C: voice quality analysis unit; 22: singing segmentation unit; 24, 46: singing evaluation unit; 26: singing analysis unit; 28: attribute acquisition unit; 32, 42: singing selection unit; 34, 44: section designation unit; 36: expression selection unit; 38: expression imparting unit.

Claims (7)

  1.  A voice processing device comprising:
     an expression selection unit configured to select singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions; and
     an expression imparting unit configured to impart the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of a singing voice.
  2.  The voice processing device according to claim 1, wherein
     the expression selection unit selects two or more singing expression data indicating mutually different singing expressions, and
     the expression imparting unit imparts the singing expressions indicated by the respective two or more singing expression data selected by the expression selection unit, in overlapping fashion, to the specific section of the singing voice.
  3.  The voice processing device according to claim 1 or claim 2, further comprising:
     a storage unit configured to store attribute data related to a singing expression in association with the singing expression data of that singing expression,
     wherein the expression selection unit selects singing expression data from the storage unit with reference to the attribute data of each singing expression data.
  4.  The voice processing device according to any one of claims 1 to 3, wherein
     the expression selection unit selects the singing expression data in accordance with an instruction from a user, and
     the expression imparting unit imparts the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice designated in accordance with an instruction from the user.
  5.  The voice processing device according to claim 1, further comprising:
     a singing evaluation unit configured to evaluate a singing voice according to an evaluation value that corresponds to singing expression data, among the plurality of singing expression data, indicating a singing expression similar to the singing voice, the evaluation value indicating an evaluation of that singing expression.
  6.  The voice processing device according to claim 5, further comprising:
     a storage unit configured to store, for each of a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression,
     wherein the singing evaluation unit evaluates the singing voice according to the evaluation value, stored in the storage unit, corresponding to the singing expression data, among the plurality of singing expression data, indicating a singing expression similar to the singing voice.
  7.  A voice processing method comprising:
     selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions; and
     imparting the singing expression indicated by the selected singing expression data to a specific section of a singing voice.
PCT/JP2014/056570 2013-03-15 2014-03-12 Voice processing device WO2014142200A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480014605.4A CN105051811A (en) 2013-03-15 2014-03-12 Voice processing device
KR1020157024316A KR20150118974A (en) 2013-03-15 2014-03-12 Voice processing device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013-053983 2013-03-15
JP2013053983A JP2014178620A (en) 2013-03-15 2013-03-15 Voice processor

Publications (1)

Publication Number Publication Date
WO2014142200A1 true WO2014142200A1 (en) 2014-09-18

Family

ID=51536851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/056570 WO2014142200A1 (en) 2013-03-15 2014-03-12 Voice processing device

Country Status (5)

Country Link
JP (1) JP2014178620A (en)
KR (1) KR20150118974A (en)
CN (1) CN105051811A (en)
TW (1) TW201443874A (en)
WO (1) WO2014142200A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6620462B2 (en) * 2015-08-21 2019-12-18 ヤマハ株式会社 Synthetic speech editing apparatus, synthetic speech editing method and program
KR102168529B1 (en) * 2020-05-29 2020-10-22 주식회사 수퍼톤 Method and apparatus for synthesizing singing voice with artificial neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003108179A (en) * 2001-10-01 2003-04-11 Nippon Telegr & Teleph Corp <Ntt> Method and program for gathering rhythm data for singing voice synthesis and recording medium where the same program is recorded
JPWO2009125710A1 (en) * 2008-04-08 2011-08-04 株式会社エヌ・ティ・ティ・ドコモ Media processing server apparatus and media processing method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003255974A (en) * 2002-02-28 2003-09-10 Yamaha Corp Singing synthesis device, method and program
JP2004264676A (en) * 2003-03-03 2004-09-24 Yamaha Corp Apparatus and program for singing synthesis
JP2006330615A (en) * 2005-05-30 2006-12-07 Yamaha Corp Device and program for synthesizing singing
JP2008165130A (en) * 2007-01-05 2008-07-17 Yamaha Corp Singing sound synthesizing device and program
JP2009244607A (en) * 2008-03-31 2009-10-22 Daiichikosho Co Ltd Duet part singing generation system
JP2009258291A (en) * 2008-04-15 2009-11-05 Yamaha Corp Sound data processing device and program
JP2011013454A (en) * 2009-07-02 2011-01-20 Yamaha Corp Apparatus for creating singing synthesizing database, and pitch curve generation apparatus
JP2011095397A (en) * 2009-10-28 2011-05-12 Yamaha Corp Sound synthesizing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HIDEKI KAWAHARA ET AL.: "Perceptual study on design reuse of voice identity and singing style based on singing voice morphing", INTERACTION 2007 YOKOSHU, March 2007 (2007-03-01), Retrieved from the Internet <URL:http://www.interaction-ipsj.org/archives/paper2007/aural/0043/paper0043.pdf> [retrieved on 20140604] *
TAKESHI SAITO ET AL.: "Utagoe no Kojinsei Chikaku ni Kiyo suru Onkyo Tokucho no Kento", REPORT OF THE 2007 AUTUMN MEETING, THE ACOUSTICAL SOCIETY OF JAPAN CD-ROM, September 2007 (2007-09-01), pages 601 - 602 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016194622A (en) * 2015-04-01 2016-11-17 株式会社エクシング Karaoke device and karaoke program
EP3537432A4 (en) * 2016-11-07 2020-06-03 Yamaha Corporation Voice synthesis method
US11410637B2 (en) 2016-11-07 2022-08-09 Yamaha Corporation Voice synthesis method, voice synthesis device, and storage medium

Also Published As

Publication number Publication date
KR20150118974A (en) 2015-10-23
JP2014178620A (en) 2014-09-25
TW201443874A (en) 2014-11-16
CN105051811A (en) 2015-11-11

Similar Documents

Publication Publication Date Title
KR101094687B1 (en) The Karaoke system which has a song studying function
JP4207902B2 (en) Speech synthesis apparatus and program
JP2015034920A (en) Voice analysis device
JP4645241B2 (en) Voice processing apparatus and program
CN111542875A (en) Speech synthesis method, speech synthesis device, and program
CN112331222A (en) Method, system, equipment and storage medium for converting song tone
JP2019061135A (en) Electronic musical instrument, musical sound generating method of electronic musical instrument, and program
WO2014142200A1 (en) Voice processing device
JP4479701B2 (en) Music practice support device, dynamic time alignment module and program
WO2020095950A1 (en) Information processing method and information processing system
Lerch Software-based extraction of objective parameters from music performances
JP6288197B2 (en) Evaluation apparatus and program
JP6102076B2 (en) Evaluation device
JP6657713B2 (en) Sound processing device and sound processing method
JP6737320B2 (en) Sound processing method, sound processing system and program
JP4491743B2 (en) Karaoke equipment
JP2002073064A (en) Voice processor, voice processing method and information recording medium
KR20090023912A (en) Music data processing system
JP2008197350A (en) Musical signal creating device and karaoke device
JP6252420B2 (en) Speech synthesis apparatus and speech synthesis system
JP5618743B2 (en) Singing voice evaluation device
JP2008040258A (en) Musical piece practice assisting device, dynamic time warping module, and program
JP6365483B2 (en) Karaoke device, karaoke system, and program
JP5953743B2 (en) Speech synthesis apparatus and program
JP5805474B2 (en) Voice evaluation apparatus, voice evaluation method, and program

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase (Ref document number: 201480014605.4; Country of ref document: CN)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14762388; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20157024316; Country of ref document: KR; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14762388; Country of ref document: EP; Kind code of ref document: A1)