WO2014142200A1 - Voice processing device - Google Patents

Voice processing device

Info

Publication number
WO2014142200A1
Authority
WO
WIPO (PCT)
Prior art keywords: singing, expression, data, song, voice
Application number
PCT/JP2014/056570
Other languages: French (fr), Japanese (ja)
Inventors: 隆一 成山, 克己 石川, 松本 秀一
Original Assignee: ヤマハ株式会社 (Yamaha Corporation)
Application filed by ヤマハ株式会社 (Yamaha Corporation)
Priority to CN201480014605.4A (published as CN105051811A)
Priority to KR1020157024316A (published as KR20150118974A)
Publication of WO2014142200A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0091 - Means for obtaining special acoustic effects
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/155 - Musical effects
    • G10H2210/195 - Modulation effects, i.e. smooth non-discontinuous variations over a time interval, e.g. within a note, melody or musical transition, of any sound parameter, e.g. amplitude, pitch, spectral response, playback speed
    • G10H2210/201 - Vibrato, i.e. rapid, repetitive and smooth variation of amplitude, pitch or timbre within a note or chord
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 - Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315 - Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455 - Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis

Definitions

  • The present invention relates to a technique for controlling the singing expression of a singing voice.
  • Patent Document 1 discloses a technique for collecting segment data used for segment-concatenation singing synthesis. A singing voice of arbitrary lyrics can be synthesized by appropriately selecting and interconnecting the segment data collected by the technique of Patent Document 1.
  • In view of the above, an object of the present invention is to generate singing voices with various singing expressions.
  • To achieve this object, the speech processing apparatus includes an expression selection unit that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing unit that applies the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice.
  • The expression selection unit may select first singing expression data and second singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expression indicated by the first singing expression data to a first section of the singing voice and the singing expression indicated by the second singing expression data to a second section of the singing voice different from the first section.
  • The expression selection unit may also select two or more singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expressions indicated by each of the selected singing expression data to the same specific section of the singing voice in an overlapping manner.
  • Since a plurality of singing expressions (typically different types of singing expressions) can thus be applied to the singing voice in an overlapping manner, the effect of generating singing voices with various singing expressions is particularly pronounced.
  • The apparatus may include a storage unit that stores attribute data related to each singing expression in association with the singing expression data of that expression, and the expression selection unit may select singing expression data from the storage unit by referring to the attribute data of each singing expression data. Since attribute data is associated with each singing expression data, the singing expression data of the expression to be applied to the singing voice can be selected (searched for) by referring to the attribute data.
  • The expression selection unit may select singing expression data in accordance with an instruction from the user. Since the singing expression data is then chosen according to the user's instruction, various singing voices reflecting the user's intention and preference can be generated.
  • The expression providing unit may apply the singing expression indicated by the selected singing expression data to a specific section of the singing voice designated by an instruction from the user. Since the expression is applied to a user-designated section, this likewise allows diverse singing voices reflecting the user's intention and preference.
  • Conventionally, a singing voice is evaluated by comparing the transitions of its pitch and volume with those of a reference (exemplary) singing voice prepared in advance. However, the evaluation of an actual singing performance depends not only on the accuracy of pitch and volume but also on the skill of the singing expression.
  • Accordingly, the speech processing apparatus of the present invention may include a singing evaluation unit that evaluates the singing voice according to an evaluation value that indicates the evaluation of a singing expression and that corresponds to the singing expression data, among the plurality of singing expression data, whose singing expression is similar to the singing voice. Since the singing voice is evaluated according to the evaluation value corresponding to similar singing expression data, the singing voice can be evaluated appropriately from the viewpoint of skill in singing expression.
  • The singing evaluation unit may select, for each of a plurality of target sections of the singing voice, the singing expression data whose singing expression is similar to that of the target section, and evaluate the singing voice according to the evaluation values corresponding to the selected singing expression data. In this way, specific target sections of the singing voice can be evaluated with priority; the target section can also be the entire section of the audio signal (the entire piece of music).
  • The speech processing apparatus may include a storage unit that stores, for a plurality of different singing expressions, singing expression data indicating the expression and an evaluation value indicating its evaluation, and the singing evaluation unit may evaluate the singing voice according to the evaluation value, stored in the storage unit, that corresponds to the singing expression data similar to the singing voice. The singing voice is thus evaluated from the viewpoint of whether it resembles the singing expressions registered in the storage unit.
  • The present invention is also implemented as a speech processing method that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions and applies the singing expression indicated by the selected singing expression data to a specific section of the singing voice.
  • The audio processing device may be realized by dedicated hardware (an electronic circuit such as a DSP (Digital Signal Processor) dedicated to singing voice processing), or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program.
  • The program according to the first aspect of the present invention causes a computer to execute an expression selection process for selecting singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing process for applying the singing expression indicated by the selected singing expression data to a specific section of the singing voice.
  • The program according to the second aspect of the present invention causes a computer, comprising a storage unit that stores singing expression data and evaluation values for a plurality of different singing expressions, to execute a singing evaluation process that evaluates the singing voice according to the evaluation value corresponding to the singing expression data whose singing expression is similar to the singing voice.
  • The program according to each of the above aspects can be provided in a form stored on a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known recording medium such as a semiconductor recording medium or a magnetic recording medium may also be used. The program of the present invention can also be distributed via a communication network and installed in a computer.
  • FIG. 1 is a block diagram of a speech processing apparatus according to a first embodiment of the present invention. FIG. 2 is a functional block diagram of the elements related to the expression registration process. FIG. 3 is a block diagram of the singing division unit. FIG. 4 is a flowchart of the expression registration process. FIG. 5 is a functional block diagram of the elements related to the expression providing process. FIG. 6 is a flowchart of the expression providing process. FIG. 7 is an explanatory diagram of a specific example (application of vibrato) of the expression providing process. FIGS. 8 and 9 are explanatory diagrams of the expression providing process. FIG. 10 is a functional block diagram of the elements related to the singing evaluation process of the second embodiment. FIG. 11 is a flowchart of the singing evaluation process. FIG. 12 is a block diagram of an audio processing apparatus according to a modification.
  • FIG. 1 is a block diagram of a speech processing apparatus 100 according to the first embodiment of the present invention.
  • the sound processing device 100 is realized by a computer system including an arithmetic processing device 10, a storage device 12, a sound collecting device 14, an input device 16, and a sound emitting device 18.
  • the arithmetic processing device 10 controls each element of the speech processing device 100 by executing a program stored in the storage device 12.
  • the storage device 12 stores a program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10.
  • a known recording medium such as a semiconductor recording medium or a magnetic recording medium or a combination of a plurality of types of recording media is arbitrarily employed as the storage device 12.
  • A configuration may also be adopted in which the storage device 12 is installed in an external device (for example, an external server device) separate from the speech processing device 100, and the speech processing device 100 writes and reads information to and from the storage device 12 via a communication network such as the Internet. That is, the storage device 12 is not an essential element of the voice processing device 100.
  • The storage device 12 of the first embodiment stores a plurality of audio signals X indicating the time waveforms of different singing voices (for example, singing voices of different singers). Each audio signal X is prepared in advance by recording a singing voice singing a piece of music. The storage device 12 also stores a plurality of singing expression data DS indicating different singing expressions and a plurality of attribute data DA related to the singing expression indicated by each singing expression data DS.
  • A singing expression is a characteristic manner of singing (a way of singing peculiar to the singer). Singing expression data DS are stored in the storage device 12 for a plurality of types of singing expressions extracted from the singing voices of different singers, and attribute data DA is associated with each of the plurality of singing expression data DS.
  • The singing expression data DS specifies various feature quantities related to the musical expression of the singing voice, for example the pitch or volume (or their distribution range), features of the frequency spectrum (for example, the spectrum within a specific band), the frequency or intensity of a formant of a specific order, the intensity ratio between the harmonic components and the fundamental component, the intensity ratio between harmonic and non-harmonic components, or MFCCs (Mel-Frequency Cepstrum Coefficients).
  • The singing expressions exemplified above are tendencies of the singing voice over relatively short times, but a configuration in which the singing expression data DS specifies longer-term tendencies of the singing voice, such as the temporal variation of pitch or volume and various singing techniques (for example, vibrato, fall, and long tone), is also suitable.
  • The attribute data DA of each singing expression is information (metadata) related to the singer and the music of the singing voice, and is used for searching the singing expression data DS. Specifically, the attribute data DA specifies information on the singer of each singing expression (for example, name, age, birthplace, gender, race, native language, and vocal range) and information on the music sung with each singing expression (for example, song title, composer, lyricist, genre, tempo, key, chords, range, and language). The attribute data DA can also specify words expressing the impression or atmosphere of the singing voice (for example, words such as “rhythmic” and “sweet”).
  • The attribute data DA of the first embodiment further includes an evaluation value Q (an index of the skill of the singing expression of the singing expression data DS) corresponding to the result of evaluating the singing voice sung with each singing expression. The evaluation value Q may be calculated by a known singing evaluation process, or may reflect evaluations by users other than the singer.
  • The items specified by the attribute data DA are not limited to the above examples. For example, the attribute data DA can specify in which section of the musical structure into which the music is divided (for example, phrases such as the A melody, chorus, and B melody) each singing expression was sung.
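  • For concreteness, the stored data model described above can be sketched as follows. This is a minimal illustration in Python; the record layout and field names (pitch_curve, singer_gender, evaluation_q, and so on) are assumptions made for the sketch, not names taken from this publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SingingExpressionData:
    """DS: feature quantities characterizing one singing expression."""
    pitch_curve: List[float]    # pitch trajectory within the unit section
    volume_curve: List[float]   # volume trajectory within the unit section
    mfcc: List[List[float]]     # MFCC frames (timbre-related features)
    formant_freqs: List[float]  # frequencies of low-order formants

@dataclass
class AttributeData:
    """DA: metadata used to search for DS, including the evaluation value Q."""
    singer_name: Optional[str] = None
    singer_gender: Optional[str] = None
    song_title: Optional[str] = None
    genre: Optional[str] = None
    impression_words: List[str] = field(default_factory=list)  # e.g. "sweet"
    section_label: Optional[str] = None  # e.g. "chorus", "A melody"
    evaluation_q: float = 0.0            # skill evaluation value Q

@dataclass
class RegisteredExpression:
    """One stored entry: DS and DA associated for a common unit section."""
    ds: SingingExpressionData
    da: AttributeData
```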
  • the sound collection device 14 is a device (microphone) that collects ambient sounds.
  • The sound collection device 14 of the first embodiment generates an audio signal R by collecting the singing voice of a singer singing a piece of music.
  • An A / D converter that converts the audio signal R from analog to digital is not shown for convenience.
  • A configuration in which the audio signal R is stored in the storage device 12 (in which case the sound collection device 14 can be omitted) is also suitable.
  • the input device 16 is an operation device that receives an instruction from the user to the voice processing device 100, and includes, for example, a plurality of operators that can be operated by the user.
  • an operation panel installed in the casing of the voice processing device 100 or a remote control device separate from the voice processing device 100 is employed as the input device 16.
  • The arithmetic processing device 10 executes various control and arithmetic processes by executing the programs stored in the storage device 12. Specifically, the arithmetic processing device 10 executes a process of extracting singing expression data DS by analyzing the audio signal R supplied from the sound collection device 14 and storing it in the storage device 12 (hereinafter, "expression registration process"), and a process of generating an audio signal Y by applying the singing expression indicated by each singing expression data DS accumulated in the expression registration process to an audio signal X in the storage device 12 (hereinafter, "expression providing process").
  • the audio signal Y is an acoustic signal in which the singing expression of the audio signal X matches or resembles the singing expression of the singing expression data DS while maintaining the pronunciation content (lyrics) of the audio signal X.
  • one of the expression registration process and the expression providing process is selectively executed in accordance with an instruction from the user to the input device 16.
  • the sound emitting device 18 (for example, a speaker or a headphone) in FIG. 1 reproduces sound corresponding to the audio signal Y generated by the arithmetic processing device 10 in the expression providing process.
  • a D / A converter that converts the audio signal Y from digital to analog and an amplifier that amplifies the audio signal Y are omitted for the sake of convenience.
  • FIG. 2 is a functional configuration diagram of elements related to the expression registration process in the voice processing apparatus 100.
  • By executing a program (expression registration program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression registration process shown in FIG. 2 (an analysis processing unit 20, a singing division unit 22, a singing evaluation unit 24, a singing analysis unit 26, and an attribute acquisition unit 28).
  • a configuration in which each function of FIG. 2 is distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, DSP) realizes a part of the functions illustrated in FIG. 2 may be employed.
  • the analysis processing unit 20 in FIG. 2 analyzes the audio signal R supplied from the sound collection device 14.
  • the analysis processing unit 20 includes a music structure analysis unit 20A, a singing technique analysis unit 20B, and a voice quality analysis unit 20C.
  • the music structure analysis unit 20A analyzes a section (for example, each phrase such as A melody, chorus, and B melody) on the music structure of the music corresponding to the audio signal R.
  • The singing technique analysis unit 20B detects various singing techniques such as vibrato (finely oscillating the pitch), scooping (approaching the target pitch from a lower pitch), and fall (reaching the target pitch from a pitch exceeding it).
  • the voice quality analysis unit 20C analyzes the voice quality of the singing voice (for example, the intensity ratio between the harmonic component and the fundamental component and the intensity ratio between the harmonic component and the non-harmonic component).
  • The singing division unit 22 of the first embodiment defines each unit section of the audio signal R according to the music structure, the singing techniques, and the voice quality. Specifically, the singing division unit 22 divides the audio signal R into unit sections delimited by the end points of the sections of the music structure analyzed by the music structure analysis unit 20A, the end points of the sections in which the singing technique analysis unit 20B detects the various singing techniques, and the time points at which the voice quality analyzed by the voice quality analysis unit 20C fluctuates. Note that the method of dividing the audio signal R into unit sections is not limited to these examples.
  • For example, the audio signal R can be divided using sections designated by the user through operation of the input device 16 as unit sections. A configuration in which the audio signal R is divided into unit sections set at random on the time axis, or divided according to the evaluation value Q calculated by the singing evaluation unit 24 (for example, a configuration in which unit-section boundaries are defined at time points where the evaluation value Q fluctuates), may also be adopted. A sketch of pooling such boundary candidates into unit sections follows.
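  • A minimal sketch of the division step described above, assuming each analysis simply reports boundary time points in seconds (the function name and argument layout are illustrative):

```python
def divide_into_unit_sections(duration: float,
                              structure_bounds: list,
                              technique_bounds: list,
                              voice_quality_bounds: list) -> list:
    """Return (start, end) unit sections of an audio signal R of the given
    duration, cut at every boundary reported by the three analyses."""
    bounds = {0.0, duration}
    bounds.update(structure_bounds)      # ends of A melody, chorus, ...
    bounds.update(technique_bounds)      # ends of detected vibrato, scoop, fall
    bounds.update(voice_quality_bounds)  # points where voice quality changes
    ordered = sorted(b for b in bounds if 0.0 <= b <= duration)
    return list(zip(ordered[:-1], ordered[1:]))

# e.g. divide_into_unit_sections(10.0, [4.0], [2.5], [7.0])
#  -> [(0.0, 2.5), (2.5, 4.0), (4.0, 7.0), (7.0, 10.0)]
```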
  • The singing evaluation unit 24 evaluates the skill of the singing indicated by the audio signal R supplied from the sound collection device 14. Specifically, the singing evaluation unit 24 sequentially calculates an evaluation value Q assessing the skill of the singing of the audio signal R for each unit section defined by the singing division unit 22. Any known singing evaluation process may be employed for calculating the evaluation value Q. Note that the singing techniques analyzed by the singing technique analysis unit 20B and the voice quality analyzed by the voice quality analysis unit 20C can also be used in the evaluation by the singing evaluation unit 24.
  • The singing analysis unit 26 of FIG. 2 generates the singing expression data DS for each unit section by analyzing the audio signal R. Specifically, the singing analysis unit 26 extracts acoustic feature quantities that affect the singing expression, such as pitch and volume, from the audio signal R, and generates singing expression data DS indicating the short-term or long-term tendency of each feature quantity (that is, the singing expression). A known acoustic analysis technique (for example, the techniques disclosed in Japanese Patent Application Laid-Open Nos. 2011-013454 and 2011-028230) may be arbitrarily employed for extracting the singing expression.
  • Typically, one singing expression data DS is generated for each unit section, but one singing expression data DS can also be generated from the feature quantities of a plurality of different unit sections. For example, a configuration in which the singing expression data DS is generated by averaging the feature quantities of a plurality of unit sections whose attribute data DA match or are similar, or a configuration in which the singing expression data DS is generated by weighted addition of the feature quantities over a plurality of unit sections using weight values corresponding to the evaluation value Q assigned to each unit section by the singing evaluation unit 24, may be adopted.
  • The attribute acquisition unit 28 generates attribute data DA for each unit section defined by the singing division unit 22. Specifically, the attribute acquisition unit 28 registers in the attribute data DA the various types of information that the user designates by operating the input device 16, and includes in the attribute data DA of each unit section the evaluation value Q calculated for that section by the singing evaluation unit 24 (for example, the average of the evaluation values within the unit section). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 for a common unit section are stored in the storage device 12 in association with each other.
  • The expression registration process exemplified above is repeated for the audio signals R of a plurality of different singing voices, so that singing expression data DS and attribute data DA are accumulated in the storage device 12 for each of the many types of singing expressions extracted from the singing voices of the various singers. That is, a database of diverse singing expressions (expressions of different singers and of different types) is constructed in the storage device 12.
  • FIG. 4 is a flowchart of the expression registration process.
  • When the expression registration process starts, the analysis processing unit 20 analyzes the audio signal R supplied from the sound collection device 14 (SA2). The singing division unit 22 divides the audio signal R into unit sections according to the result of the analysis by the analysis processing unit 20 (SA3), and the singing analysis unit 26 analyzes the audio signal R to generate singing expression data DS for each unit section (SA4). The singing evaluation unit 24 calculates, for each unit section, an evaluation value Q corresponding to the skill of the singing indicated by the audio signal R (SA5), and the attribute acquisition unit 28 generates, for each unit section, attribute data DA including the evaluation value Q calculated by the singing evaluation unit 24 (SA6). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 are stored in the storage device 12 for each unit section (SA7).
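  • Read as pseudocode, steps SA2 to SA7 amount to the loop below. This is only a sketch: the analysis, division, feature-extraction, and evaluation steps are injected as callables because the publication does not fix their implementations, and all names are illustrative.

```python
from typing import Callable, List, Tuple

Section = Tuple[float, float]

def expression_registration(
    signal_r,
    analyze: Callable,                               # SA2: structure / technique / voice quality
    divide: Callable[[object], List[Section]],       # SA3: unit sections from the analysis
    extract_ds: Callable[[object, Section], dict],   # SA4: singing expression data DS
    evaluate_q: Callable[[object, Section], float],  # SA5: evaluation value Q
    user_info: dict,                                 # user-designated metadata
    storage: list,                                   # stands in for the storage device 12
) -> None:
    """Steps SA2-SA7 of the expression registration process."""
    analysis = analyze(signal_r)                                           # SA2
    for section in divide(analysis):                                       # SA3
        ds = extract_ds(signal_r, section)                                 # SA4
        da = {**user_info, "evaluation_q": evaluate_q(signal_r, section)}  # SA5-SA6
        storage.append({"section": section, "ds": ds, "da": da})           # SA7
```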
  • The singing expressions specified by the singing expression data DS accumulated in the storage device 12 by the expression registration process described above are applied to the audio signal X by the expression providing process described below.
  • FIG. 5 is a functional configuration diagram of elements related to the expression providing process in the audio processing apparatus 100.
  • By executing a program (expression providing program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression providing process shown in FIG. 5 (a singing selection unit 32, a section designation unit 34, an expression selection unit 36, and an expression providing unit 38).
  • a configuration in which each function of FIG. 5 is distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, DSP) executes a part of the functions illustrated in FIG. 5 may be employed.
  • the singing selection unit 32 selects one of the plurality of audio signals X stored in the storage device 12 (hereinafter referred to as “selected audio signal X”). For example, the singing selection unit 32 selects the selected sound signal X from the plurality of sound signals X in the storage device 12 in accordance with an instruction (selection instruction for the sound signal X) from the user to the input device 16.
  • The section designation unit 34 designates one or more sections (hereinafter, "target sections") of the selected audio signal X chosen by the singing selection unit 32 to which the singing expression of singing expression data DS is to be applied.
  • the section specifying unit 34 specifies each target section in accordance with an instruction from the user to the input device 16.
  • the section specifying unit 34 defines a section between two points specified by the user on the time axis (for example, on the waveform of the selected audio signal X) by operating the input device 16 as a target section.
  • a plurality of target sections specified by the section specifying unit 34 may overlap each other on the time axis. Note that it is also possible to specify the entire section (the entire music piece) of the selected audio signal X as the target section.
  • The expression selection unit 36 of FIG. 5 selects the singing expression data DS actually applied in the expression providing process (hereinafter, "target expression data DS") from among the plurality of singing expression data DS stored in the storage device 12, sequentially for each of the target sections designated by the section designation unit 34. The expression selection unit 36 of the first embodiment selects the target expression data DS from the plurality of singing expression data DS by a search process using the attribute data DA stored in the storage device 12 in association with each singing expression data DS. The user can designate a search condition (for example, a search term) for the target expression data DS for each target section by operating the input device 16 as appropriate.
  • The expression selection unit 36 selects, as the target expression data DS for each target section, the singing expression data DS whose attribute data DA match the search condition designated by the user from among the plurality of singing expression data DS in the storage device 12. For example, when a search condition on the singer (for example, age or gender) is designated, the target expression data DS corresponding to the attribute data DA of a matching singer (that is, the singing expression of a singer matching the condition) is selected; when a search condition on the music is designated, the target expression data DS corresponding to matching music attribute data DA (that is, a singing expression of a matching song) is selected; and when a search condition on the evaluation value Q (for example, a numerical range) is designated, the target expression data DS corresponding to attribute data DA whose evaluation value Q matches the condition (that is, a singing expression of the skill level intended by the user) is selected.
  • As understood from the above, the expression selection unit 36 of the first embodiment is comprehensively expressed as an element that selects the singing expression data DS (target expression data DS) in accordance with an instruction from the user.
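  • The search process can be pictured as a simple filter over the stored attribute data DA. A minimal sketch, assuming DA records are dictionaries (as in the earlier sketch) and the search condition is a mapping of required attribute values plus an optional numeric range for the evaluation value Q; the names are illustrative:

```python
def select_target_expression(entries, condition, q_range=None):
    """Return the stored entries whose attribute data DA match the user's
    search condition; `entries` is a list of {"ds": ..., "da": ...}."""
    matches = []
    for entry in entries:
        da = entry["da"]
        if any(da.get(key) != value for key, value in condition.items()):
            continue                                 # some attribute differs
        if q_range is not None:
            q = da.get("evaluation_q", 0.0)
            if not (q_range[0] <= q <= q_range[1]):
                continue                             # evaluation value Q out of range
        matches.append(entry)
    return matches

# e.g. select_target_expression(db, {"singer_gender": "female", "genre": "pop"},
#                               q_range=(80, 100))
```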
  • The expression providing unit 38 generates the audio signal Y by applying, to each of the target sections designated by the section designation unit 34 in the selected audio signal X, the singing expression of the target expression data DS selected by the expression selection unit 36 for that section. That is, a singing expression corresponding to an instruction from the user (the designation of a search condition) is applied to each target section. A known technique may be arbitrarily employed for applying the singing expression of the target expression data DS to the selected audio signal X. In addition to a configuration that replaces the singing expression of the selected audio signal X with the singing expression of the target expression data DS (a configuration in which the original expression of the selected audio signal X does not remain in the audio signal Y), a configuration in which the singing expression of the target expression data DS is applied cumulatively to the singing expression of the selected audio signal X (for example, a configuration in which both expressions are reflected in the audio signal Y) may also be adopted.
  • FIG. 6 is a flowchart of the expression providing process.
  • The singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SB2), and the section designation unit 34 designates one or more target sections of the selected audio signal X (SB3). The expression selection unit 36 selects the target expression data DS from the plurality of singing expression data DS stored in the storage device 12 (SB4), and the expression providing unit 38 generates the audio signal Y by applying the singing expression of the target expression data DS to each target section of the selected audio signal X (SB5). The audio signal Y generated by the expression providing unit 38 is reproduced by the sound emitting device 18 (SB6).
  • FIG. 7 is an explanatory diagram of a specific example of the expression providing process to which the singing expression data DS indicating vibrato is applied.
  • FIG. 7 illustrates the temporal change of the pitch of the selected audio signal X and of a plurality of singing expression data DS (DS[1] to DS[4]). Each singing expression data DS was generated by the expression registration process from an audio signal R containing the singing voice of a different singer, so the vibrato represented by each singing expression data DS (DS[1] to DS[4]) differs in characteristics such as the period (speed) and width (depth) of the pitch fluctuation. When a target section of the selected audio signal X is designated according to an instruction from the user (SB3) and, for example, the target expression data DS[3] is selected from the plurality of singing expression data DS according to an instruction from the user (SB4), the expression providing process generates the audio signal Y in which the vibrato indicated by the target expression data DS[3] is imparted to the target section of the selected audio signal X (SB5). In this way, even for the audio signal X of a singing voice sung without vibrato (for example, the singing voice of a singer who is not good at vibrato), the vibrato of the desired target expression data DS can be imparted to the desired target section.
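  • As a numerical illustration of imparting the vibrato of target expression data DS to a target section, the sketch below superimposes a sinusoidal modulation, whose rate and depth would be taken from the DS, onto a frame-based pitch curve. The publication leaves the actual signal-processing technique open ("a known technique is arbitrarily employed"), so this is just one plausible realization and all names are assumptions.

```python
import math

def apply_vibrato(pitch_curve, start, end, rate_hz, depth_cents, frame_rate=100.0):
    """Impart vibrato to pitch_curve (in cents) over start..end seconds by
    adding a sinusoid with the DS's fluctuation rate and depth."""
    out = list(pitch_curve)
    i0, i1 = int(start * frame_rate), int(end * frame_rate)
    for i in range(i0, min(i1, len(out))):
        t = (i - i0) / frame_rate
        out[i] += depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
    return out

# e.g. impart a 5.5 Hz, +/-50 cent vibrato (as DS[3] might specify) to 1.0-2.0 s:
# y_pitch = apply_vibrato(x_pitch, 1.0, 2.0, rate_hz=5.5, depth_cents=50.0)
```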
  • The manner in which the user selects the target expression data DS from the plurality of singing expression data DS is arbitrary. For example, a configuration is preferable in which a predetermined singing voice to which the singing expression of each singing expression data DS has been applied is reproduced from the sound emitting device 18 and auditioned by the user, who then selects the target expression data DS by operating the input device 16 (for example, a button or a touch panel).
  • Suppose the expression selection unit 36 selects target expression data DS1 for a target section S1 of the selected audio signal X and target expression data DS2 for a target section S2 different from S1. The expression providing unit 38 applies the singing expression E1 indicated by the target expression data DS1 to the target section S1 and the singing expression E2 indicated by the target expression data DS2 to the target section S2. When the target section S1 and the target section S2 overlap (for example, when S2 is included in S1), the overlapping section (that is, the target section S2) receives both the singing expression E1 of the target expression data DS1 and the singing expression E2 of the target expression data DS2. That is, a plurality of (typically different types of) singing expressions are applied to a specific section of the selected audio signal X in an overlapping manner; for example, both a singing expression E1 related to pitch fluctuation and a singing expression E2 related to volume fluctuation are applied to the target section S2, as sketched below.
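  • The overlapping application can be sketched as two independent modulations: a pitch expression E1 over section S1 and a volume expression E2 over section S2, with frames in the overlap receiving both. A constant offset and gain stand in for the actual expression curves; all names are illustrative.

```python
def apply_expressions(pitch, volume, s1, s2, e1_pitch_offset, e2_gain, frame_rate=100.0):
    """Apply pitch expression E1 over section s1 and volume expression E2
    over section s2; frames in the overlap of s1 and s2 receive both."""
    pitch, volume = list(pitch), list(volume)
    a1, b1 = int(s1[0] * frame_rate), int(s1[1] * frame_rate)
    a2, b2 = int(s2[0] * frame_rate), int(s2[1] * frame_rate)
    for i in range(a1, min(b1, len(pitch))):
        pitch[i] += e1_pitch_offset   # E1: pitch-fluctuation expression
    for i in range(a2, min(b2, len(volume))):
        volume[i] *= e2_gain          # E2: volume-fluctuation expression
    return pitch, volume
```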
  • the sound signal Y generated by the above processing is supplied to the sound emitting device 18 and reproduced as sound.
  • As described above, each singing expression of the plurality of singing expression data DS indicating different singing expressions is selectively applied to the target sections of the selected audio signal X, so singing voices (the audio signal Y) with more varied singing expressions can be generated than with the technique of Patent Document 1. In the first embodiment, since the target section to which a singing expression is applied can be any of a plurality of sections of the selected audio signal X, this effect is particularly pronounced. Moreover, since a plurality of (plural types of) singing expressions can be applied redundantly to a target section of the selected audio signal X (FIG. 9), the effect is also particularly pronounced compared with a configuration in which the singing expression applied to a target section is limited to one type. However, a configuration in which the target section is limited to one section of the selected audio signal X and a configuration in which the singing expression applied to a target section is limited to one type are also included within the scope of the present invention. Furthermore, since the target section of the selected audio signal X and the search condition for the attribute data DA are both designated according to instructions from the user, various singing voices reflecting the user's intention and preference can be generated.
  • Second Embodiment: A second embodiment of the present invention will now be described. In the first embodiment, the plurality of singing expression data DS stored in the storage device 12 are used to adjust the singing expression of the audio signal X; in the second embodiment, they are used to evaluate the audio signal X. Elements whose operation and function are the same as in the first embodiment are denoted by the reference numerals used in the description of the first embodiment, and their detailed description is omitted as appropriate.
  • FIG. 10 is a functional configuration diagram of elements related to a process of evaluating the audio signal X (hereinafter referred to as “singing evaluation process”) in the audio processing apparatus 100 of the second embodiment.
  • the storage device 12 of the second embodiment stores a plurality of sets of song expression data DS and attribute data DA generated by the same expression registration process as that of the first embodiment.
  • As described for the first embodiment, the attribute data DA corresponding to each singing expression data DS includes the evaluation value Q (an index of the skill of the singing expression of the singing expression data DS) calculated by the singing evaluation unit 24 of FIG. 2.
  • By executing a program (singing evaluation program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the singing evaluation process shown in FIG. 10 (a singing selection unit 42, a section designation unit 44, and a singing evaluation unit 46). In the second embodiment, the expression providing process of the first embodiment and the singing evaluation process detailed below are executed selectively; the expression providing process may also be omitted. Note that a configuration in which the functions of FIG. 10 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 10, may also be adopted.
  • The singing selection unit 42 of FIG. 10 selects the selected audio signal X to be evaluated from among the plurality of audio signals X stored in the storage device 12. Like the singing selection unit 32 of the first embodiment, the singing selection unit 42 selects the selected audio signal X from the storage device 12 in accordance with an instruction from the user to the input device 16. The section designation unit 44 designates one or more target sections to be evaluated within the selected audio signal X chosen by the singing selection unit 42. Like the section designation unit 34 of the first embodiment, the section designation unit 44 designates each target section in accordance with an instruction from the user to the input device 16. It is also possible to designate the entire section of the selected audio signal X as the target section.
  • The singing evaluation unit 46 of FIG. 10 evaluates the skill of the singing of the selected audio signal X chosen by the singing selection unit 42, using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12. That is, the singing evaluation unit 46 calculates an evaluation value Z of the selected audio signal X according to the evaluation values Q in the attribute data DA corresponding to the singing expression data DS, among the plurality in the storage device 12, whose singing expressions are similar to the target sections of the selected audio signal X. The specific operation of the singing evaluation unit 46 is as follows.
  • First, for each target section of the selected audio signal X, the singing evaluation unit 46 calculates, for each of the plurality of singing expression data DS in the storage device 12, the similarity (correlation or distance) between the singing expression indicated by that singing expression data DS and the singing expression of the target section, and sequentially selects, for each target section, the singing expression data DS with the maximum similarity. A known technique for comparing feature quantities may be arbitrarily employed for calculating the similarity of singing expressions. The singing evaluation unit 46 then calculates the evaluation value Z of the selected audio signal X by weighted addition (or averaging) over the target sections of the evaluation values Q of the attribute data DA corresponding to the singing expression data DS selected for each target section. As understood from the above, the evaluation value Z of the selected audio signal X becomes larger the more the signal contains target sections sung with singing expressions similar to highly evaluated (high-Q) singing expressions.
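  • Put concretely, the evaluation described above is a nearest-neighbour lookup per target section followed by a weighted average. A minimal sketch, assuming a similarity function over feature vectors (for example a correlation measure) and optional per-section weights; none of these names come from the publication.

```python
def evaluate_singing(target_features, db, similarity, weights=None):
    """For each target section's feature vector, find the stored singing
    expression data DS with maximum similarity and take its evaluation
    value Q; the overall score Z is the weighted average of those Q."""
    weights = weights or [1.0] * len(target_features)
    total, norm = 0.0, 0.0
    for feats, w in zip(target_features, weights):
        best = max(db, key=lambda entry: similarity(feats, entry["ds"]))
        total += w * best["da"]["evaluation_q"]
        norm += w
    return total / norm if norm else 0.0
```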
  • the evaluation value Z calculated by the singing evaluation unit 46 is notified to the user by, for example, image display by a display device (not shown) or sound reproduction by the sound emitting device 18.
  • FIG. 11 is a flowchart of the song evaluation process.
  • The singing selection unit 42 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SC2), and the section designation unit 44 designates one or more target sections of the selected audio signal X (SC3). The singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X using the singing expression data DS and attribute data DA stored in the storage device 12 (SC4).
  • the evaluation value Z calculated by the singing evaluation unit 46 is notified to the user (SC5).
  • As described above, in the second embodiment the evaluation value Z of the selected audio signal X is calculated according to the evaluation values Q of the singing expression data DS whose singing expressions are similar to those of the selected audio signal X, so the selected audio signal X can be evaluated appropriately from the viewpoint of skill in singing expression (similarity to the singing expressions registered in the expression registration process). Note that in the second embodiment, information other than the evaluation value Q in the attribute data DA can be omitted; that is, the storage device 12 only needs to store the evaluation value Q in association with each singing expression data DS.
  • The targets of the expression providing process of the first embodiment and the singing evaluation process of the second embodiment are not limited to an audio signal X recorded in advance and stored in the storage device 12. For example, an audio signal X generated by the sound collection device 14, an audio signal X reproduced from a portable or built-in recording medium (for example, a CD), or an audio signal X received from another communication terminal via a communication network (for example, a streaming audio signal) can also be used as the target of the expression providing process or the singing evaluation process.
  • A configuration in which the expression providing process or the singing evaluation process is performed on an audio signal X generated by a known singing synthesis process (for example, a segment-concatenation singing synthesis process) may also be adopted.
  • In the embodiments above, the expression providing process and the singing evaluation process are performed on a recorded audio signal X; however, if each target section on the time axis is designated in advance, it is also possible to execute the expression providing process and the singing evaluation process in real time, in parallel with the supply of the audio signal X.
  • In the embodiments above, one of the plurality of audio signals X is selected as the selected audio signal X, but the selection of the audio signal X (the singing selection unit 32 or the singing selection unit 42) may be omitted; likewise, in a configuration where the target section is determined in advance, the section designation unit 34 can be omitted. The speech processing apparatus that executes the expression providing process is therefore comprehensively expressed as an apparatus comprising the expression selection unit 36, which selects the singing expression data DS to be applied from the plurality of singing expression data DS, and the expression providing unit 38, which applies the singing expression indicated by the singing expression data DS selected by the expression selection unit 36 to a specific section of the singing voice (audio signal X).
  • the target of the expression registration process is not limited to the audio signal R generated by the sound collection device 14.
  • an audio signal R reproduced from a portable or built-in recording medium, or an audio signal R received from another communication terminal via a communication network can be used as an expression registration process target. It is also possible to execute the expression registration process in real time in parallel with the supply of the audio signal R.
  • In the embodiments above, the expression providing process of the first embodiment and the singing evaluation process of the second embodiment are performed on the audio signal of a recorded singing voice, but these processes can also be performed on singing voices in other representations.
  • the expression format of the singing voice to be processed is arbitrary.
  • For example, the singing voice can be expressed by synthesis information (for example, a VSQ-format file) in which the pitch and the pronounced characters (lyrics) are specified in time series for each note. In that case, the expression providing unit 38 of the first embodiment applies the singing expression by the same expression providing process as in the first embodiment while sequentially synthesizing the singing voice specified by the synthesis information, for example by a segment-concatenation speech synthesis process, and the singing evaluation unit 46 of the second embodiment executes the same singing evaluation process as in the second embodiment while synthesizing the singing voice specified by the synthesis information.
  • In the embodiments above, one target expression data DS is selected for each target section, but the expression selection unit 36 can also select a plurality of (typically plural types of) target expression data DS for one target section. In that case, the singing expressions of the plurality of target expression data DS selected by the expression selection unit 36 are applied to the one target section of the selected audio signal X in an overlapping manner. It is also possible to apply to the target section the singing expression of one singing expression data DS obtained by integrating the plurality of target expression data DS selected for that section (for example, singing expression data DS obtained by weighted addition of the plurality of target expression data DS).
  • In the embodiments above, the singing expression data DS is selected according to an instruction from the user (the designation of a search condition), but the method by which the expression selection unit 36 selects the singing expression data DS is arbitrary. For example, a configuration is possible in which the singing voice of the singing expression indicated by each singing expression data DS is reproduced from the sound emitting device 18 for the user to audition, and the expression selection unit 36 selects the singing expression data DS designated by the user in consideration of the audition result. A configuration in which the expression selection unit 36 selects singing expression data DS from the storage device 12 at random, or according to a predetermined rule selected in advance, may also be adopted.
  • the audio signal Y generated by the expression providing unit 38 is supplied to the sound emitting device 18 and reproduced, but the method of outputting the audio signal Y is arbitrary.
  • For example, a configuration in which the audio signal Y generated by the expression providing unit 38 is stored on a specific recording medium (for example, the storage device 12 or a portable recording medium), or a configuration in which the audio signal Y is transmitted from a communication device to another communication terminal, may also be adopted.
  • In the embodiments above, the speech processing apparatus 100 executes both the expression registration process and the expression providing process, but the speech processing apparatus that executes the expression registration process and the one that executes the expression providing process can also be configured separately. In that case, the plurality of singing expression data DS generated in the expression registration process of the registration apparatus are transferred to the expression providing apparatus and applied in its expression providing process. Similarly, the voice processing device that executes the expression registration process and the voice processing device that executes the singing evaluation process can be configured separately.
  • The voice processing device 100 can also be realized by a server device that communicates with a terminal device such as a mobile phone. For example, the voice processing device 100 executes an expression registration process that extracts the singing expression data DS by analyzing an audio signal R received from the terminal device and stores it in the storage device 12, or an expression providing process that transmits to the terminal device an audio signal Y in which the singing expression indicated by the singing expression data DS has been applied to the audio signal X.
  • the present invention can be realized as a voice processing system including a voice processing device (server device) and a terminal device that communicate with each other.
  • the speech processing apparatus 100 of each embodiment described above can be realized as a system (speech processing system) in which each function is distributed to a plurality of apparatuses.
  • In the second embodiment, the singing evaluation unit 46 evaluates the singing of the audio signal X using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12, but the singing evaluation unit 46 may also acquire the evaluation values Q from a device different from the storage device 12 to evaluate the skill of the singing of the audio signal X.
  • DESCRIPTION OF SYMBOLS: 100... voice processing device, 10... arithmetic processing device, 12... storage device, 14... sound collection device, 16... input device, 18... sound emitting device.

Abstract

A storage device (12) stores, for a plurality of different singing expressions, singing expression data DS indicating the singing expression and attribute data DA related to the singing expression. A section designation unit (34) designates each target section of a selected audio signal X according to an instruction from a user. An expression selection unit (36) refers to each attribute data DA to select, for each target section, singing expression data DS according to the instruction from the user (a search condition). An expression providing unit (38) imparts, to each target section of the selected audio signal X, the singing expression indicated by the singing expression data DS selected by the expression selection unit (36) for that target section.

Description

Voice processing device
The present invention relates to a technique for controlling the singing expression of a singing voice.
Various techniques for processing singing voices have been proposed. For example, Patent Document 1 discloses a technique for collecting segment data used for segment-concatenation singing synthesis. A singing voice of arbitrary lyrics can be synthesized by appropriately selecting and interconnecting the segment data collected by the technique of Patent Document 1.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2003-108179
An actual singing voice carries singing expressions (manners of singing) peculiar to the singer. However, since the technique of Patent Document 1 does not take the various singing expressions of the singing voice into account, the singing voice synthesized from the segment data tends to leave an audibly monotonous impression. In view of the above circumstances, an object of the present invention is to generate singing voices with various singing expressions.
In order to solve the above problems, the speech processing apparatus of the present invention includes an expression selection unit that selects singing expression data to be applied from a plurality of singing expression data indicating different singing expressions, and an expression providing unit that applies the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice.
In this aspect, since the singing expression indicated by the singing expression data is applied to the singing voice, singing voices with more varied singing expressions can be generated than with the technique of Patent Document 1. In particular, since the singing expressions indicated by the singing expression data are selectively applied to specific sections of the singing voice, the effect of generating singing voices with various singing expressions is particularly pronounced.
The expression selection unit may select first singing expression data and second singing expression data indicating different singing expressions, and the expression providing unit may apply the singing expression indicated by the first singing expression data to a first section of the singing voice and the singing expression indicated by the second singing expression data to a second section of the singing voice different from the first section.
In this aspect, since a separate singing expression is applied to each section of the singing voice, the effect of generating singing voices with various singing expressions is particularly pronounced.
The expression selection unit may select two or more singing expression data indicating mutually different singing expressions, and the expression imparting unit may impart the singing expressions indicated by each of the two or more singing expression data selected by the expression selection unit to a specific section of the singing voice in an overlapping manner.
In the above aspect, since a plurality of singing expressions (typically singing expressions of different types) are imparted to the singing voice in an overlapping manner, the effect of being able to generate singing voices with diverse singing expressions is especially remarkable.
The device may include a storage unit that stores attribute data related to a singing expression in association with the singing expression data of that singing expression, and the expression selection unit may select singing expression data from the storage unit with reference to the attribute data of each singing expression data.
In the above aspect, since attribute data is associated with each singing expression data, the singing expression data of the singing expression to be imparted to the singing voice can be selected (retrieved) by referring to the attribute data.
The expression selection unit may select singing expression data in accordance with an instruction from a user.
In the above aspect, since singing expression data is selected in accordance with an instruction from the user, there is an advantage that diverse singing voices reflecting the user's intentions and preferences can be generated.
The expression imparting unit may impart the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice designated in accordance with an instruction from the user.
In the above aspect, since the singing expression is imparted to a section of the singing voice designated in accordance with an instruction from the user, there is an advantage that diverse singing voices reflecting the user's intentions and preferences can be generated.
Incidentally, various techniques for evaluating the skill of singing have been proposed. For example, a singing voice is evaluated by comparing the pitch and volume transitions of the singing voice with the pitch and volume transitions of a reference (exemplary) singing voice prepared in advance. However, the evaluation of actual singing depends not only on the accuracy of pitch and volume but also on the skill of the singing expression.
In view of the above circumstances, the voice processing device of the present invention may include a singing evaluation unit that evaluates a singing voice according to an evaluation value that corresponds to the singing expression data of a singing expression similar to the singing voice among a plurality of singing expression data and that indicates an evaluation of that singing expression.
In the above aspect, since the singing voice is evaluated according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice, there is an advantage that the singing voice can be appropriately evaluated from the viewpoint of the skill of the singing expression.
The singing evaluation unit may select, for each of a plurality of target sections of the singing voice, singing expression data of a singing expression similar to the singing expression of that target section, and evaluate the singing voice according to the evaluation values corresponding to the selected singing expression data.
In the above aspect, since the singing voice is evaluated according to the evaluation values corresponding to the singing expression data selected for each of the plurality of target sections of the singing voice, there is an advantage that specific target sections of the singing voice can be weighted in the evaluation. However, the target section may also be the entire section of the audio signal (the whole song).
The voice processing device may include a storage unit that stores, for a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression, and the singing evaluation unit may evaluate the singing voice according to the evaluation value stored in the storage unit that corresponds to the singing expression data of a singing expression similar to the singing voice among the plurality of singing expression data.
In the above aspect, since the singing voice is evaluated according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice, there is an advantage that the singing voice can be appropriately evaluated from the viewpoint of its similarity to the singing expressions registered in the storage unit.
The present invention also provides a voice processing method of selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions and imparting the singing expression indicated by the selected singing expression data to a specific section of a singing voice.
The voice processing device according to each of the above aspects is realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to the processing of singing voices, and is also realized by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. Specifically, a program according to a first aspect of the present invention executes an expression selection process of selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions, and an expression imparting process of imparting the singing expression indicated by the singing expression data selected in the expression selection process to a specific section of a singing voice. A program according to a second aspect of the present invention causes a computer including a storage unit that stores, for a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression, to execute a singing evaluation process of evaluating a singing voice according to the evaluation value corresponding to the singing expression data of a singing expression similar to the singing voice among the plurality of singing expression data.
The program according to each of the above aspects can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium such as a semiconductor recording medium or a magnetic recording medium may be included. The program of the present invention may also be provided in the form of distribution via a communication network and installed on a computer.
<Brief Description of Drawings>
FIG. 1 is a block diagram of a voice processing device according to a first embodiment of the present invention.
FIG. 2 is a functional configuration diagram of elements related to the expression registration process.
FIG. 3 is a block diagram of the singing segmentation unit.
FIG. 4 is a flowchart of the expression registration process.
FIG. 5 is a functional configuration diagram of elements related to the expression imparting process.
FIG. 6 is a flowchart of the expression imparting process.
FIG. 7 is an explanatory diagram of a specific example of the expression imparting process (imparting vibrato).
FIG. 8 is an explanatory diagram of the expression imparting process.
FIG. 9 is an explanatory diagram of the expression imparting process.
FIG. 10 is a functional configuration diagram of elements related to the singing evaluation process of a second embodiment.
FIG. 11 is a flowchart of the singing evaluation process.
FIG. 12 is a block diagram of a voice processing device according to a modification.
<First Embodiment>
FIG. 1 is a block diagram of a voice processing device 100 according to the first embodiment of the present invention. As shown in FIG. 1, the voice processing device 100 is realized as a computer system including an arithmetic processing device 10, a storage device 12, a sound collection device 14, an input device 16, and a sound emitting device 18.
The arithmetic processing device 10 executes a program stored in the storage device 12 to centrally control each element of the voice processing device 100. The storage device 12 stores the program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of plural types of recording media, may be employed as the storage device 12. A configuration is also possible in which the storage device 12 is installed in an external device (for example, an external server device) separate from the voice processing device 100, and the voice processing device 100 writes information to and reads information from the storage device 12 via a communication network such as the Internet. That is, the storage device 12 is not an essential element of the voice processing device 100.
The storage device 12 of the first embodiment stores a plurality of audio signals X indicating the time waveforms of mutually different singing voices (for example, singing voices of different singers). Each of the audio signals X is prepared in advance by recording a singing voice singing a song. The storage device 12 also stores a plurality of singing expression data DS indicating mutually different singing expressions and a plurality of attribute data DA related to the singing expressions indicated by the respective singing expression data DS. A singing expression is a characteristic of singing (a style of delivery or singing method peculiar to the singer). The singing expression data DS are stored in the storage device 12 for plural types of singing expressions extracted from singing voices uttered by different singers, and attribute data DA is associated with each of the singing expression data DS.
The singing expression data DS specifies various feature quantities relating to the musical expressiveness of the singing voice, for example: pitch or volume (distribution range); feature quantities of the frequency spectrum (for example, the spectrum within a specific band); the frequency or intensity of formants of specific orders; feature quantities related to voice quality (for example, the intensity ratio of overtone components to the fundamental component, or the intensity ratio of harmonic components to non-harmonic components); or MFCC (Mel-Frequency Cepstrum Coefficients). While the singing expressions exemplified above are tendencies of the singing voice over a comparatively short time, a configuration in which the singing expression data DS specifies long-term tendencies of the singing voice, such as tendencies of temporal change in pitch or volume and tendencies of various singing techniques (for example, vibrato, fall, and long tone), is also suitable.
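As one concrete illustration of such feature quantities, the following is a minimal sketch of extracting a pitch contour, MFCC, and a crude voice-quality measure from a recorded vocal using the open-source librosa library; the choice of features and the function layout are assumptions made for illustration only and are not part of the embodiment.
```python
# A minimal sketch of feature extraction for singing expression data DS,
# assuming the librosa audio-analysis library; the selected features
# (pitch contour, MFCC, harmonic energy ratio) are illustrative only.
import librosa
import numpy as np

def extract_expression_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=True)
    # Fundamental-frequency (pitch) contour over time.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    # Mel-frequency cepstrum coefficients summarizing timbre.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Crude voice-quality proxy: harmonic vs. non-harmonic energy ratio.
    harmonic, percussive = librosa.effects.hpss(y)
    hnr = float(np.sum(harmonic**2) / (np.sum(percussive**2) + 1e-12))
    return {
        "f0": f0[voiced_flag],           # pitch contour, voiced frames only
        "mfcc_mean": mfcc.mean(axis=1),  # average timbre over the excerpt
        "harmonic_ratio": hnr,
    }
```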
The attribute data DA of each singing expression is information (metadata) related to the singer of the singing voice and to the song, and is used to search the singing expression data DS. Specifically, the attribute data DA specifies information on the singer who sang with each singing expression (for example, name, age, birthplace, gender, race, native language, and vocal range) and information on the song sung with each singing expression (for example, song title, composer, lyricist, genre, tempo, key, chord, range, and language). The attribute data DA can also specify words expressing the impression or atmosphere of the singing voice (for example, words such as "rhythmical" or "sweet"). Furthermore, the attribute data DA of the first embodiment includes an evaluation value Q (an index evaluating the skill of the singing expression of the singing expression data DS) according to the evaluation result of the singing voice sung with each singing expression. For example, an evaluation value Q calculated by a known singing evaluation process, or an evaluation value Q reflecting evaluations by users other than the singer, is included in the attribute data DA. The items specified by the attribute data DA are not limited to the above examples. For example, the attribute data DA can also specify in which section of the musical structure into which the song is divided (for example, phrases such as the A melody, the chorus, and the B melody) the singing expression was sung.
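To make the DS/DA pairing concrete, the following is a minimal sketch of how one registered singing expression and its metadata could be represented in memory; all field names are hypothetical and chosen only to mirror the items enumerated above.
```python
# A hypothetical in-memory representation of one singing expression (DS)
# paired with its attribute data (DA), mirroring the items listed above.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SingingExpressionDS:
    f0_contour: np.ndarray          # pitch tendency within the unit section
    mfcc_mean: np.ndarray           # timbre summary
    technique: str                  # e.g. "vibrato", "fall", "long tone"

@dataclass
class AttributeDataDA:
    singer: dict = field(default_factory=dict)   # name, age, gender, range, ...
    song: dict = field(default_factory=dict)     # title, genre, tempo, key, ...
    impressions: list[str] = field(default_factory=list)  # "rhythmical", "sweet"
    evaluation_q: float = 0.0       # skill evaluation value Q

# One database record: DS and DA stored in association, as in the storage device 12.
record = (
    SingingExpressionDS(np.zeros(100), np.zeros(13), "vibrato"),
    AttributeDataDA({"gender": "female"}, {"genre": "pop"}, ["sweet"], 82.5),
)
```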
The sound collection device 14 of FIG. 1 is a device (microphone) that picks up ambient sound. The sound collection device 14 of the first embodiment generates an audio signal R by picking up the singing voice of a singer singing a song. An A/D converter that converts the audio signal R from analog to digital is omitted from the figure for convenience. A configuration in which the audio signal R is stored in the storage device 12 (in which case the sound collection device 14 may be omitted) is also suitable.
The input device 16 is an operating device that accepts instructions from the user to the voice processing device 100, and is configured to include, for example, a plurality of controls operable by the user. For example, an operation panel installed on the housing of the voice processing device 100, or a remote control device separate from the voice processing device 100, is employed as the input device 16.
The arithmetic processing device 10 executes various control processes and arithmetic processes by executing the program stored in the storage device 12. Specifically, the arithmetic processing device 10 executes a process of extracting singing expression data DS by analyzing the audio signal R supplied from the sound collection device 14 and storing it in the storage device 12 (hereinafter, the "expression registration process"), and a process of generating an audio signal Y by imparting the singing expression indicated by each singing expression data DS stored in the storage device 12 by the expression registration process to an audio signal X in the storage device 12 (hereinafter, the "expression imparting process"). That is, the audio signal Y is an acoustic signal in which the singing expression of the audio signal X is made to match or resemble the singing expression of the singing expression data DS while the pronounced content (lyrics) of the audio signal X is maintained. For example, one of the expression registration process and the expression imparting process is selectively executed in accordance with an instruction from the user via the input device 16. The sound emitting device 18 of FIG. 1 (for example, a loudspeaker or headphones) reproduces the sound corresponding to the audio signal Y generated by the arithmetic processing device 10 in the expression imparting process. A D/A converter that converts the audio signal Y from digital to analog and an amplifier that amplifies the audio signal Y are omitted from the figure for convenience.
<Expression registration process>
FIG. 2 is a functional configuration diagram of the elements of the voice processing device 100 related to the expression registration process. By executing a program (expression registration program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression registration process (an analysis processing unit 20, a singing segmentation unit 22, a singing evaluation unit 24, a singing analysis unit 26, and an attribute acquisition unit 28), as shown in FIG. 2. A configuration in which the functions of FIG. 2 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 2, may also be employed.
The analysis processing unit 20 of FIG. 2 analyzes the audio signal R supplied from the sound collection device 14. As illustrated in FIG. 3, the analysis processing unit 20 of the first embodiment is configured to include a music structure analysis unit 20A, a singing technique analysis unit 20B, and a voice quality analysis unit 20C. The music structure analysis unit 20A analyzes the sections of the musical structure of the song corresponding to the audio signal R (for example, phrases such as the A melody, the chorus, and the B melody). The singing technique analysis unit 20B detects various singing techniques from the audio signal R, such as vibrato (a singing technique of finely fluctuating the pitch), scooping (a singing technique of rising to the target pitch from a pitch below it), and fall (a singing technique of descending to the target pitch from a pitch above it). The voice quality analysis unit 20C analyzes the voice quality of the singing voice (for example, the intensity ratio of overtone components to the fundamental component, or of harmonic components to non-harmonic components).
The singing segmentation unit 22 of FIG. 2 demarcates, within the audio signal R supplied from the sound collection device 14, the sections to which the generation of singing expression data DS is applied (hereinafter, "unit sections"). The singing segmentation unit 22 of the first embodiment demarcates each unit section of the audio signal R according to the musical structure, the singing techniques, and the voice quality. Specifically, the singing segmentation unit 22 divides the audio signal R into unit sections whose boundaries are the end points of the sections of the musical structure analyzed by the music structure analysis unit 20A, the end points of the sections in which the singing technique analysis unit 20B detected the various singing techniques, and the time points at which the voice quality analyzed by the voice quality analysis unit 20C changes. The method of dividing the audio signal R into a plurality of unit sections is not limited to the above examples. For example, the audio signal R may be divided using sections designated by the user through operation of the input device 16 as the unit sections. Configurations are also possible in which the audio signal R is divided into unit sections at time points set randomly on the time axis, or according to the evaluation value Q calculated by the singing evaluation unit 24 (for example, a configuration in which each unit section is demarcated with the time points at which the evaluation value Q fluctuates as boundaries). It is also possible to take the entire section of the audio signal R (the whole song) as a unit section.
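The demarcation step can be pictured as merging boundary candidates from the three analyses into one sorted list of section edges; the sketch below is a minimal illustration under that assumption, with all function and parameter names hypothetical.
```python
# A minimal sketch of unit-section demarcation: boundary candidates from the
# music-structure, singing-technique, and voice-quality analyses are merged,
# sorted, and turned into adjacent (start, end) unit sections in seconds.
def demarcate_unit_sections(structure_bounds, technique_bounds,
                            quality_change_points, total_duration):
    candidates = {0.0, total_duration}
    candidates.update(structure_bounds)       # phrase end points
    candidates.update(technique_bounds)       # end points of detected techniques
    candidates.update(quality_change_points)  # voice-quality change instants
    times = sorted(t for t in candidates if 0.0 <= t <= total_duration)
    return list(zip(times[:-1], times[1:]))

# e.g. demarcate_unit_sections([12.0, 34.5], [20.1, 22.7], [28.0], 60.0)
# -> [(0.0, 12.0), (12.0, 20.1), (20.1, 22.7), (22.7, 28.0), (28.0, 34.5), (34.5, 60.0)]
```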
The singing evaluation unit 24 evaluates the skill of the singing indicated by the audio signal R supplied from the sound collection device 14. Specifically, the singing evaluation unit 24 sequentially calculates, for each unit section demarcated by the singing segmentation unit 22, an evaluation value Q evaluating the skill of the singing of the audio signal R. Any known singing evaluation process may be employed for the calculation of the evaluation value Q by the singing evaluation unit 24. The singing techniques analyzed by the singing technique analysis unit 20B and the voice quality analyzed by the voice quality analysis unit 20C described above may also be applied to the singing evaluation by the singing evaluation unit 24.
The singing analysis unit 26 of FIG. 2 generates singing expression data DS for each unit section by analyzing the audio signal R. Specifically, the singing analysis unit 26 extracts acoustic feature quantities such as pitch and volume (feature quantities that influence the singing expression) from the audio signal R and generates singing expression data DS indicating the short-term or long-term tendency of each feature quantity (that is, the singing expression). Known acoustic analysis techniques (for example, the techniques disclosed in Japanese Unexamined Patent Application Publication Nos. 2011-013454 and 2011-028230) may be employed for extracting the singing expression. It is also possible to generate, from one unit section, a plurality of singing expression data DS corresponding to different types of singing expressions. In the above example, one singing expression data DS is generated for each unit section, but one singing expression data DS may also be generated from feature quantities of plural different unit sections. For example, a configuration may be employed in which singing expression data DS is generated by averaging the feature quantities of plural unit sections whose attribute data DA approximate or match each other, or in which singing expression data DS is generated by the weighted addition of feature quantities over plural unit sections using weights according to the evaluation value Q of each unit section by the singing evaluation unit 24.
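The weighted-addition variant mentioned above can be sketched as follows, assuming each unit section has already yielded a feature vector and an evaluation value Q; the normalization scheme is an assumption made for illustration.
```python
# A minimal sketch of generating one singing expression data DS from the
# feature vectors of several unit sections, weighted by their evaluation
# values Q (sections evaluated as more skilful contribute more).
import numpy as np

def merge_features_by_q(features: list[np.ndarray], q_values: list[float]) -> np.ndarray:
    weights = np.asarray(q_values, dtype=float)
    weights = weights / weights.sum()     # normalize Q values into weights
    stacked = np.stack(features)          # shape: (num_sections, feature_dim)
    return (weights[:, None] * stacked).sum(axis=0)

# e.g. three vibrato sections with Q = 60, 80, 100: the best-rated section
# dominates the merged expression.
merged = merge_features_by_q([np.array([5.5, 0.3]),
                              np.array([6.0, 0.4]),
                              np.array([6.2, 0.5])], [60, 80, 100])
```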
The attribute acquisition unit 28 generates attribute data DA for each unit section demarcated by the singing segmentation unit 22. Specifically, the attribute acquisition unit 28 registers in the attribute data DA the various items of information designated by the user through operation of the input device 16. The attribute acquisition unit 28 also includes, in the attribute data DA of each unit section, the evaluation value Q calculated by the singing evaluation unit 24 for that unit section (for example, the average of the evaluation values within the unit section).
The singing expression data DS generated by the singing analysis unit 26 for each unit section and the attribute data DA generated by the attribute acquisition unit 28 for each unit section are stored in the storage device 12 in mutual association, pairing those that share a common unit section. By repeating the expression registration process exemplified above for the audio signals R of a plurality of different singing voices, singing expression data DS and attribute data DA are accumulated in the storage device 12 for each of the plural types of singing expressions extracted from the singing voices uttered by the respective singers. That is, a database of a wide variety of singing expressions (singing expressions of different singers and of different types) is built in the storage device 12. It is also possible to integrate a plurality of singing expression data DS into one singing expression data DS. For example, a configuration may be employed in which new singing expression data DS is generated by averaging plural singing expression data DS whose attribute data DA approximate or match each other, or by the weighted addition of plural singing expression data DS using weights according to the evaluation value Q by the singing evaluation unit 24.
FIG. 4 is a flowchart of the expression registration process. As shown in FIG. 4, when the user instructs execution of the expression registration process by operating the input device 16 (SA1), the analysis processing unit 20 analyzes the audio signal R supplied from the sound collection device 14 (SA2). The singing segmentation unit 22 divides the audio signal R into unit sections according to the analysis results of the analysis processing unit 20 (SA3), and the singing analysis unit 26 generates singing expression data DS for each unit section by analyzing the audio signal R (SA4). The singing evaluation unit 24 calculates, for each unit section, an evaluation value Q according to the skill of the singing indicated by the audio signal R (SA5), and the attribute acquisition unit 28 generates, for each unit section, attribute data DA including the evaluation value Q calculated by the singing evaluation unit 24 (SA6). The singing expression data DS generated by the singing analysis unit 26 and the attribute data DA generated by the attribute acquisition unit 28 are stored in the storage device 12 for each unit section (SA7). The singing expressions designated by the singing expression data DS accumulated in the storage device 12 by the expression registration process described above are imparted to the audio signal X in the expression imparting process described below.
<Expression imparting process>
FIG. 5 is a functional configuration diagram of the elements of the voice processing device 100 related to the expression imparting process. By executing a program (expression imparting program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the expression imparting process (a singing selection unit 32, a section designation unit 34, an expression selection unit 36, and an expression imparting unit 38), as shown in FIG. 5. A configuration in which the functions of FIG. 5 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) executes some of the functions illustrated in FIG. 5, may also be employed.
The singing selection unit 32 selects one of the plurality of audio signals X stored in the storage device 12 (hereinafter, the "selected audio signal X"). For example, the singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X in the storage device 12 in accordance with an instruction from the user via the input device 16 (an instruction selecting an audio signal X).
The section designation unit 34 designates, within the selected audio signal X selected by the singing selection unit 32, one or more sections to which the singing expression of singing expression data DS is to be imparted (hereinafter, "target sections"). Specifically, the section designation unit 34 designates each target section in accordance with an instruction from the user via the input device 16. For example, the section designation unit 34 demarcates as a target section the section between two points designated by the user on the time axis (for example, on the waveform of the selected audio signal X) through operation of the input device 16. The plurality of target sections designated by the section designation unit 34 may overlap each other on the time axis. It is also possible to designate the entire section of the selected audio signal X (the whole song) as a target section.
The expression selection unit 36 of FIG. 5 sequentially selects, for each target section designated by the section designation unit 34, the singing expression data DS actually applied in the expression imparting process (hereinafter, the "target expression data DS") from among the plurality of singing expression data DS stored in the storage device 12. The expression selection unit 36 of the first embodiment selects the target expression data DS from the plurality of singing expression data DS by a search process using the attribute data DA stored in the storage device 12 in association with each singing expression data DS.
For example, by operating the input device 16 as appropriate, the user can designate a search condition (for example, a search term) for the target expression data DS for each target section. The expression selection unit 36 selects, for each target section, as the target expression data DS, the singing expression data DS corresponding to attribute data DA matching the search condition designated by the user from among the plurality of singing expression data DS in the storage device 12. For example, when the user designates a search condition on the singer (for example, age or gender), the target expression data DS corresponding to the attribute data DA of singers matching the condition (that is, the singing expressions of singers matching the condition) is retrieved. When the user designates a search condition on the song (for example, the genre or range of the song), the target expression data DS corresponding to the attribute data DA of songs matching the condition (that is, the singing expressions of songs matching the condition) is retrieved. When the user designates a search condition on the evaluation value Q of the singing voice (for example, a numerical range), the target expression data DS corresponding to attribute data DA with an evaluation value Q matching the condition (that is, the singing expressions of singers at the level intended by the user) is retrieved. As understood from the above description, the expression selection unit 36 of the first embodiment is expressed as an element that selects singing expression data DS (target expression data DS) in accordance with instructions from the user.
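A minimal sketch of such an attribute-based search is given below, reusing the hypothetical record layout sketched earlier; the condition format and function name are assumptions for illustration.
```python
# A minimal sketch of the attribute-based search: given user-specified
# conditions, return the DS records whose attribute data DA match.
def search_expressions(records, singer_cond=None, song_cond=None, q_range=None):
    hits = []
    for ds, da in records:
        if singer_cond and any(da.singer.get(k) != v for k, v in singer_cond.items()):
            continue
        if song_cond and any(da.song.get(k) != v for k, v in song_cond.items()):
            continue
        if q_range and not (q_range[0] <= da.evaluation_q <= q_range[1]):
            continue
        hits.append((ds, da))
    return hits

# e.g. singing expressions of female singers in pop songs rated 80 <= Q <= 100:
# search_expressions(db, {"gender": "female"}, {"genre": "pop"}, (80, 100))
```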
The expression imparting unit 38 of FIG. 5 generates the audio signal Y by imparting the singing expression of the target expression data DS to the selected audio signal X selected by the singing selection unit 32. Specifically, the expression imparting unit 38 imparts, to each of the plurality of target sections designated by the section designation unit 34 in the selected audio signal X, the singing expression of the target expression data DS selected by the expression selection unit 36 for that target section. That is, singing expressions according to instructions from the user (designation of search conditions) are imparted to the target sections of the selected audio signal X designated according to instructions from the user. Any known technique may be employed for imparting a singing expression to the selected audio signal X. In addition to a configuration in which the singing expression of the selected audio signal X is replaced with the singing expression of the target expression data DS (a configuration in which the singing expression of the selected audio signal X does not remain in the audio signal Y), a configuration in which the singing expression of the target expression data DS is imparted cumulatively on top of the singing expression of the selected audio signal X (for example, a configuration in which both the singing expression of the selected audio signal X and the singing expression of the target expression data DS are reflected in the audio signal Y) may also be employed.
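The two imparting modes (replacement versus cumulative addition) can be contrasted with the following minimal sketch, which models the singing expression simply as a pitch-deviation contour around a smoothed reference; this modeling is an assumption for illustration, not the embodiment's imparting technique.
```python
# A minimal sketch contrasting the two imparting modes on a per-frame
# pitch contour; deviation_ds stands in for the expression of the target
# expression data DS over the target section.
import numpy as np

def impart_expression(pitch_x: np.ndarray, deviation_ds: np.ndarray,
                      cumulative: bool) -> np.ndarray:
    # Smoothed reference contour standing in for "X without its expression".
    kernel = np.ones(9) / 9.0
    base = np.convolve(pitch_x, kernel, mode="same")
    if cumulative:
        return pitch_x + deviation_ds   # X's expression and DS's both remain
    return base + deviation_ds          # DS's expression replaces X's
```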
FIG. 6 is a flowchart of the expression imparting process. As shown in FIG. 6, when the user instructs execution of the expression imparting process by operating the input device 16 (SB1), the singing selection unit 32 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SB2), and the section designation unit 34 designates one or more target sections of the selected audio signal X (SB3). The expression selection unit 36 selects the target expression data DS from the plurality of singing expression data DS stored in the storage device 12 (SB4), and the expression imparting unit 38 generates the audio signal Y by imparting the singing expression of the target expression data DS to each target section of the selected audio signal X selected by the singing selection unit 32 (SB5). The audio signal Y generated by the expression imparting unit 38 is reproduced from the sound emitting device 18 (SB6).
FIG. 7 is an explanatory diagram of a specific example of the expression imparting process applying singing expression data DS indicating vibrato. FIG. 7 illustrates the temporal change of the pitch of the selected audio signal X and a plurality of singing expression data DS (DS[1] to DS[4]). Each singing expression data DS is generated by the expression registration process on an audio signal R recording the singing voice of a different singer. Accordingly, the vibrato indicated by each singing expression data DS (DS[1] to DS[4]) differs in characteristics such as the pitch fluctuation period (rate) and fluctuation width (depth). As shown in FIG. 7, when a target section of the selected audio signal X is designated, for example according to an instruction from the user (SB3), and target expression data DS[3] is selected from the plurality of singing expression data DS, for example according to an instruction from the user (SB4), the expression imparting process generates an audio signal Y in which the vibrato indicated by the target expression data DS[3] is imparted to the target section of the selected audio signal X (SB5). As understood from the above description, the vibrato of the desired singing expression data DS is imparted to the desired target section of an audio signal X of a singing voice sung without vibrato (for example, the singing voice of a singer who is poor at singing with vibrato). The configuration by which the user selects the target expression data DS from the plurality of singing expression data DS is arbitrary. For example, a configuration is suitable in which a predetermined singing voice to which the singing expression of each singing expression data DS has been imparted is reproduced from the sound emitting device 18 for the user to listen to (that is, to audition), and the user selects the target expression data DS by operating the input device 16 (for example, buttons or a touch panel) based on the result of listening.
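As a worked illustration of the DS[1] to DS[4] example, the sketch below imparts a vibrato characterized by a fluctuation rate and depth to a target section of a pitch contour; the sinusoidal model and parameter values are assumptions for illustration.
```python
# A minimal sketch of imparting a vibrato expression characterized by a
# fluctuation rate (Hz) and depth (cents): a sinusoidal pitch deviation is
# applied over the target section of a pitch contour given in Hz.
import numpy as np

def impart_vibrato(pitch: np.ndarray, frame_rate: float, start: int, end: int,
                   rate_hz: float, depth_cents: float) -> np.ndarray:
    out = pitch.copy()
    t = np.arange(end - start) / frame_rate
    deviation = depth_cents * np.sin(2 * np.pi * rate_hz * t)  # in cents
    out[start:end] *= 2 ** (deviation / 1200)  # cents -> frequency ratio
    return out

# e.g. DS[3] modeled as a 5.5 Hz, 50-cent vibrato applied to frames 200-400
# of a contour sampled at 100 frames per second:
# y_pitch = impart_vibrato(x_pitch, 100.0, 200, 400, 5.5, 50.0)
```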
In FIG. 8, a case is assumed in which the expression selection unit 36 selects target expression data DS1 for a target section S1 of the selected audio signal X and selects target expression data DS2 for a target section S2 that differs from the target section S1. The expression imparting unit 38 imparts the singing expression E1 indicated by the target expression data DS1 to the target section S1, and imparts the singing expression E2 indicated by the target expression data DS2 to the target section S2.
As shown in FIG. 9, when the target section S1 and the target section S2 overlap (when the target section S2 is contained within the target section S1), the singing expression E1 of the target expression data DS1 and the singing expression E2 of the target expression data DS2 are imparted in an overlapping manner to the overlapping interval of the selected audio signal X (that is, to the target section S2). That is, a plurality of (typically plural types of) singing expressions are imparted to a specific section of the selected audio signal X in an overlapping manner. For example, both a singing expression E1 relating to pitch fluctuation and a singing expression E2 relating to volume fluctuation are imparted to the selected audio signal X (the target section S2). The audio signal Y generated by the above processing is supplied to the sound emitting device 18 and reproduced as sound.
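The overlapping case can be sketched as two independent modifications acting on different signal attributes, so that both take effect on the shared interval; the section encoding and function name below are hypothetical.
```python
# A minimal sketch of the overlapping case of FIG. 9: a pitch expression E1
# over target section S1 and a volume expression E2 over target section S2
# are applied independently, so both act on the overlap (S2 inside S1).
import numpy as np

def impart_overlapping(pitch, volume, s1, s2, pitch_dev_e1, gain_env_e2):
    # pitch_dev_e1 spans S1 and gain_env_e2 spans S2 (frame counts match).
    p, v = pitch.copy(), volume.copy()
    p[s1[0]:s1[1]] += pitch_dev_e1   # E1: pitch fluctuation over S1
    v[s2[0]:s2[1]] *= gain_env_e2    # E2: volume fluctuation over S2
    return p, v

# Within S2 both the modified pitch (from E1) and the modified volume
# (from E2) are reflected in the resulting audio signal Y.
```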
As described above, in the first embodiment, the singing expressions of the plurality of singing expression data DS indicating mutually different singing expressions are selectively imparted to the target sections of the selected audio signal X. Therefore, singing voices (the audio signal Y) with more diverse singing expressions can be generated than with the technique of Patent Document 1.
In the first embodiment in particular, a separate singing expression is imparted to each of the plurality of target sections designated in the selected audio signal X (FIGS. 8 and 9); compared with a configuration in which the sections to which singing expressions can be imparted are limited to a single section of the selected audio signal X, the above-described effect of being able to generate singing voices with diverse singing expressions is especially remarkable. Furthermore, in the first embodiment, a plurality of (plural types of) singing expressions can be imparted to a target section of the selected audio signal X in an overlapping manner (FIG. 9); compared with a configuration in which the singing expressions imparted to a target section are limited to one type, the effect of being able to generate singing voices with diverse singing expressions is especially remarkable. However, a configuration in which the section to which a singing expression is imparted is limited to a single section of the selected audio signal X, and a configuration in which the singing expressions imparted to a target section are limited to one type, are also encompassed within the scope of the present invention.
In the first embodiment, since the target sections of the selected audio signal X are designated according to instructions from the user and the search conditions on the attribute data DA are set according to instructions from the user, there is also an advantage that diverse singing voices fully reflecting the user's intentions and preferences can be generated.
<Second Embodiment>
A second embodiment of the present invention will now be described. In the voice processing device 100 of the first embodiment, the plurality of singing expression data DS stored in the storage device 12 were used to adjust the singing expression of the audio signal X. In the voice processing device 100 of the second embodiment, the plurality of singing expression data DS stored in the storage device 12 are used for the evaluation of the audio signal X. In each of the embodiments exemplified below, for elements whose operation and function are the same as in the first embodiment, the reference signs used in the description of the first embodiment are carried over and detailed description of each is omitted as appropriate.
FIG. 10 is a functional configuration diagram of the elements of the voice processing device 100 of the second embodiment related to the process of evaluating the audio signal X (hereinafter, the "singing evaluation process"). The storage device 12 of the second embodiment stores a plurality of pairs of singing expression data DS and attribute data DA generated by the same expression registration process as in the first embodiment. As described above for the first embodiment, the attribute data DA corresponding to each singing expression data DS includes the evaluation value Q (an index evaluating the skill of the singing expression of the singing expression data DS) calculated by the singing evaluation unit 24 of FIG. 2.
By executing a program (singing evaluation program) stored in the storage device 12, the arithmetic processing device 10 functions as a plurality of elements for realizing the singing evaluation process (a singing selection unit 42, a section designation unit 44, and a singing evaluation unit 46), as shown in FIG. 10. For example, the expression imparting process of the first embodiment and the singing evaluation process detailed below are selectively executed in accordance with an instruction from the user via the input device 16. In the second embodiment, however, the expression imparting process may also be omitted. A configuration in which the functions of FIG. 10 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions illustrated in FIG. 10, may also be employed.
The singing selection unit 42 of FIG. 10 selects, from the plurality of audio signals X stored in the storage device 12, the selected audio signal X to be evaluated. Specifically, like the singing selection unit 32 of the first embodiment, the singing selection unit 42 selects the selected audio signal X from the storage device 12 in accordance with an instruction from the user via the input device 16. The section designation unit 44 designates one or more target sections to be evaluated within the selected audio signal X selected by the singing selection unit 42. Specifically, like the section designation unit 34 of the first embodiment, the section designation unit 44 designates each target section in accordance with an instruction from the user via the input device 16. It is also possible to designate the entire section of the selected audio signal X as the target section.
The singing evaluation unit 46 of FIG. 10 evaluates the skill of the singing of the selected audio signal X selected by the singing selection unit 42, using the singing expression data DS and the attribute data DA (evaluation values Q) stored in the storage device 12. That is, the singing evaluation unit 46 calculates an evaluation value Z for the selected audio signal X according to the evaluation values Q in the attribute data DA corresponding to the singing expression data DS, among the plurality of singing expression data DS in the storage device 12, whose singing expressions are similar to the respective target sections of the selected audio signal X. The specific operation of the singing evaluation unit 46 is described below.
First, the singing evaluation unit 46 calculates, for each of the plurality of singing expression data DS in the storage device 12 and for each target section, the degree of similarity (correlation or distance) between the singing expression indicated by that singing expression data DS and the singing expression of the target section of the selected audio signal X, and sequentially selects, for each of the plurality of target sections of the selected audio signal X, the singing expression data DS whose similarity to the singing expression of the target section is greatest among the plurality of singing expression data DS. Any known technique for comparing feature quantities may be employed for calculating the similarity of singing expressions.
Then, the singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X by the weighted addition (or averaging), over the plurality of target sections of the selected audio signal X, of the evaluation values Q of the attribute data DA corresponding to the singing expression data DS selected for the respective target sections. As understood from the above description, the more target sections sung with singing expressions similar to highly rated singing expressions the selected audio signal X contains, the larger the value to which its evaluation value Z is set. The evaluation value Z calculated by the singing evaluation unit 46 is reported to the user, for example, by image display on a display device (not shown) or by audio playback through the sound emitting device 18.
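The two steps above can be sketched together as follows, using cosine similarity as one concrete instance of the "correlation or distance" comparison and a plain average as one instance of the weighted addition; both choices are assumptions made for illustration.
```python
# A minimal sketch of the singing evaluation: for each target section, find
# the registered DS whose feature vector is most similar, then average the
# corresponding Q values into the overall evaluation value Z.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def evaluate_singing(section_features, registered):
    # registered: list of (feature_vector, evaluation_q) pairs from storage.
    q_per_section = []
    for feat in section_features:
        best = max(registered, key=lambda r: cosine(feat, r[0]))
        q_per_section.append(best[1])
    return float(np.mean(q_per_section))   # evaluation value Z (plain average)

# Per-section weights (e.g. by section length) could replace np.mean to
# realize the weighted-addition variant mentioned above.
```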
FIG. 11 is a flowchart of the singing evaluation process. As shown in FIG. 11, when the user instructs execution of the singing evaluation process by operating the input device 16 (SC1), the singing selection unit 42 selects the selected audio signal X from the plurality of audio signals X stored in the storage device 12 (SC2), and the section designation unit 44 designates one or more target sections of the selected audio signal X (SC3). The singing evaluation unit 46 calculates the evaluation value Z of the selected audio signal X using the singing expression data DS and the attribute data DA stored in the storage device 12 (SC4). The evaluation value Z calculated by the singing evaluation unit 46 is reported to the user (SC5).
As described above, in the second embodiment the evaluation value Z of the selected voice signal X is calculated according to the evaluation values Q of the singing expression data DS whose singing expressions are similar to the selected voice signal X. The selected voice signal X can therefore be evaluated appropriately from the viewpoint of the skill of its singing expression (its similarity to the singing expressions registered in the expression registration process). As is also clear from the above description, in the second embodiment the information in the attribute data DA other than the evaluation value Q may be omitted. That is, the storage device 12 of the second embodiment can be expressed generically as an element that stores, for each of plural mutually different singing expressions, singing expression data DS indicating the singing expression and an evaluation value Q indicating the evaluation of that singing expression.
<Modifications>
 Each of the embodiments described above can be modified in various ways. Specific modifications are illustrated below; two or more aspects arbitrarily selected from the following examples may be combined as appropriate.
(1) The targets of the expression imparting process of the first embodiment and the singing evaluation process of the second embodiment are not limited to a voice signal X recorded in advance and stored in the storage device 12. For example, a voice signal X generated by the sound collection device 14, a voice signal X reproduced from a portable or built-in recording medium (for example, a CD), or a voice signal X received from another communication terminal via a communication network (for example, a streaming audio signal) may also be subjected to the expression imparting process or the singing evaluation process. A configuration is also possible in which these processes are applied to a voice signal X generated by a known voice synthesis process (for example, concatenative singing synthesis). In the embodiments described above, the expression imparting process and the singing evaluation process were applied to a recorded voice signal X, but if each target section on the time axis is designated in advance, for example, these processes can also be executed in real time in parallel with the supply of the voice signal X.
In the embodiments described above, one of plural voice signals X was selected as the selected voice signal X, but the selection of the voice signal X (the singing selection unit 32 or 42) may be omitted. In a configuration in which the entire span of the voice signal X (the whole piece of music) is designated as the target section, the section designation unit 34 may also be omitted. The voice processing device that executes the expression imparting process can therefore be expressed generically, as illustrated in FIG. 12, as a device comprising an expression selection unit 36 that selects the singing expression data DS to be applied from plural singing expression data DS, and an expression imparting unit 38 that imparts the singing expression indicated by the selected singing expression data DS to a specific section of the singing voice (voice signal X); a minimal sketch of this two-unit structure appears after this variation.
Likewise, the target of the expression registration process is not limited to the voice signal R generated by the sound collection device 14. For example, a voice signal R reproduced from a portable or built-in recording medium, or a voice signal R received from another communication terminal via a communication network, may also be the target of the expression registration process. The expression registration process can likewise be executed in real time in parallel with the supply of the voice signal R.
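A minimal sketch of the two-unit structure of FIG. 12 follows, with the actual signal transformation left abstract; every class and method name here is invented for illustration.

    # Sketch of the generic device of FIG. 12. All names are illustrative.
    class ExpressionSelectionUnit:
        """Corresponds to the expression selection unit 36."""
        def __init__(self, expression_db):
            self.expression_db = expression_db  # plural singing expression data DS

        def select(self, predicate):
            # Pick the DS to apply, e.g. by matching its attribute data DA.
            return next(ds for ds in self.expression_db if predicate(ds))

    class ExpressionImpartingUnit:
        """Corresponds to the expression imparting unit 38."""
        def impart(self, voice_signal, section, ds):
            # Impose the singing expression of ds onto the given section of
            # the singing voice; the transformation itself is left abstract.
            raise NotImplementedError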
(2) In the embodiments described above, the expression imparting process of the first embodiment and the singing evaluation process of the second embodiment were applied to a voice signal X representing the time waveform of a singing voice, but the representation format of the singing voice subjected to these processes is arbitrary. Specifically, the singing voice may be represented by synthesis information (for example, a VSQ-format file) that designates, in time series, a pitch and a pronounced character (lyric) for each note of a piece of music. In that case, the expression imparting unit 38 of the first embodiment imparts singing expressions by the same expression imparting process as in the first embodiment while sequentially synthesizing the singing voice designated by the synthesis information, for example by concatenative singing synthesis. Similarly, the singing evaluation unit 46 of the second embodiment executes the same singing evaluation process as in the second embodiment while sequentially synthesizing the singing voice designated by the synthesis information by voice synthesis.
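For concreteness, note-wise synthesis information of the kind this variation mentions might be modeled as below; the field names are assumptions and do not reproduce the actual VSQ schema.

    from dataclasses import dataclass

    @dataclass
    class Note:
        start_sec: float      # onset time on the musical time axis
        duration_sec: float   # note length
        pitch: int            # e.g. a MIDI note number (an assumption)
        lyric: str            # pronounced character(s) for the note

    # A toy score designating pitch and lyric in time series per note.
    synthesis_info = [
        Note(0.0, 0.5, 67, "sa"),
        Note(0.5, 0.5, 69, "ku"),
        Note(1.0, 1.0, 71, "ra"),
    ]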
(3) In the first embodiment, one set of target expression data DS was selected for each target section, but the expression selection unit 36 may select plural (typically plural kinds of) target expression data DS for a single target section. In that case, the singing expression of each of the plural target expression data DS selected by the expression selection unit 36 is imparted, in overlapping fashion, to the one target section of the selected voice signal X. It is also possible to impart to the target section the singing expression of a single set of singing expression data DS obtained by integrating the plural target expression data DS selected for that section (for example, singing expression data DS obtained by weighted addition of the plural target expression data DS); a sketch of this integration follows.
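The integration by weighted addition mentioned above could be sketched as follows, under the assumption (not stated in the variation) that the expression data are commensurable parameter vectors of equal length.

    import numpy as np

    def integrate_expressions(ds_list, weights):
        """Combine plural singing expression data DS selected for one target
        section into a single DS by normalized weighted addition."""
        w = np.asarray(weights, dtype=float)
        stacked = np.stack([np.asarray(ds, dtype=float) for ds in ds_list])
        return (w[:, None] * stacked).sum(axis=0) / w.sum()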
(4) In the first embodiment, singing expression data DS matching an instruction from the user was selected by designating search conditions, but the method by which the expression selection unit 36 selects singing expression data DS is arbitrary. For example, the singing voice of the singing expression indicated by each set of singing expression data DS may be played back from the sound emitting device 18 for the user to audition, and the expression selection unit 36 may then select the singing expression data DS that the user designates in view of the audition. Configurations are also possible in which the singing expression data DS stored in the storage device 12 are selected at random, or selected according to a predetermined rule chosen in advance.
(5) In the first embodiment, the voice signal Y generated by the expression imparting unit 38 was supplied to the sound emitting device 18 and played back, but the method of outputting the voice signal Y is arbitrary. For example, configurations may be adopted in which the voice signal Y generated by the expression imparting unit 38 is stored on a specific recording medium (for example, the storage device 12 or a portable recording medium), or in which the voice signal Y is transmitted from a communication device to another communication terminal.
(6) The first embodiment illustrated a voice processing device 100 that executes both the expression registration process and the expression imparting process, but the voice processing device that executes the expression registration process and the voice processing device that executes the expression imparting process may be configured as separate bodies. In that case, the plural singing expression data DS generated by the expression registration process of the registration device are transferred to the expression-imparting device and applied in the expression imparting process. Similarly, in the second embodiment, the voice processing device that executes the expression registration process and the voice processing device that executes the singing evaluation process may be configured separately.
(7) The voice processing device 100 may also be realized as a server device that communicates with a terminal device such as a mobile phone. For example, the voice processing device 100 executes an expression registration process that extracts singing expression data DS by analyzing a voice signal R received from the terminal device and stores it in the storage device 12, and an expression imparting process that transmits to the terminal device a voice signal Y obtained by imparting to a voice signal X the singing expression indicated by the singing expression data DS. That is, the present invention can also be realized as a voice processing system comprising a voice processing device (server device) and a terminal device that communicate with each other. The voice processing device 100 of each embodiment described above can likewise be realized as a system (voice processing system) in which its functions are distributed over plural devices.
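Purely as a structural sketch, the two server-side roles this variation describes might be grouped as below; analyze_singing_expression and apply_singing_expression are hypothetical helpers, not functions defined by the embodiment.

    class VoiceProcessingServer:
        """Sketch of the server-side voice processing device 100."""
        def __init__(self, storage):
            self.storage = storage

        def register_expression(self, voice_signal_r):
            # Expression registration: analyze the voice signal R received
            # from a terminal and store the extracted data DS.
            ds = analyze_singing_expression(voice_signal_r)  # hypothetical
            self.storage.add(ds)
            return ds

        def impart_expression(self, voice_signal_x, ds):
            # Expression imparting: return the voice signal Y, obtained by
            # imparting the singing expression of DS to X, for transmission
            # back to the terminal.
            return apply_singing_expression(voice_signal_x, ds)  # hypothetical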
(8) In the second embodiment, the singing evaluation unit 46 evaluated the skill of the singing in the voice signal X using the singing expression data DS and attribute data DA (evaluation values Q) stored in the storage device 12, but the singing evaluation unit 46 may instead obtain the evaluation values Q from a device other than the storage device 12 and evaluate the skill of the singing in the voice signal X accordingly.
This application is based on Japanese Patent Application No. 2013-053983, filed March 15, 2013, the contents of which are incorporated herein by reference.
According to the present invention, singing voices with a variety of singing expressions can be generated.
DESCRIPTION OF SYMBOLS: 100: voice processing device; 10: arithmetic processing device; 12: storage device; 14: sound collection device; 16: input device; 18: sound emitting device; 20: analysis processing unit; 20A: music structure analysis unit; 20B: singing technique analysis unit; 20C: voice quality analysis unit; 22: singing segmentation unit; 24, 46: singing evaluation unit; 26: singing analysis unit; 28: attribute acquisition unit; 32, 42: singing selection unit; 34, 44: section designation unit; 36: expression selection unit; 38: expression imparting unit.

Claims (7)

  1.  A voice processing device comprising:
     an expression selection unit configured to select singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions; and
     an expression imparting unit configured to impart the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of a singing voice.
  2.  The voice processing device according to claim 1, wherein
     the expression selection unit selects two or more singing expression data indicating mutually different singing expressions, and
     the expression imparting unit imparts the singing expressions indicated by the respective two or more singing expression data selected by the expression selection unit, in overlapping fashion, to the specific section of the singing voice.
  3.  The voice processing device according to claim 1 or claim 2, further comprising:
     a storage unit configured to store attribute data related to a singing expression in association with the singing expression data of that singing expression,
     wherein the expression selection unit selects singing expression data from the storage unit with reference to the attribute data of each singing expression data.
  4.  The voice processing device according to any one of claims 1 to 3, wherein
     the expression selection unit selects the singing expression data in accordance with an instruction from a user, and
     the expression imparting unit imparts the singing expression indicated by the singing expression data selected by the expression selection unit to a specific section of the singing voice designated in accordance with an instruction from the user.
  5.  The voice processing device according to claim 1, further comprising:
     a singing evaluation unit configured to evaluate a singing voice according to an evaluation value that corresponds to singing expression data, among the plurality of singing expression data, indicating a singing expression similar to the singing voice, the evaluation value indicating an evaluation of that singing expression.
  6.  The voice processing device according to claim 5, further comprising:
     a storage unit configured to store, for each of a plurality of mutually different singing expressions, singing expression data indicating the singing expression and an evaluation value indicating an evaluation of that singing expression,
     wherein the singing evaluation unit evaluates the singing voice according to the evaluation value, stored in the storage unit, corresponding to the singing expression data, among the plurality of singing expression data, indicating a singing expression similar to the singing voice.
  7.  A voice processing method comprising:
     selecting singing expression data to be applied from a plurality of singing expression data indicating mutually different singing expressions; and
     imparting the singing expression indicated by the selected singing expression data to a specific section of a singing voice.
PCT/JP2014/056570 2013-03-15 2014-03-12 Voice processing device WO2014142200A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480014605.4A CN105051811A (en) 2013-03-15 2014-03-12 Voice processing device
KR1020157024316A KR20150118974A (en) 2013-03-15 2014-03-12 Voice processing device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2013-053983 2013-03-15
JP2013053983A JP2014178620A (en) 2013-03-15 2013-03-15 Voice processor

Publications (1)

Publication Number Publication Date
WO2014142200A1 true WO2014142200A1 (en) 2014-09-18

Family

ID=51536851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/056570 WO2014142200A1 (en) 2013-03-15 2014-03-12 Voice processing device

Country Status (5)

Country Link
JP (1) JP2014178620A (en)
KR (1) KR20150118974A (en)
CN (1) CN105051811A (en)
TW (1) TW201443874A (en)
WO (1) WO2014142200A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6620462B2 (en) * 2015-08-21 2019-12-18 ヤマハ株式会社 Synthetic speech editing apparatus, synthetic speech editing method and program
KR102168529B1 (en) * 2020-05-29 2020-10-22 주식회사 수퍼톤 Method and apparatus for synthesizing singing voice with artificial neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003108179A (en) * 2001-10-01 2003-04-11 Nippon Telegr & Teleph Corp <Ntt> Method and program for gathering rhythm data for singing voice synthesis and recording medium where the same program is recorded
JPWO2009125710A1 (en) * 2008-04-08 2011-08-04 株式会社エヌ・ティ・ティ・ドコモ Media processing server apparatus and media processing method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003255974A (en) * 2002-02-28 2003-09-10 Yamaha Corp Singing synthesis device, method and program
JP2004264676A (en) * 2003-03-03 2004-09-24 Yamaha Corp Apparatus and program for singing synthesis
JP2006330615A (en) * 2005-05-30 2006-12-07 Yamaha Corp Device and program for synthesizing singing
JP2008165130A (en) * 2007-01-05 2008-07-17 Yamaha Corp Singing sound synthesizing device and program
JP2009244607A (en) * 2008-03-31 2009-10-22 Daiichikosho Co Ltd Duet part singing generation system
JP2009258291A (en) * 2008-04-15 2009-11-05 Yamaha Corp Sound data processing device and program
JP2011013454A (en) * 2009-07-02 2011-01-20 Yamaha Corp Apparatus for creating singing synthesizing database, and pitch curve generation apparatus
JP2011095397A (en) * 2009-10-28 2011-05-12 Yamaha Corp Sound synthesizing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HIDEKI KAWAHARA ET AL.: "Perceptual study on design reuse of voice identity and singing style based on singing voice morphing", INTERACTION 2007 YOKOSHU, March 2007 (2007-03-01), Retrieved from the Internet <URL:http://www.interaction-ipsj.org/archives/paper2007/aural/0043/paper0043.pdf> [retrieved on 20140604] *
TAKESHI SAITO ET AL.: "Utagoe no Kojinsei Chikaku ni Kiyo suru Onkyo Tokucho no Kento", REPORT OF THE 2007 AUTUMN MEETING, THE ACOUSTICAL SOCIETY OF JAPAN CD-ROM, September 2007 (2007-09-01), pages 601 - 602 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016194622A (en) * 2015-04-01 2016-11-17 株式会社エクシング Karaoke device and karaoke program
EP3537432A4 (en) * 2016-11-07 2020-06-03 Yamaha Corporation Voice synthesis method
US11410637B2 (en) 2016-11-07 2022-08-09 Yamaha Corporation Voice synthesis method, voice synthesis device, and storage medium

Also Published As

Publication number Publication date
KR20150118974A (en) 2015-10-23
JP2014178620A (en) 2014-09-25
TW201443874A (en) 2014-11-16
CN105051811A (en) 2015-11-11

Similar Documents

Publication Publication Date Title
KR101094687B1 (en) The Karaoke system which has a song studying function
JP4207902B2 (en) Speech synthesis apparatus and program
JP2015034920A (en) Voice analysis device
JP4645241B2 (en) Voice processing apparatus and program
CN111542875A (en) Speech synthesis method, speech synthesis device, and program
CN112331222A (en) Method, system, equipment and storage medium for converting song tone
JP2019061135A (en) Electronic musical instrument, musical sound generating method of electronic musical instrument, and program
WO2014142200A1 (en) Voice processing device
JP4479701B2 (en) Music practice support device, dynamic time alignment module and program
WO2020095950A1 (en) Information processing method and information processing system
Lerch Software-based extraction of objective parameters from music performances
JP6288197B2 (en) Evaluation apparatus and program
JP6102076B2 (en) Evaluation device
JP6657713B2 (en) Sound processing device and sound processing method
JP6737320B2 (en) Sound processing method, sound processing system and program
JP4491743B2 (en) Karaoke equipment
JP2002073064A (en) Voice processor, voice processing method and information recording medium
KR20090023912A (en) Music data processing system
JP2008197350A (en) Musical signal creating device and karaoke device
JP6252420B2 (en) Speech synthesis apparatus and speech synthesis system
JP5618743B2 (en) Singing voice evaluation device
JP2008040258A (en) Musical piece practice assisting device, dynamic time warping module, and program
JP6365483B2 (en) Karaoke device, karaoke system, and program
JP5953743B2 (en) Speech synthesis apparatus and program
JP5805474B2 (en) Voice evaluation apparatus, voice evaluation method, and program

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase (Ref document number: 201480014605.4; Country of ref document: CN)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14762388; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20157024316; Country of ref document: KR; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14762388; Country of ref document: EP; Kind code of ref document: A1)