WO2018230670A1 - Method for outputting singing voice, and voice response system - Google Patents

Method for outputting singing voice, and voice response system

Info

Publication number
WO2018230670A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
character string
singing
singing voice
user
Prior art date
Application number
PCT/JP2018/022816
Other languages
French (fr)
Japanese (ja)
Inventor
大樹 倉光
頌子 奈良
強 宮木
浩雅 椎原
健一 山内
晋 山中
Original Assignee
ヤマハ株式会社
Priority date
Filing date
Publication date
Application filed by ヤマハ株式会社
Publication of WO2018230670A1


Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The present invention relates to a technique for responding to user input with a voice that includes singing.
  • Patent Document 1 discloses a technique for changing the atmosphere of music according to the user's situation and preferences.
  • Patent Document 2 discloses a technique for making music selections that do not become monotonous in a device that outputs musical sounds according to the state of a moving body.
  • However, neither Patent Document 1 nor Patent Document 2 outputs a singing voice according to the interaction with the user.
  • In view of this, the present invention provides a technique for outputting a singing voice according to the interaction with the user.
  • To this end, the present invention provides a singing voice output method including: a step of decomposing content into a plurality of partial contents; a step of specifying a first partial content from the plurality of partial contents; a step of synthesizing a first singing voice using a character string included in the first partial content; a step of outputting the first singing voice; a step of receiving a user's reaction to the first singing voice; a step of specifying, in response to the user's reaction, a second partial content related to the first partial content; a step of synthesizing a second singing voice using a character string included in the second partial content; and a step of outputting the second singing voice.
  • the content includes, for example, a character string.
  • The singing voice output method may include a step of determining, in response to the user's reaction, an element used when singing synthesis is performed using the character string included in the second partial content.
  • The element may include a parameter of the singing synthesis, a melody, a tempo, or an arrangement of the accompaniment in the singing voice.
  • The synthesis of the first singing voice and the second singing voice may be performed using segments recorded in at least one database selected from a plurality of databases, and the singing voice output method may include a step of selecting, in response to the user's reaction, the database used when singing synthesis is performed using the character string included in the second partial content.
  • The synthesis of the first singing voice and the second singing voice may be performed using segments recorded in two or more databases selected from the plurality of databases, and the singing voice output method may include a step of determining a usage ratio of those databases according to the user's reaction.
  • The singing voice output method may include a step of replacing a part of the character string included in the first partial content with another character string, and in the step of synthesizing the first singing voice, the first singing voice may be synthesized using the character string included in the first partial content in which the part has been replaced with the other character string.
  • The other character string and the character string to be replaced may have the same number of syllables or the same number of morae.
  • The singing voice output method may include a step of replacing, in response to the user's reaction, a part of the character string included in the second partial content with another character string, and in the step of synthesizing the second singing voice, the second singing voice may be synthesized using the character string included in the second partial content in which the part has been replaced with the other character string.
  • The singing voice output method may include a step of synthesizing a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content, and a step of outputting the third singing voice between the first singing voice and the second singing voice.
  • The singing voice output method may include a step of synthesizing a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content, and a step of outputting the fourth singing voice after the first singing voice at a timing corresponding to the time length indicated by the first character string.
  • The present invention also provides an information processing system comprising: a decomposition unit that decomposes content into a plurality of partial contents; a specifying unit that specifies a first partial content from the plurality of partial contents; a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content; an output unit that outputs the first singing voice; and a reception unit that receives a user's reaction to the first singing voice, wherein the specifying unit specifies a second partial content related to the first partial content in response to the user's reaction, the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and the output unit outputs the second singing voice.
  • According to the present invention, a singing voice can be output in accordance with the interaction with the user.
  • FIG. 1 is a diagram illustrating an outline of a voice response system 1 according to an embodiment.
  • FIG. 2 is a diagram illustrating an outline of functions of the voice response system 1.
  • FIG. 3 is a diagram illustrating a hardware configuration of the input / output device 10.
  • FIG. 4 is a diagram illustrating a hardware configuration of the response engine 20 and the song synthesis engine 30.
  • FIG. 5 is a diagram illustrating a functional configuration related to the learning function 51.
  • FIG. 6 is a flowchart showing an outline of an operation related to the learning function 51.
  • FIG. 7 is a sequence chart illustrating an operation related to the learning function 51.
  • FIG. 8 is a diagram illustrating a classification table 5161.
  • FIG. 9 is a diagram illustrating a functional configuration related to the song synthesis function 52.
  • FIG. 10 is a flowchart showing an outline of the operation related to the song synthesis function 52.
  • FIG. 11 is a sequence chart illustrating an operation related to the song synthesis function 52.
  • FIG. 12 is a diagram illustrating a functional configuration related to the response function 53.
  • FIG. 13 is a flowchart illustrating an operation related to the response function 53.
  • FIG. 14 is a diagram showing an operation example 1 of the voice response system 1.
  • FIG. 15 is a diagram illustrating an operation example 2 of the voice response system 1.
  • FIG. 16 is a diagram showing an operation example 3 of the voice response system 1.
  • FIG. 17 is a diagram showing an operation example 4 of the voice response system 1.
  • FIG. 18 is a diagram illustrating an operation example 5 of the voice response system 1.
  • FIG. 19 is a diagram illustrating an operation example 6 of the voice response system 1.
  • FIG. 20 is a diagram illustrating an operation example 7 of the voice response system 1.
  • FIG. 21 is a diagram showing an operation example 8 of the voice response system 1.
  • FIG. 22 is a diagram showing an operation example 9 of the voice response system 1.
  • FIG. 1 is a diagram illustrating an overview of a voice response system 1 according to an embodiment.
  • the voice response system 1 is a so-called AI (Artificial Intelligence) voice assistant that automatically outputs a voice response in response to an input (or instruction) by a user.
  • voice input from the user to the voice response system 1 is referred to as “input voice”
  • voice output from the voice response system 1 in response to the input voice is referred to as “response voice”.
  • The response voice includes singing.
  • the voice response system 1 is an example of a song synthesis system. For example, when the user speaks “Sing something” to the voice response system 1, the voice response system 1 automatically synthesizes the song and outputs the synthesized song.
  • the voice response system 1 includes an input / output device 10, a response engine 20, and a song synthesis engine 30.
  • the input / output device 10 is a device that provides a man-machine interface, and is a device that receives an input voice from a user and outputs a response voice in response to the input voice.
  • the response engine 20 analyzes the input voice received by the input / output device 10 and generates a response voice. At least a part of the response voice includes singing voice.
  • the singing voice synthesis engine 30 synthesizes the singing voice used for the response voice.
  • FIG. 2 is a diagram illustrating an outline of functions of the voice response system 1.
  • the voice response system 1 has a learning function 51, a song synthesis function 52, and a response function 53.
  • the response function 53 is a function of analyzing a user input voice and providing a response voice based on the analysis result, and is provided by the input / output device 10 and the response engine 20.
  • the learning function 51 is a function for learning the user's preference from the user's input voice, and is provided by the singing synthesis engine 30.
  • the singing voice synthesizing function 52 is a function for synthesizing the singing voice used for the response voice, and is provided by the singing voice synthesis engine 30.
  • the learning function 51 learns the user's preference using the analysis result obtained by the response function 53.
  • the singing voice synthesis function 52 synthesizes a singing voice based on learning performed by the learning function 51.
  • the response function 53 makes a response using the singing voice synthesized by the singing voice synthesis function 52.
  • FIG. 3 is a diagram illustrating a hardware configuration of the input / output device 10.
  • the input / output device 10 includes a microphone 101, an input signal processing unit 102, an output signal processing unit 103, a speaker 104, a CPU (Central Processing Unit) 105, a sensor 106, a motor 107, and a network IF 108.
  • the microphone 101 converts the user's voice into an electric signal (input sound signal).
  • the input signal processing unit 102 performs processing such as analog / digital conversion on the input sound signal, and outputs data indicating the input sound (hereinafter referred to as “input sound data”).
  • the output signal processing unit 103 performs processing such as digital / analog conversion on data indicating response sound (hereinafter referred to as “response sound data”), and outputs an output sound signal.
  • the speaker 104 converts the output sound signal into sound (outputs sound based on the output sound signal).
  • the CPU 105 controls other elements of the input / output device 10 and reads and executes a program from a memory (not shown).
  • the sensor 106 detects the position of the user (the direction of the user viewed from the input / output device 10), and is an infrared sensor or an ultrasonic sensor, for example.
  • the motor 107 changes the direction of at least one of the microphone 101 and the speaker 104 so as to face the direction in which the user is present.
  • the microphone 101 may be configured by a microphone array, and the CPU 105 may detect the direction in which the user is present based on the sound collected by the microphone array.
  • The network IF 108 is an interface for performing communication via a network (for example, the Internet), and includes, for example, an antenna and a chip set for performing communication in accordance with a predetermined wireless communication standard (for example, WiFi (registered trademark)).
  • FIG. 4 is a diagram illustrating a hardware configuration of the response engine 20 and the song synthesis engine 30.
  • the response engine 20 includes a CPU 201, a memory 202, a storage 203, and a communication IF 204.
  • the CPU 201 performs various calculations according to the program and controls other elements of the computer apparatus.
  • the memory 202 is a main storage device that functions as a work area when the CPU 201 executes a program, and includes, for example, a RAM (Random Access Memory).
  • the storage 203 is a nonvolatile auxiliary storage device that stores various programs and data, and includes, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive).
  • the communication IF 204 includes a connector and a chip set for performing communication according to a predetermined communication standard (for example, Ethernet).
  • the storage 203 stores a program for causing the computer device to function as the response engine 20 in the voice response system 1 (hereinafter referred to as “response program”).
  • the computer device functions as the response engine 20 by the CPU 201 executing the response program.
  • the response engine 20 is, for example, a so-called AI.
  • the song synthesis engine 30 includes a CPU 301, a memory 302, a storage 303, and a communication IF 304. Details of each element are the same as those of the response engine 20.
  • the storage 303 stores a program for causing the computer device to function as the song synthesis engine 30 in the voice response system 1 (hereinafter referred to as “song synthesis program”).
  • The computer device functions as the song synthesis engine 30 by the CPU 301 executing the song synthesis program.
  • the response engine 20 and the song synthesis engine 30 are provided as cloud services on the Internet. Note that the response engine 20 and the song synthesis engine 30 may be services that do not depend on cloud computing.
  • FIG. 5 is a diagram illustrating a functional configuration related to the learning function 51.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a music analysis unit 513, a lyrics extraction unit 514, a preference analysis unit 515, a storage unit 516, and a processing unit 510.
  • the input / output device 10 functions as a receiving unit that receives user input voice and an output unit that outputs response voice.
  • The voice analysis unit 511 analyzes the input voice. This analysis is a process of acquiring, from the input voice, information used for generating a response voice, and specifically includes a process of converting the input voice into text (that is, into a character string), a process of determining the request for content, a process of specifying the content providing unit 60 that provides the content in response to the user's request, a process of instructing the specified content providing unit 60, a process of acquiring data from the content providing unit 60, and a process of generating a response using the acquired data.
  • the content providing unit 60 is an external system of the voice response system 1.
  • The content providing unit 60 is, for example, an external server that provides a service (for example, a music streaming service or internet radio) that outputs data for reproducing content such as music as sound (hereinafter referred to as “music data”).
  • the music analysis unit 513 analyzes the music data output from the content providing unit 60.
  • the analysis of music data refers to a process of extracting music characteristics.
  • the music features include at least one of tune, rhythm, chord progression, tempo, and arrangement. A known technique is used for feature extraction.
  • the lyrics extracting unit 514 extracts lyrics from the music data output from the content providing unit 60.
  • the music data includes metadata in addition to sound data.
  • the sound data is data indicating a signal waveform of music, and includes, for example, uncompressed data such as PCM (Pulse Code Modulation) data or compressed data such as MP3 data.
  • The metadata is data including information related to the music, and includes, for example, the music title, performer name, composer name, songwriter name, album title, genre and other music attributes, and lyrics information.
  • the lyrics extraction unit 514 extracts lyrics from metadata included in the music data. When the music data does not include metadata, the lyrics extraction unit 514 performs speech recognition processing on the sound data, and extracts lyrics from text obtained by the speech recognition.
  • the emotion estimation unit 512 estimates the user's emotion.
  • the emotion estimation unit 512 estimates the user's emotion from the input voice.
  • a known technique is used for emotion estimation.
  • the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch in the voice output by the voice response system 1 and the pitch of the user's response to the pitch.
  • the emotion estimation unit 512 may estimate the user's emotion based on the input voice converted into text by the voice analysis unit 511 or the analyzed user request.
  • the preference analysis unit 515 indicates the user's preference using at least one of the reproduction history of the music that the user has instructed to reproduce, the analysis result, the lyrics, and the user's emotion when the reproduction of the music is instructed.
  • Information (hereinafter referred to as “preference information”) is generated.
  • the preference analysis unit 515 updates the classification table 5161 stored in the storage unit 516 using the generated preference information.
  • The classification table 5161 is a table (or database) in which the user's preferences are recorded. For example, for each user and for each emotion, the features of music (for example, timbre, tune, rhythm, chord progression, and tempo), the attributes of music (performer name, composer name, songwriter name, and genre), and lyrics are recorded.
  • The storage unit 516 is an example of a reading unit that reads, from a table in which parameters used for singing synthesis are recorded in association with users, the parameters corresponding to the user who input the trigger.
  • The parameters used for singing synthesis are data referred to at the time of singing synthesis; in the classification table 5161, this is a concept that includes timbre, tune, rhythm, chord progression, tempo, performer name, composer name, songwriter name, genre, and lyrics.
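
As a non-normative illustration only, the classification table 5161 and the parameter read-out performed by the storage unit 516 might be modeled as in the following Python sketch; the field names and values are hypothetical and are not taken from the patent.

    # Hypothetical in-memory model of classification table 5161: one record of
    # singing-synthesis parameters per (user, emotion) pair.
    classification_table = {
        ("Taro Yamada", "happy"): {
            "lyric_words": ["love"],
            "tempo": 60,
            "chord_progression": "I-V-VIm-IIIm-IV-I-IV-V",
            "timbre": "piano",
            "genre": None,
            "performer": None,
        },
    }

    def read_parameters(user, emotion, table=classification_table):
        """Read the singing-synthesis parameters recorded for the user who
        input the trigger (the role described for storage unit 516)."""
        return table.get((user, emotion), {})

    if __name__ == "__main__":
        params = read_parameters("Taro Yamada", "happy")
        print(params["tempo"], params["timbre"])  # 60 piano
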
  • FIG. 6 is a flowchart showing an outline of the operation of the voice response system 1 according to the learning function 51.
  • the voice response system 1 analyzes the input voice.
  • the voice response system 1 performs processing instructed by the input voice.
  • the voice response system 1 determines whether the input voice includes an item to be learned. When it is determined that the input voice includes an item to be learned (S13: YES), the voice response system 1 moves the process to step S14. When it is determined that the input voice does not include items to be learned (S13: NO), the voice response system 1 moves the process to step S18.
  • the voice response system 1 estimates the user's emotion.
  • In step S15, the voice response system 1 analyzes the music for which playback has been instructed.
  • In step S16, the voice response system 1 acquires the lyrics of the music for which playback has been instructed.
  • In step S17, the voice response system 1 updates the classification table using the information obtained in steps S14 to S16.
  • The processing from step S18 onward is not directly related to the learning function 51, that is, to updating the classification table, but it includes processing that uses the classification table.
  • In step S18, the voice response system 1 generates a response voice for the input voice.
  • The classification table is referred to as necessary.
  • In step S19, the voice response system 1 outputs the response voice.
  • FIG. 7 is a sequence chart illustrating the operation of the voice response system 1 according to the learning function 51.
  • the user performs user registration with the voice response system 1 when, for example, the voice response system 1 is subscribed or activated for the first time.
  • User registration includes setting of a user name (or login ID) and a password.
  • the input / output device 10 is activated at the start of the sequence in FIG. 7, and the login process of the user is completed. That is, in the voice response system 1, a user who uses the input / output device 10 is specified.
  • the input / output device 10 is in a state of waiting for a user's voice input (speech).
  • the method by which the voice response system 1 identifies the user is not limited to the login process.
  • the voice response system 1 may specify the user based on the input voice.
  • the input / output device 10 receives an input voice.
  • the input / output device 10 converts the input voice into data and generates voice data.
  • The voice data includes sound data indicating the signal waveform of the input voice, and a header.
  • the header includes information indicating the attribute of the input voice.
  • the attributes of the input voice include, for example, an identifier for specifying the input / output device 10, a user identifier (for example, a user name or a login ID) of a user who has issued the voice, and a time stamp indicating the time at which the voice was emitted.
  • the input / output device 10 outputs voice data indicating the input voice to the voice analysis unit 511.
  • In step S103, the voice analysis unit 511 analyzes the input voice using the voice data.
  • the voice analysis unit 511 determines whether the input voice includes items to be learned.
  • the item to be learned is a matter for specifying a song, specifically, a music playback instruction.
  • In step S104, the processing unit 510 performs the processing instructed by the input voice.
  • the processing performed by the processing unit 510 is, for example, streaming playback of music.
  • the content providing unit 60 has a music database in which a plurality of music data is recorded.
  • the processing unit 510 reads the music data of the instructed music from the music database.
  • the processing unit 510 transmits the read music data to the input / output device 10 that is the transmission source of the input sound.
  • In another example, the processing performed by the processing unit 510 is playback of internet radio.
  • the content providing unit 60 performs streaming broadcasting of radio sound.
  • the processing unit 510 transmits the streaming data received from the content providing unit 60 to the input / output device 10 that is the transmission source of the input audio.
  • the processing unit 510 further performs processing for updating the classification table (step S105).
  • The processing for updating the classification table includes a request for emotion estimation to the emotion estimation unit 512 (step S1051), a request for music analysis to the music analysis unit 513 (step S1052), and a request for lyrics extraction to the lyrics extraction unit 514 (step S1053).
  • Upon receiving the request, the emotion estimation unit 512 estimates the user's emotion (step S106) and outputs information indicating the estimated emotion (hereinafter referred to as “emotion information”) to the processing unit 510 that is the request source (step S107).
  • the emotion estimation unit 512 estimates the user's emotion using the input voice.
  • The emotion estimation unit 512 estimates the emotion based on, for example, the input voice that has been converted into text. In one example, keywords indicating emotions are defined in advance, and when the input voice converted into text includes such a keyword, the emotion estimation unit 512 determines that the user has the corresponding emotion (for example, if a keyword expressing anger is included, the user's emotion is determined to be “anger”).
  • In another example, the emotion estimation unit 512 estimates the emotion based on the pitch, volume, speed, or temporal change of the input voice. In one example, when the average pitch of the input voice is lower than a threshold value, the emotion estimation unit 512 determines that the user's emotion is “sad”. In another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Specifically, when the pitch of the user's response is low even though the pitch of the voice output by the voice response system 1 is high, the emotion estimation unit 512 determines that the user's emotion is “sad”.
  • the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the pitch of the ending in the voice and the pitch of the user's response thereto. Alternatively, the emotion estimation unit 512 may estimate the user's emotion in consideration of these multiple factors.
  • the emotion estimation unit 512 may estimate the user's emotion using an input other than voice.
  • the input other than the voice for example, an image of a user's face taken by a camera, a user's body temperature detected by a temperature sensor, or a combination thereof is used.
  • the emotion estimation unit 512 determines whether the user's emotion is “fun”, “anger”, or “sad” from the facial expression of the user.
  • the emotion estimation unit 512 may determine the user's emotion based on the change in facial expression in the user's facial video.
  • the emotion estimation unit 512 may determine “anger” when the user's body temperature is high and “sad” when the user's body temperature is low.
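
The keyword-based and pitch-based estimation described in the preceding items can be sketched as follows; this is a non-normative illustration, and the keyword lists, the threshold value, and the function name are assumptions rather than anything specified in the patent.

    # Hypothetical keyword lists per emotion (the patent only states that such
    # keywords are defined in advance).
    EMOTION_KEYWORDS = {
        "anger": ["angry", "damn"],
        "fun": ["great", "yay"],
        "sad": ["tired", "lonely"],
    }
    SAD_PITCH_THRESHOLD_HZ = 150.0  # assumed threshold for a "low" average pitch

    def estimate_emotion(text, average_pitch_hz):
        """Estimate the user's emotion from the transcribed input voice and its
        average pitch, roughly following the rules described for the emotion
        estimation unit 512."""
        lowered = text.lower()
        # Rule 1: a predefined keyword in the text decides the emotion.
        for emotion, keywords in EMOTION_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                return emotion
        # Rule 2: a low average pitch is interpreted as "sad".
        if average_pitch_hz < SAD_PITCH_THRESHOLD_HZ:
            return "sad"
        return "neutral"

    print(estimate_emotion("I feel so lonely today", 180.0))  # -> sad (keyword)
    print(estimate_emotion("play something", 120.0))          # -> sad (low pitch)
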
  • Upon receiving the request, the music analysis unit 513 analyzes the music played back in accordance with the user's instruction (step S108) and outputs information indicating the analysis result (hereinafter referred to as “music information”) to the processing unit 510 that is the request source (step S109).
  • Upon receiving the request, the lyrics extraction unit 514 acquires the lyrics of the music played back in accordance with the user's instruction (step S110) and outputs information indicating the acquired lyrics (hereinafter referred to as “lyrics information”) to the processing unit 510 that is the request source (step S111).
  • In step S112, the processing unit 510 outputs the set of emotion information, music information, and lyrics information acquired from the emotion estimation unit 512, the music analysis unit 513, and the lyrics extraction unit 514 to the preference analysis unit 515.
  • The preference analysis unit 515 analyzes a plurality of such sets of information to obtain information indicating the user's preference. For this analysis, the preference analysis unit 515 records a plurality of sets of this information over a past period (for example, the period from the start of system operation to the present). In one example, the preference analysis unit 515 statistically processes the music information and calculates statistical representative values (for example, an average value, a mode value, or a median value). By this statistical processing, for example, the average value of the tempo and the mode values of the timbre, tune, rhythm, chord progression, composer name, songwriter name, and performer name are obtained.
  • In one example, the preference analysis unit 515 decomposes the lyrics indicated by the lyrics information into words using a technique such as morphological analysis, identifies the part of speech of each word, creates a histogram for words of a specific part of speech (for example, nouns), and identifies words whose appearance frequency is within a predetermined range (for example, the top 5%). Furthermore, the preference analysis unit 515 extracts from the lyrics information word groups that include an identified word and correspond to a predetermined syntactic unit (for example, a sentence, clause, or phrase). For example, when the word “like” appears frequently, word groups such as “I like you” and “I like it very much” are extracted from the lyrics information.
  • the preference analysis unit 515 may analyze a plurality of sets of information according to a predetermined algorithm different from simple statistical processing to obtain information indicating the user's preference.
  • the preference analysis unit 515 may receive feedback from the user and adjust the weights of these parameters according to the feedback.
  • the preference analysis unit 515 updates the classification table 5161 using the information obtained in step S113.
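
As a rough, non-normative sketch of the statistical processing and lyric-word extraction attributed to the preference analysis unit 515: a real implementation of the morphological analysis would use a dedicated analyzer (for example, MeCab for Japanese lyrics); here a plain whitespace split stands in for it, and the data layout is hypothetical.

    from collections import Counter
    from statistics import mean, mode

    def analyze_preference(music_records, lyrics_texts, top_ratio=0.05):
        """music_records: list of dicts with "tempo" and "timbre" keys.
        lyrics_texts: list of lyric strings of the played-back songs.
        Returns representative values and frequently appearing words."""
        # Statistical representative values of the music information.
        average_tempo = mean(record["tempo"] for record in music_records)
        typical_timbre = mode(record["timbre"] for record in music_records)

        # Word histogram over the lyrics; keep the most frequent words.
        words = [word for text in lyrics_texts for word in text.split()]
        counts = Counter(words)
        keep = max(1, int(len(counts) * top_ratio))
        frequent_words = [word for word, _ in counts.most_common(keep)]

        return {"tempo": average_tempo, "timbre": typical_timbre,
                "lyric_words": frequent_words}

    records = [{"tempo": 58, "timbre": "piano"}, {"tempo": 62, "timbre": "piano"}]
    lyrics = ["i like you", "i like it very much"]
    print(analyze_preference(records, lyrics))
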
  • FIG. 8 is a diagram illustrating a classification table 5161.
  • This figure shows a classification table 5161 for users whose user name is “Taro Yamada”.
  • the features, attributes, and lyrics of the music are recorded in association with the user's emotions.
  • In the classification table 5161 of this example, it is shown that when the user “Taro Yamada” has the emotion “happy”, the user prefers music whose lyrics include words related to “love”, whose tempo is around 60, whose chord progression is “I ⇒ V ⇒ VIm ⇒ IIIm ⇒ IV ⇒ I ⇒ IV ⇒ V”, and whose main timbre is piano.
  • The preference information recorded in the classification table 5161 accumulates as learning progresses, that is, as the cumulative usage time of the voice response system 1 increases, and comes to reflect the user's preference more closely. According to this example, information reflecting the user's preference can therefore be obtained automatically.
  • the preference analysis unit 515 may set the initial value of the classification table 5161 at a predetermined timing such as user registration or first login.
  • For example, the voice response system 1 may have the user select a character (for example, a so-called avatar) representing the user on the system, and may set a classification table 5161 having initial values corresponding to the selected character as the classification table corresponding to that user.
  • the data recorded in the classification table 5161 described in this embodiment is an example.
  • For example, the user's emotion need not be recorded in the classification table 5161, as long as at least lyrics are recorded.
  • Alternatively, lyrics need not be recorded in the classification table 5161, as long as at least the user's emotion and the result of the music analysis are recorded.
  • FIG. 9 is a diagram illustrating a functional configuration related to the song synthesis function 52.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a storage unit 516, a detection unit 521, a song generation unit 522, an accompaniment generation unit 523, and a synthesis unit 524.
  • the song generation unit 522 includes a melody generation unit 5221 and a lyrics generation unit 5222. In the following, description of elements common to the learning function 51 is omitted.
  • the storage unit 516 stores a segment database 5162.
  • the segment database is a database that records speech segment data used in singing synthesis.
  • the speech segment data is obtained by converting one or more phonemes into data.
  • A phoneme is the smallest unit of sound that serves to distinguish meaning in a language (for example, a vowel or a consonant), set in consideration of the actual articulation of the language and its phonological system as a whole.
  • the speech segment is obtained by cutting out a section corresponding to a desired phoneme or phoneme chain from the input speech uttered by a specific speaker.
  • the speech segment data in the present embodiment is data indicating the frequency spectrum of the speech segment.
  • the term “speech segment” includes a single phoneme (for example, a monophone) or a phoneme chain (for example, a diphone or a triphone).
  • the storage unit 516 may store a plurality of unit databases 5162.
  • The plurality of segment databases 5162 may include, for example, databases in which phonemes pronounced by different singers (or speakers) are recorded. Alternatively, the plurality of segment databases 5162 may include databases in which phonemes pronounced by a single singer (or speaker) with different singing styles or voice colors are recorded.
  • the song generation unit 522 generates a song voice, that is, synthesizes a song.
  • the singing voice is a voice uttered according to a given melody with given lyrics.
  • the melody generation unit 5221 generates a melody used for song synthesis.
  • the lyrics generation unit 5222 generates lyrics used for singing synthesis.
  • the melody generation unit 5221 and the lyric generation unit 5222 may generate melody and lyrics using information recorded in the classification table 5161.
  • the song generation unit 522 generates a song voice using the melody generated by the melody generation unit 5221 and the lyrics generated by the lyrics generation unit 5222.
  • The accompaniment generation unit 523 generates an accompaniment for the singing voice generated by the song generation unit 522.
  • The synthesis unit 524 synthesizes the singing voice using the singing generated by the song generation unit 522, the accompaniment generated by the accompaniment generation unit 523, and the speech segments recorded in the segment database 5162.
  • FIG. 10 is a flowchart showing an outline of the operation (song synthesis method) of the voice response system 1 according to the song synthesis function 52.
  • In step S21, the voice response system 1 determines (detects) whether an event that triggers singing synthesis has occurred.
  • The event that triggers singing synthesis is, for example, at least one of the following: an event in which a voice input is made by the user; an event registered in a calendar (for example, an alarm or the user's birthday); an event in which a singing synthesis instruction is input by a means other than voice (for example, an operation on a smartphone (not shown) wirelessly connected to the input / output device 10); and an event that occurs randomly.
  • When such an event is detected, the voice response system 1 moves the process to step S22.
  • Otherwise, the voice response system 1 waits until an event that triggers singing synthesis occurs.
  • In step S22, the voice response system 1 reads the singing synthesis parameters.
  • In step S23, the voice response system 1 generates lyrics.
  • In step S24, the voice response system 1 generates a melody.
  • In step S25, the voice response system 1 corrects one of the generated lyrics and melody to match the other.
  • In step S26, the voice response system 1 selects the segment database to be used (an example of a selection unit).
  • In step S27, the voice response system 1 performs singing synthesis using the lyrics, the melody, and the segment database obtained in steps S23 to S26.
  • In step S28, the voice response system 1 generates an accompaniment.
  • In step S29, the voice response system 1 synthesizes the singing voice and the accompaniment.
  • The processing of steps S23 to S29 is a part of the processing of step S18 in the flow of FIG. 6.
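
The flow of steps S22 to S29 can be summarized as the pipeline sketched below. Every function here is a deliberately trivial stand-in for the corresponding unit (the lyrics generation unit 5222, melody generation unit 5221, accompaniment generation unit 523, and so on); the data formats are assumptions made only to keep the sketch runnable.

    def read_parameters(user):                    # S22: read singing synthesis parameters
        return {"tempo": 60, "timbre": "piano"}

    def generate_lyrics(parameters):              # S23: generate lyrics (as a mora list)
        return ["to", "mo", "da", "chi"]

    def generate_melody(parameters, n_sounds):    # S24: generate a melody (note names)
        return ["C4", "D4", "E4", "D4", "C4"]

    def match_lyrics_to_melody(morae, notes):     # S25: correct one side to match the other
        morae = list(morae)
        while len(morae) < len(notes):
            morae.append("la")                    # pad the lyrics with a filler mora
        return morae, notes[:len(morae)]

    def select_segment_database(user):            # S26: select the segment database
        return "segment_db_female_pop"

    def synthesize_singing(morae, notes, database):   # S27: singing synthesis (placeholder)
        return list(zip(morae, notes))

    def generate_accompaniment(notes, parameters):    # S28: accompaniment (placeholder chords)
        return ["I", "V", "VIm", "IV"]

    def mix(singing, accompaniment):                  # S29: combine singing and accompaniment
        return {"singing": singing, "accompaniment": accompaniment}

    parameters = read_parameters("Taro Yamada")
    morae = generate_lyrics(parameters)
    notes = generate_melody(parameters, len(morae))
    morae, notes = match_lyrics_to_melody(morae, notes)
    database = select_segment_database("Taro Yamada")
    print(mix(synthesize_singing(morae, notes, database),
              generate_accompaniment(notes, parameters)))
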
  • Next, the song synthesis function 52 will be described in detail.
  • FIG. 11 is a sequence chart illustrating the operation of the voice response system 1 according to the song synthesis function 52.
  • the detection unit 521 requests the song generation unit 522 to perform song synthesis (step S201).
  • the request for song composition includes the user's identifier.
  • the song generation unit 522 inquires of the storage unit 516 about the user's preference (step S202). This query includes a user identifier.
  • the storage unit 516 reads the preference information corresponding to the user identifier included in the inquiry from the classification table 5161, and outputs the read preference information to the song generation unit 522 (step S203).
  • the song generation unit 522 inquires of the emotion estimation unit 512 about the user's emotion (step S204). This query includes a user identifier. When receiving the inquiry, the emotion estimation unit 512 outputs the emotion information of the user to the song generation unit 522 (step S205).
  • the song generation unit 522 selects a lyrics source.
  • the source of the lyrics is determined according to the input sound.
  • the source of the lyrics is roughly either the processing unit 510 or the classification table 5161.
  • The request for singing synthesis output from the processing unit 510 to the song generation unit 522 may or may not include lyrics (or lyrics material).
  • the lyric material is a character string that cannot form lyrics by itself but forms lyrics by combining with other lyric materials.
  • the case where the request for singing synthesis includes lyrics means, for example, a case where a response voice is output with a melody attached to the response itself by AI (“Tomorrow's weather is fine”, etc.).
  • Since the singing synthesis request is generated by the processing unit 510, it can be said that the source of the lyrics is the processing unit 510. Furthermore, since the processing unit 510 may acquire content from the content providing unit 60, it can also be said that the source of the lyrics is the content providing unit 60.
  • The content providing unit 60 is, for example, a server that provides news or a server that provides weather information. Alternatively, the content providing unit 60 may be a server that has a database in which the lyrics of existing music are recorded. Although only one content providing unit 60 is shown in the figure, a plurality of content providing units 60 may exist.
  • When the lyrics are included in the singing synthesis request, the song generation unit 522 selects the singing synthesis request as the source of the lyrics.
  • When the lyrics are not included in the singing synthesis request (for example, when the instruction given by the input voice does not specify the content of the lyrics, such as “sing something”), the song generation unit 522 selects the classification table 5161 as the source of the lyrics.
  • In step S207, the song generation unit 522 requests the selected source to provide lyrics material.
  • Here, the classification table 5161, that is, the storage unit 516, is selected as the source.
  • the request includes a user identifier and emotion information of the user.
  • the storage unit 516 extracts the lyrics material corresponding to the user identifier and emotion information included in the request from the classification table 5161 (step S208).
  • the storage unit 516 outputs the extracted lyric material to the song generation unit 522 (step S209).
  • the song generation unit 522 requests the lyrics generation unit 5222 to generate lyrics (step S210). This request includes the lyrics material obtained from the source.
  • the lyrics generation unit 5222 generates lyrics using the lyrics material (step S211).
  • the lyrics generation unit 5222 generates lyrics by combining a plurality of lyrics materials, for example.
  • Alternatively, each source may store lyrics for entire songs, and in that case the lyrics generation unit 5222 may select the lyrics of one song to be used for singing synthesis from the lyrics stored in the source.
  • the lyrics generation unit 5222 outputs the generated lyrics to the song generation unit 522 (step S212).
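
A minimal sketch of generating lyrics by combining lyric materials, as described for the lyrics generation unit 5222; the example materials and the target length are invented for illustration.

    import random

    def generate_lyrics(materials, target_lines=2, seed=0):
        """Combine short lyric materials (character strings that cannot form
        lyrics by themselves) into lyrics with the requested number of lines."""
        rng = random.Random(seed)
        return [rng.choice(materials) for _ in range(target_lines)]

    materials = ["I like you", "I like it very much", "under the spring sky"]
    print(generate_lyrics(materials))
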
  • the song generation unit 522 requests the melody generation unit 5221 to generate a melody.
  • This request includes the user's preference information and information specifying the number of sounds in the lyrics.
  • the information for specifying the number of sounds in the lyrics is the number of characters in the generated lyrics, the number of mora, or the number of syllables.
  • the melody generation unit 5221 generates a melody according to the preference information included in the request (step S214). Specifically, for example, as follows.
  • The melody generation unit 5221 generates the melody using a database in which melody materials (for example, note strings of about two or four bars in length, or information strings obtained by decomposing note strings into musical elements such as changes in rhythm and pitch) are recorded (hereinafter referred to as the “melody database”).
  • The melody database is stored in the storage unit 516, for example.
  • In the melody database, attributes of the melodies are recorded.
  • The attributes of a melody include, for example, music information such as a suitable tune or suitable lyrics, and a composer name.
  • The melody generation unit 5221 selects, from the materials recorded in the melody database, one or more materials that match the preference information included in the request, and combines the selected materials to obtain a melody of the desired length.
  • The melody generation unit 5221 outputs information specifying the generated melody (for example, sequence data such as MIDI) to the song generation unit 522 (step S215).
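
The material selection attributed to the melody generation unit 5221 might look like the following sketch; the melody database entries, the attribute names, and the matching rule are assumptions rather than the patent's actual data.

    # Hypothetical melody database: each material is a short note string
    # (roughly two or four bars) tagged with attributes.
    MELODY_DATABASE = [
        {"notes": ["C4", "E4", "G4", "E4"], "tune": "bright", "composer": "A"},
        {"notes": ["A3", "C4", "E4", "C4"], "tune": "dark", "composer": "B"},
        {"notes": ["G4", "F4", "E4", "D4"], "tune": "bright", "composer": "A"},
    ]

    def generate_melody(preference, n_sounds):
        """Select materials matching the preference information and combine
        them until the melody has at least n_sounds notes."""
        matching = [m for m in MELODY_DATABASE if m["tune"] == preference.get("tune")]
        candidates = matching or MELODY_DATABASE  # fall back when nothing matches
        melody = []
        index = 0
        while len(melody) < n_sounds:
            melody.extend(candidates[index % len(candidates)]["notes"])
            index += 1
        return melody[:n_sounds]

    print(generate_melody({"tune": "bright"}, 6))
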
  • The song generation unit 522 requests the melody generation unit 5221 to correct the melody, or requests the lyrics generation unit 5222 to correct the lyrics.
  • One purpose of this correction is to make the number of sounds in the lyrics (for example, the number of morae) match the number of notes in the melody. For example, when the number of morae in the lyrics is less than the number of notes in the melody (when there are not enough characters), the song generation unit 522 requests the lyrics generation unit 5222 to increase the number of characters in the lyrics. Alternatively, when the number of morae in the lyrics is greater than the number of notes in the melody (when there are characters left over), the song generation unit 522 requests the melody generation unit 5221 to increase the number of notes in the melody. In this figure, an example of correcting the lyrics is described.
  • the lyrics generation unit 5222 corrects the lyrics in response to the request for correction.
  • When the melody is to be corrected, the melody generation unit 5221 corrects the melody by, for example, dividing a note to increase the number of notes.
  • the lyric generation unit 5222 or the melody generation unit 5221 may adjust the lyric phrase delimiter to match the melody phrase delimiter.
  • the lyrics generation unit 5222 outputs the corrected lyrics to the song generation unit 522 (step S218).
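
The correction that makes the number of morae in the lyrics and the number of notes in the melody agree can be sketched as follows; padding with a filler mora and splitting the last note are toy rules assumed for illustration, not the patent's actual correction algorithm.

    def match_counts(morae, notes, filler="la"):
        """morae: list of lyric morae; notes: list of (pitch, duration) tuples.
        Returns a corrected (morae, notes) pair of equal length."""
        morae, notes = list(morae), list(notes)
        # Not enough characters: the lyrics side is extended
        # (here simply padded with a filler mora).
        while len(morae) < len(notes):
            morae.append(filler)
        # Characters left over: the melody side is extended
        # (here the last note is split into two shorter notes).
        while len(notes) < len(morae):
            pitch, duration = notes[-1]
            notes[-1] = (pitch, duration / 2)
            notes.append((pitch, duration / 2))
        return morae, notes

    morae = ["ko", "n", "ni", "chi", "wa"]
    notes = [("C4", 1.0), ("D4", 1.0), ("E4", 1.0)]
    print(match_counts(morae, notes))
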
  • the song generation unit 522 selects the segment database 5162 used for song synthesis (step S219).
  • the segment database 5162 is selected according to the user's attribute regarding the event that triggered the singing synthesis, for example.
  • the segment database 5162 may be selected according to the content of the event that triggered the song synthesis. Further alternatively, the segment database 5162 may be selected according to user preference information recorded in the classification table 5161.
  • the song generation unit 522 synthesizes the speech unit extracted from the selected unit database 5162 according to the lyrics and the melody obtained in the process so far, and obtains synthesized song data (step S220).
  • The classification table 5161 may also record information indicating the user's preferences regarding singing expression, such as changes in voice color, holding back the timing (“tame”), scooping up to the pitch (“shakuri”), and vibrato, and the song generation unit 522 may refer to this information to synthesize a singing that reflects expression matching the user's preference.
  • The song generation unit 522 outputs the generated synthesized singing data to the synthesis unit 524 (step S221).
  • the song generation unit 522 requests the accompaniment generation unit 523 to generate an accompaniment (S222).
  • This request includes information indicating a melody in singing synthesis.
  • the accompaniment generation unit 523 generates an accompaniment according to the melody included in the request (step S223).
  • a well-known technique is used as a technique for automatically adding an accompaniment to a melody.
  • When data indicating the chord progression of the melody (chord progression data) is generated together with the melody, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data.
  • Alternatively, when accompaniment chord progression data for a melody is recorded in the melody database, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data.
  • The accompaniment generation unit 523 may also store a plurality of pieces of accompaniment audio data in advance and read out the one that matches the chord progression of the melody.
  • Upon receiving the synthesized singing data and the accompaniment data, the synthesis unit 524 synthesizes the synthesized singing and the accompaniment (step S225). In this synthesis, the singing and the accompaniment are synchronized by matching their performance start positions and tempos. In this way, synthesized singing data with accompaniment is obtained. The synthesis unit 524 outputs the synthesized singing data.
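
The final step of the synthesis unit 524, synchronizing and combining the synthesized singing with the accompaniment, is sketched below using plain lists of samples; aligning the start positions by an offset and summing the samples is one simple assumed way of combining the two signals.

    def mix_singing_and_accompaniment(singing, accompaniment, start_offset=0):
        """singing, accompaniment: lists of float samples at the same rate.
        start_offset: sample index at which the singing starts, so that the
        performance start positions are aligned."""
        length = max(len(singing) + start_offset, len(accompaniment))
        mixed = [0.0] * length
        for i, sample in enumerate(singing):
            mixed[i + start_offset] += sample
        for i, sample in enumerate(accompaniment):
            mixed[i] += sample
        return mixed

    singing = [0.2, 0.3, 0.1]
    accompaniment = [0.1, 0.1, 0.1, 0.1, 0.1]
    print(mix_singing_and_accompaniment(singing, accompaniment, start_offset=1))
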
  • the voice response system 1 may generate a melody first, and then generate lyrics according to the melody.
  • In another example, the accompaniment need not be generated after the singing; the accompaniment may be generated first, and the singing may be synthesized in accordance with the accompaniment.
  • FIG. 12 is a diagram illustrating a functional configuration of the voice response system 1 according to the response function 53.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, and a content decomposition unit 531.
  • the content decomposition unit 531 decomposes one content into a plurality of partial contents.
  • the content refers to the content of information output as response voice, and specifically refers to, for example, music, news, recipes, or teaching materials (sports learning, instrument learning, learning drill, quiz).
  • FIG. 13 is a flowchart illustrating the operation of the voice response system 1 according to the response function 53.
  • the voice analysis unit 511 specifies content to be played back.
  • the content to be reproduced is specified according to, for example, the user input voice.
  • the voice analysis unit 511 analyzes the input voice and specifies the content instructed to be played by the input voice.
  • For example, the voice analysis unit 511 instructs the processing unit 510 to provide a “hamburger recipe”.
  • the processing unit 510 accesses the content providing unit 60 and acquires text data describing the “hamburger recipe”. The data acquired in this way is specified as the content to be played back.
  • the processing unit 510 notifies the content decomposition unit 531 of the identified content.
  • In step S32, the content decomposition unit 531 decomposes the content into a plurality of partial contents.
  • For example, a “hamburger recipe” is composed of a plurality of steps (cutting the ingredients, mixing the ingredients, molding, baking, and so on), so the content decomposition unit 531 decomposes the text of the “hamburger recipe” into four partial contents: “cutting the ingredients”, “mixing the ingredients”, “molding”, and “baking”.
  • the content decomposition position is automatically determined by, for example, AI.
  • a marker indicating a delimiter may be embedded in the content in advance, and the content may be decomposed at the position of the marker.
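
Decomposition of content into partial contents, either at markers embedded in advance or at simple boundaries, might look like the following sketch; the marker string and the sample recipe text are invented for illustration.

    MARKER = "<<step>>"  # hypothetical delimiter embedded in the content

    def decompose(content):
        """Split the content at embedded markers; fall back to splitting on
        blank lines when no marker is present."""
        parts = content.split(MARKER) if MARKER in content else content.split("\n\n")
        return [part.strip() for part in parts if part.strip()]

    recipe = ("Cut the ingredients.<<step>>Mix the ingredients."
              "<<step>>Mold the patties.<<step>>Bake them.")
    for number, partial in enumerate(decompose(recipe), 1):
        print(number, partial)
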
  • In step S33, the content decomposition unit 531 specifies one target partial content from among the plurality of partial contents (an example of a specifying unit).
  • the target partial content is the partial content to be played back, and is determined according to the positional relationship of the partial content in the original content.
  • In the example of the “hamburger recipe”, the content decomposition unit 531 first specifies “cutting the ingredients” as the target partial content.
  • The next time, the content decomposition unit 531 specifies “mixing the ingredients” as the target partial content.
  • the content decomposition unit 531 notifies the content modification unit 532 of the identified partial content.
  • the content correction unit 532 corrects the target partial content.
  • a specific correction method is defined according to the content. For example, the content correction unit 532 does not correct content such as news, weather information, and recipes.
  • For content such as teaching materials, the content correction unit 532 replaces the portion to be hidden as a question with another sound (for example, humming, “la la”, or a beep sound).
  • At this time, the content correction unit 532 performs the replacement using a character string having the same number of morae or syllables as the character string before the replacement.
  • the content correction unit 532 outputs the corrected partial content to the song generation unit 522.
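
Replacing the portion to be hidden with filler syllables of the same mora count, so that the replaced lyrics still fit the melody, can be sketched as follows; treating each list element as one mora is a naive stand-in for real mora counting.

    def hide_with_filler(morae, start, end, filler="la"):
        """Replace morae[start:end] with filler syllables of the same mora
        count, as described for the content correction unit 532."""
        hidden_length = end - start
        return morae[:start] + [filler] * hidden_length + morae[end:]

    # Hide the first four morae of a line while keeping the mora count.
    line = ["su", "i", "cchi", "A", "wo", "o", "su"]
    print(hide_with_filler(line, 0, 4))  # -> ['la', 'la', 'la', 'la', 'wo', 'o', 'su']
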
  • In step S35, the song generation unit 522 synthesizes a singing voice of the corrected partial content.
  • the singing voice generated by the singing generation unit 522 is finally output as a response voice from the input / output device 10.
  • the voice response system 1 waits for a user response (step S36).
  • In step S36, the voice response system 1 may output a singing or a voice that prompts the user to respond (for example, “Are you done?”).
  • the voice analysis unit 511 determines the next process according to the user response. When a response for prompting the reproduction of the next partial content is input (S36: next), the voice analysis unit 511 moves the process to step S33.
  • the response that prompts the reproduction of the next partial content is, for example, a voice such as “next step”, “completed”, “finished”, or the like.
  • When a response other than one prompting the reproduction of the next partial content is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop outputting the voice.
  • In step S37, the processing unit 510 stops the output of the synthesized voice of the partial content, at least temporarily.
  • In step S38, the processing unit 510 performs processing according to the user's input voice.
  • The processing in step S38 includes, for example, stopping the playback of the current content, performing a keyword search instructed by the user, and starting the playback of other content. For example, when a response such as “I want you to stop singing”, “End the song”, or “End” is input, the processing unit 510 stops the playback of the current content. For example, when a question-type response such as “How do I cut into strips?” or “What is aglio e olio?” is input, the processing unit 510 acquires content for answering the user's question from the content providing unit 60.
  • the processing unit 510 outputs a sound of an answer to the user's question. This answer may be spoken voice instead of singing.
  • When a response instructing the playback of other content, such as “Play XXX”, is input, the processing unit 510 acquires the instructed content from the content providing unit 60 and plays it back.
  • Note that the voice response system 1 may determine, according to the user's input voice or according to the content to be output, whether to decompose the content into partial contents or to output it as it is without decomposition.
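
The overall loop of FIG. 13 (sing one partial content, wait for the user's reaction, then continue, answer, or stop) can be outlined as below; the reaction keywords and helper behaviour are placeholders, and in the patent's flow the system may also resume the content after answering a question, which the sketch omits for brevity.

    NEXT_WORDS = {"next step", "completed", "finished"}
    STOP_WORDS = {"end", "stop singing"}

    def respond_with_song(partial_contents, reactions):
        """partial_contents: list of text chunks to be sung in order.
        reactions: iterable of user reactions (stands in for step S36)."""
        reactions = iter(reactions)
        for partial in partial_contents:
            print("SING:", partial)              # S35: output the partial content as singing
            reaction = next(reactions, "end")    # S36: wait for the user's reaction
            if reaction in NEXT_WORDS:
                continue                         # play back the next partial content
            if reaction in STOP_WORDS:
                print("STOP")                    # S37: stop the output
                return
            print("ANSWER:", reaction)           # S38: e.g. answer the user's question
            return

    steps = ["Cut the ingredients", "Mix the ingredients", "Mold", "Bake"]
    respond_with_song(steps, ["completed", "How do I cut into strips?"])
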
  • FIG. 14 is a diagram illustrating an operation example 1 of the voice response system 1.
  • the user requests the reproduction of the musical piece by the input voice of “Kazutaro Sato (performer name)“ Sakura Sakura ”(music name)”.
  • the voice response system 1 searches the music database according to the input voice and reproduces the requested music.
  • the voice response system 1 updates the classification table using the emotion of the user when the input voice is input and the analysis result of the music.
  • the classification table is updated every time music playback is requested.
  • the classification table more reflects the user's preference as the number of times the user requests the voice response system 1 to play a song increases (that is, as the cumulative usage time of the voice response system 1 increases). Go.
  • FIG. 15 is a diagram illustrating an operation example 2 of the voice response system 1.
  • the user requests singing synthesis with an input voice of "Sing something fun”.
  • the voice response system 1 performs singing synthesis according to the input voice.
  • the voice response system 1 refers to the classification table. Lyrics and melodies are generated using information recorded in the classification table. Therefore, it is possible to automatically create music that reflects the user's preferences.
  • FIG. 16 is a diagram illustrating an operation example 3 of the voice response system 1.
  • the user requests the provision of weather information by an input voice “What is the weather today?”.
  • the processing unit 510 accesses a server that provides weather information in the content providing unit 60 and acquires text indicating today's weather (for example, “Today is sunny all day”).
  • the processing unit 510 outputs a song synthesis request including the acquired text to the song generation unit 522.
  • the song generation unit 522 performs song synthesis using the text included in the request as lyrics.
  • As the answer to the input voice, the voice response system 1 outputs a singing voice in which a melody and an accompaniment are added to “Today is sunny all day”.
  • FIG. 17 is a diagram illustrating an operation example 4 of the voice response system 1.
  • the voice response system 1 asks the user a question in order to obtain information that can be used as a hint for generating lyrics, such as “Where is the meeting place?” And “When is the season?”.
  • The voice response system 1 generates lyrics using the user's answers to these questions. Since the usage period is still as short as two weeks, the classification table of the voice response system 1 does not yet sufficiently reflect the user's preference, and the association with emotions is also insufficient. Therefore, although the user actually prefers ballad-like music, the voice response system 1 may generate rock-like music that differs from that preference.
  • FIG. 18 is a diagram illustrating an operation example 5 of the voice response system 1.
  • This example shows an example in which the use of the voice response system 1 is further continued from the operation example 3, and the cumulative use period becomes one and a half months.
  • the classification table now reflects the user's preferences more closely, and the synthesized singing matches those preferences. The user can thus experience the responses of the voice response system 1, which were unsatisfactory at first, gradually changing to suit his or her taste.
  • FIG. 19 is a diagram illustrating an operation example 6 of the voice response system 1.
  • the user requests the provision of the “recipe” content for “hamburger” with the input voice “Tell me a recipe for hamburger”.
  • based on the fact that a “recipe” is content in which the user should proceed to the next step only after a given step is completed, the voice response system 1 decides to break the content down into partial contents and to play them in a manner in which the next step is determined according to the user's reaction.
  • the “recipe” for “hamburger” is decomposed step by step, and every time the singing for one step has been output, the voice response system 1 outputs a voice prompting the user's response, such as “Is it done?”.
  • the voice response system 1 outputs the singing of the next step in response.
  • the voice response system 1 outputs a singing of “chopped onion” in response.
  • the voice response system 1 then resumes the singing from the continuation of the “recipe” for “hamburger”.
  • the voice response system 1 may output the singing voice of another content between the singing voice of the first partial content and the singing voice of the second partial content that follows.
  • the voice response system 1 outputs, for example, a singing voice synthesized so as to have a time length corresponding to the matter indicated by the character string included in the first partial content, between the singing voice of the first partial content and that of the second partial content. Specifically, when the first partial content indicates that a waiting time of 20 minutes will occur, such as “Let's boil the ingredients for 20 minutes”, the voice response system 1 synthesizes a 20-minute song to be played while the ingredients are boiling.
  • alternatively, the voice response system 1 may synthesize a singing voice using a second character string corresponding to the matter indicated by the first character string included in the first partial content, and output it, after the singing voice of the first partial content, at a timing corresponding to the time length indicated by the first character string.
  • for example, this singing voice may be output 20 minutes after the first partial content is output. Alternatively, in the example where the first partial content is “Let's boil the ingredients for 20 minutes here”, when half of the waiting time (10 minutes) has passed, the system may sing something like “10 minutes left until the boiling is done”.
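A rough sketch of how the waiting time could be derived from the first partial content and used to plan the filler song and the half-way announcement follows. The regular expression and the returned dictionary keys are assumptions made only for illustration.

```python
import re
from typing import Optional

def parse_wait_minutes(step_text: str) -> Optional[int]:
    """Extract a waiting time such as '20 minutes' from the partial content text."""
    match = re.search(r"(\d+)\s*minutes?", step_text)
    return int(match.group(1)) if match else None

def plan_waiting_period(step_text: str) -> Optional[dict]:
    """Plan a filler song as long as the wait, plus a half-way reminder."""
    wait = parse_wait_minutes(step_text)
    if wait is None:
        return None
    return {"filler_song_minutes": wait, "halfway_reminder_at_minutes": wait // 2}

if __name__ == "__main__":
    print(plan_waiting_period("Let's boil the ingredients for 20 minutes"))
    # -> {'filler_song_minutes': 20, 'halfway_reminder_at_minutes': 10}
```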
  • FIG. 20 is a diagram illustrating an operation example 7 of the voice response system 1.
  • the user requests the provision of the “procedure manual” content with an input voice such as “Will you read the procedure manual for the process in the factory?”.
  • based on the fact that a “procedure manual” is content used to check the user's memory, the voice response system 1 decides to decompose the content into partial contents and to play them in a manner in which the next step is determined according to the user's reaction.
  • the voice response system 1 divides the procedure manual at random positions and breaks it down into a plurality of partial contents.
  • the voice response system 1 then waits for the user's reaction. For example, for the procedure “after pressing switch A, press switch B when the value of meter B becomes 10 or less”, the voice response system 1 sings only the part “after pressing switch A” and waits for the user's response.
  • when the user responds, the voice response system 1 outputs the singing of the next partial content.
  • the singing speed of the next partial content may be changed depending on whether or not the user was able to say the next partial content correctly. Specifically, when the user says the next partial content correctly, the voice response system 1 increases the speed at which the next partial content is sung; when the user cannot say it correctly, the voice response system 1 reduces that speed.
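The recall check and the tempo adjustment described above could look roughly like the sketch below. The random split, the similarity threshold, and the speed factors (1.2 and 0.8) are arbitrary illustrative values.

```python
import difflib
import random

def split_randomly(procedure: str):
    """Divide one procedure sentence at a random word boundary into two partial contents."""
    words = procedure.split()
    cut = random.randint(1, len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])

def next_singing_speed(expected: str, user_answer: str, base_speed: float = 1.0) -> float:
    """Sing faster when the user recalls the continuation correctly, slower otherwise."""
    similarity = difflib.SequenceMatcher(None, expected.lower(), user_answer.lower()).ratio()
    return base_speed * (1.2 if similarity > 0.8 else 0.8)

if __name__ == "__main__":
    sung_part, hidden_part = split_randomly(
        "after pressing switch A press switch B when the value of meter B becomes 10 or less")
    print("sung part:", sung_part)
    print("speed for next part:", next_singing_speed(hidden_part, "press switch B"))
```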
  • FIG. 21 is a diagram illustrating an operation example 8 of the voice response system 1.
  • operation example 8 is an operation example of a dementia countermeasure for elderly users. The fact that the user is an elderly person is set in advance by user registration or the like.
  • the voice response system 1 starts singing an existing song in accordance with, for example, a user instruction.
  • the voice response system 1 pauses the singing at a random position or at a predetermined position (for example, just before the chorus).
  • the voice response system 1 then outputs a remark such as “I don't know” or “I forgot”, behaving as if it had forgotten the lyrics.
  • the voice response system 1 waits for a user's response in this state.
  • when the user utters some words, the voice response system 1 treats the words uttered by the user as the correct lyrics and resumes the singing from the part that follows them. When the user utters something, the voice response system 1 may also output a response such as “Thank you”. When a predetermined time has elapsed while waiting for the user's response, the voice response system 1 may output a remark such as “I remembered” and resume singing from the continuation of the paused portion.
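The pause-and-resume behavior of this operation example is sketched below, with console input and output standing in for singing and speech. The sample lyric lines and prompts are placeholders, and the elapsed-time case is simplified to an empty reply.

```python
SAMPLE_LYRICS = ["haru ga kita", "haru ga kita", "doko ni kita",
                 "yama ni kita", "sato ni kita", "no ni mo kita"]

def lyrics_quiz(lines, pause_at=2, ask=input):
    """Sing up to the pause position, pretend to have forgotten the lyrics,
    wait for the user, then resume after whatever the user supplied."""
    for line in lines[:pause_at]:
        print("[singing]", line)
    print("[speaking] I forgot the next part...")

    answer = ask("Next line? ").strip()
    if answer:
        print("[speaking] Thank you!")
        resume = pause_at + 1   # treat the user's words as the correct paused line
    else:
        print("[speaking] Ah, I remembered!")
        resume = pause_at       # resume from the paused position

    for line in lines[resume:]:
        print("[singing]", line)

if __name__ == "__main__":
    lyrics_quiz(SAMPLE_LYRICS)
```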
  • FIG. 22 is a diagram illustrating an operation example 9 of the voice response system 1.
  • the user requests singing synthesis with an input voice of "Sing something fun".
  • the voice response system 1 performs singing synthesis according to the input voice.
  • the segment database used for singing synthesis is selected, for example, according to the character chosen at the time of user registration (for example, when a male character is selected, a segment database recorded by a male singer is used).
  • during the song, the user utters an input voice instructing a change of segment database, such as “Change to a female voice”.
  • the voice response system 1 switches the segment database used for singing synthesis according to a user's input voice.
  • the segment database may be switched while the voice response system 1 is outputting a singing voice, or while the voice response system 1 is waiting for a response from the user as in operation examples 7 and 8.
  • the voice response system 1 may have a plurality of segment databases that record phonemes that are pronounced by a single singer (or speaker) with different singing styles or voice colors.
  • the voice response system 1 may use segments extracted from a plurality of segment databases in combination at a certain ratio (usage ratio), that is, it may add the corresponding phonemes together at that ratio.
  • the voice response system 1 may determine the usage ratio according to the user's reaction. Specifically, when two segment databases, one with a normal voice and one with a sweet voice, are recorded for a single singer, the voice response system 1 increases the usage ratio of the sweet-voice segment database when the user utters the input voice “In a sweeter voice”, and increases that usage ratio further when the user utters “In a much sweeter voice”.
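One way to realize such a usage-ratio adjustment is sketched below; the 0.2 step size and the reaction keywords are illustrative assumptions.

```python
def adjust_usage_ratio(ratios: dict, reaction: str, step: float = 0.2) -> dict:
    """Shift the blend toward the sweet-voice segment database according to the reaction."""
    sweet = ratios.get("sweet", 0.0)
    text = reaction.lower()
    if "much sweeter" in text:
        sweet += 2 * step
    elif "sweeter" in text:
        sweet += step
    sweet = min(sweet, 1.0)
    return {"sweet": round(sweet, 2), "normal": round(1.0 - sweet, 2)}

if __name__ == "__main__":
    ratios = {"normal": 1.0, "sweet": 0.0}
    for utterance in ["In a sweeter voice", "In a much sweeter voice"]:
        ratios = adjust_usage_ratio(ratios, utterance)
        print(utterance, "->", ratios)
```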
  • the singing voice refers to a voice that includes singing in at least a part of it, and may include sections consisting only of accompaniment without singing, or sections consisting only of speech.
  • at least one partial content may not include a song.
  • Singing may also include raps or poetry readings.
  • an example in which the learning function 51, the singing synthesis function 52, and the response function 53 are related to each other has been described, but these functions may be provided independently of one another.
  • the classification table obtained by the learning function 51 may be used to know the user's preference in a music distribution system that distributes music, for example.
  • the singing synthesis function 52 may perform singing synthesis using a classification table manually input by the user.
  • at least some of the functional elements of the voice response system 1 may be omitted.
  • the voice response system 1 may not have the emotion estimation unit 512.
  • the voice analysis unit 511 and the emotion estimation unit 512 may be implemented in the input / output device.
  • as for the relative arrangement of the input / output device 10, the response engine 20, and the singing synthesis engine 30, the singing synthesis engine 30 may, for example, be arranged between the input / output device 10 and the response engine 20, and singing synthesis may be performed on those responses output from the response engine 20 that are determined to require it.
  • the content used in the voice response system 1 may be stored in a local device such as the input / output device 10 or a device capable of communicating with the input / output device 10.
  • the input / output device 10, the response engine 20, and the singing synthesis engine 30 may be implemented on, for example, a smartphone or a tablet terminal.
  • the user input to the voice response system 1 is not limited to voice input, and may be input via a touch screen, a keyboard, or a pointing device.
  • the input / output device 10 may have a human sensor.
  • the voice response system 1 may use the human sensor to control its operation depending on whether or not the user is nearby. For example, when it is determined that the user is not near the input / output device 10, the voice response system 1 may operate so as not to output voice (not to respond to speech).
  • alternatively, the voice response system 1 may output certain voices regardless of whether the user is near the input / output device 10. For example, as described in the second half of operation example 6, the voice response system 1 may output the voice announcing the remaining waiting time whether or not the user is near the input / output device 10.
  • a sensor other than a human sensor such as a camera or a temperature sensor may be used, or a plurality of sensors may be used in combination.
  • the programs executed in the input / output device 10, the response engine 20, and the singing synthesis engine 30 may be provided stored in a recording medium such as a CD-ROM or semiconductor memory, or may be provided by download via a network such as the Internet.
  • a singing voice can be output according to the interaction with the user, which is useful.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Pathology (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Psychology (AREA)
  • Educational Technology (AREA)
  • Developmental Disabilities (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

This method for outputting a singing voice has: a step (S31) of specifying a first partial content from among a plurality of partial contents obtained by analyzing content; a step (S35) of outputting a singing voice synthesized using a character string included in the first partial content; a step (S36) of accepting a reaction from a user to the singing voice; and a step (S35) of outputting, in accordance with the reaction, a singing voice synthesized using a character string included in a second partial content that continues from the first partial content.

Description

Singing voice output method and voice response system
The present invention relates to a technique for responding to user input using a voice that includes singing.
There are techniques for outputting music in response to user instructions. Patent Document 1 describes a technique for changing the atmosphere of music according to the user's situation and preferences. Patent Document 2 describes a technique for making distinctive music selections that do not become tiresome in a device that outputs musical sounds according to the state of a moving body.
[Patent Document 1] Japanese Unexamined Patent Application Publication No. 2006-85045
[Patent Document 2] Japanese Patent No. 4496993
Neither Patent Document 1 nor Patent Document 2, however, outputs a singing voice in accordance with the interaction with the user.
In view of this, the present invention provides a technique for outputting a singing voice in accordance with the interaction with the user.
The present invention provides a method for outputting a singing voice, the method including: a step of decomposing content into a plurality of partial contents; a step of specifying a first partial content from among the plurality of partial contents; a step of synthesizing a first singing voice using a character string included in the first partial content; a step of outputting the first singing voice; a step of accepting a user's reaction to the first singing voice; a step of specifying, in response to the user's reaction, a second partial content related to the first partial content; a step of synthesizing a second singing voice using a character string included in the second partial content; and a step of outputting the second singing voice. The content includes, for example, a character string.
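Purely as an illustration of this flow (decompose, sing the first partial content, accept the user's reaction, then sing the related second partial content), a minimal Python sketch follows. The decomposition by line breaks and the "stop" keyword are assumptions, not part of the claimed method.

```python
from typing import Callable, List

def decompose(content: str) -> List[str]:
    # Decomposition step: here, simply one partial content per line of text.
    return [line.strip() for line in content.splitlines() if line.strip()]

def synthesize_singing(text: str) -> str:
    # Stand-in for synthesizing a singing voice from a character string.
    return f"<sung: {text}>"

def output_as_song(content: str, get_reaction: Callable[[], str]) -> None:
    """Output the first partial content as singing, accept the user's reaction,
    and output the following partial content according to that reaction."""
    parts = decompose(content)
    index = 0
    while index < len(parts):
        print(synthesize_singing(parts[index]))  # output the current singing voice
        reaction = get_reaction()                # accept the user's reaction
        if reaction.lower().startswith("stop"):
            break                                # the reaction may also end the flow
        index += 1                               # specify the related next partial content

if __name__ == "__main__":
    recipe = "Chop the onion\nFry the onion\nMix with the ground meat"
    output_as_song(recipe, get_reaction=lambda: "done")
```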
The singing voice output method may include a step of determining, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
The element may include a parameter, melody, or tempo of the singing synthesis, or an arrangement of the accompaniment in the singing voice.
The first singing voice and the second singing voice may be synthesized using segments recorded in at least one database selected from among a plurality of databases, and the singing voice output method may include a step of selecting, in response to the user's reaction, the database used when synthesizing singing with the character string included in the second partial content.
The first singing voice and the second singing voice may be synthesized using segments recorded in a plurality of databases selected from among the plurality of databases; in the step of selecting a database, a plurality of databases may be selected, and the singing voice output method may include a step of determining the usage ratio of the plurality of databases according to the user's reaction.
The singing voice output method may include a step of replacing a part of the character string included in the first partial content with another character string, and in the step of synthesizing the first singing voice, the first singing voice may be synthesized using the character string of the first partial content in which that part has been replaced with the other character string.
The other character string and the character string to be replaced may have the same number of syllables or the same number of morae.
The singing voice output method may include a step of replacing, in response to the user's reaction, a part of the second partial content with another character string, and in the step of synthesizing the second singing voice, the second singing voice may be synthesized using the character string of the second partial content in which that part has been replaced with the other character string.
The singing voice output method may include a step of synthesizing a third singing voice so as to have a time length corresponding to the matter indicated by the character string included in the first partial content, and a step of outputting the third singing voice between the first singing voice and the second singing voice.
The singing voice output method may include a step of synthesizing a fourth singing voice using a second character string corresponding to the matter indicated by a first character string included in the first partial content, and a step of outputting, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length corresponding to the matter indicated by the first character string.
The present invention also provides an information processing system including: a decomposition unit that decomposes content into a plurality of partial contents; a specifying unit that specifies a first partial content from among the plurality of partial contents; a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content; an output unit that outputs the first singing voice; and a reception unit that accepts a user's reaction to the first singing voice, wherein, in response to the user's reaction, the specifying unit specifies a second partial content related to the first partial content, the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and the output unit outputs the second singing voice.
According to the present invention, a singing voice can be output in accordance with the interaction with the user.
FIG. 1 is a diagram illustrating an outline of a voice response system 1 according to an embodiment.
FIG. 2 is a diagram illustrating an outline of the functions of the voice response system 1.
FIG. 3 is a diagram illustrating the hardware configuration of the input / output device 10.
FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30.
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51.
FIG. 6 is a flowchart showing an outline of the operation related to the learning function 51.
FIG. 7 is a sequence chart illustrating the operation related to the learning function 51.
FIG. 8 is a diagram illustrating a classification table 5161.
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52.
FIG. 10 is a flowchart showing an outline of the operation related to the singing synthesis function 52.
FIG. 11 is a sequence chart illustrating the operation related to the singing synthesis function 52.
FIG. 12 is a diagram illustrating the functional configuration related to the response function 53.
FIG. 13 is a flowchart illustrating the operation related to the response function 53.
FIG. 14 is a diagram showing an operation example 1 of the voice response system 1.
FIG. 15 is a diagram showing an operation example 2 of the voice response system 1.
FIG. 16 is a diagram showing an operation example 3 of the voice response system 1.
FIG. 17 is a diagram showing an operation example 4 of the voice response system 1.
FIG. 18 is a diagram showing an operation example 5 of the voice response system 1.
FIG. 19 is a diagram showing an operation example 6 of the voice response system 1.
FIG. 20 is a diagram showing an operation example 7 of the voice response system 1.
FIG. 21 is a diagram showing an operation example 8 of the voice response system 1.
FIG. 22 is a diagram showing an operation example 9 of the voice response system 1.
1. System overview
FIG. 1 is a diagram illustrating an overview of a voice response system 1 according to an embodiment. The voice response system 1 is a system that automatically outputs a voice response when the user gives an input (or instruction) by voice, that is, a so-called AI (Artificial Intelligence) voice assistant. Hereinafter, the voice input from the user to the voice response system 1 is referred to as the "input voice", and the voice output from the voice response system 1 in response to the input voice is referred to as the "response voice". The voice response includes singing. The voice response system 1 is an example of a singing synthesis system. For example, when the user says "Sing something" to the voice response system 1, the voice response system 1 automatically synthesizes a song and outputs the synthesized singing.
The voice response system 1 includes an input / output device 10, a response engine 20, and a singing synthesis engine 30. The input / output device 10 provides the man-machine interface: it accepts the input voice from the user and outputs the response voice to that input voice. The response engine 20 analyzes the input voice accepted by the input / output device 10 and generates the response voice. At least a part of the response voice includes a singing voice. The singing synthesis engine 30 synthesizes the singing voice used in the response voice.
FIG. 2 is a diagram illustrating an overview of the functions of the voice response system 1. The voice response system 1 has a learning function 51, a singing synthesis function 52, and a response function 53. The response function 53 analyzes the user's input voice and provides a response voice based on the analysis result, and is provided by the input / output device 10 and the response engine 20. The learning function 51 learns the user's preferences from the user's input voice and is provided by the singing synthesis engine 30. The singing synthesis function 52 synthesizes the singing voice used in the response voice and is provided by the singing synthesis engine 30. The learning function 51 learns the user's preferences using the analysis results obtained by the response function 53. The singing synthesis function 52 synthesizes singing voices based on the learning performed by the learning function 51. The response function 53 responds using the singing voice synthesized by the singing synthesis function 52.
FIG. 3 is a diagram illustrating the hardware configuration of the input / output device 10. The input / output device 10 includes a microphone 101, an input signal processing unit 102, an output signal processing unit 103, a speaker 104, a CPU (Central Processing Unit) 105, a sensor 106, a motor 107, and a network IF 108. The microphone 101 converts the user's voice into an electric signal (input sound signal). The input signal processing unit 102 performs processing such as analog-to-digital conversion on the input sound signal and outputs data representing the input voice (hereinafter "input voice data"). The output signal processing unit 103 performs processing such as digital-to-analog conversion on data representing the response voice (hereinafter "response voice data") and outputs an output sound signal. The speaker 104 converts the output sound signal into sound (outputs sound based on the output sound signal). The CPU 105 controls the other elements of the input / output device 10 and reads and executes programs from a memory (not shown). The sensor 106 detects the position of the user (the direction of the user as seen from the input / output device 10) and is, for example, an infrared sensor or an ultrasonic sensor. The motor 107 changes the orientation of at least one of the microphone 101 and the speaker 104 so that it faces the direction in which the user is present. The microphone 101 may be configured as a microphone array, and the CPU 105 may detect the direction of the user based on the sound picked up by the microphone array. The network IF 108 is an interface for communicating via a network (for example, the Internet) and includes, for example, an antenna and a chipset for communication in accordance with a predetermined wireless communication standard (for example, WiFi (registered trademark)).
FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30. The response engine 20 includes a CPU 201, a memory 202, a storage 203, and a communication IF 204. The CPU 201 performs various operations according to programs and controls the other elements of the computer device. The memory 202 is a main storage device that functions as a work area when the CPU 201 executes programs and includes, for example, a RAM (Random Access Memory). The storage 203 is a nonvolatile auxiliary storage device that stores various programs and data and includes, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The communication IF 204 includes a connector and a chipset for communication in accordance with a predetermined communication standard (for example, Ethernet). The storage 203 stores a program for causing the computer device to function as the response engine 20 of the voice response system 1 (hereinafter the "response program"). When the CPU 201 executes the response program, the computer device functions as the response engine 20. The response engine 20 is, for example, a so-called AI.
The singing synthesis engine 30 includes a CPU 301, a memory 302, a storage 303, and a communication IF 304. The details of each element are the same as those of the response engine 20. The storage 303 stores a program for causing the computer device to function as the singing synthesis engine 30 of the voice response system 1 (hereinafter the "singing synthesis program"). When the CPU 301 executes the singing synthesis program, the computer device functions as the singing synthesis engine 30.
The response engine 20 and the singing synthesis engine 30 are provided as cloud services on the Internet. Note that the response engine 20 and the singing synthesis engine 30 may also be provided as services that do not rely on cloud computing.
2. Learning function
2-1. Configuration
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51. As functional elements related to the learning function 51, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a music analysis unit 513, a lyrics extraction unit 514, a preference analysis unit 515, a storage unit 516, and a processing unit 510. The input / output device 10 functions as a reception unit that accepts the user's input voice and as an output unit that outputs the response voice.
The voice analysis unit 511 analyzes the input voice. This analysis is processing for obtaining, from the input voice, information used to generate the response voice. Specifically, it includes processing for converting the input voice into text (that is, into a character string), processing for determining the user's request from the obtained text, processing for identifying the content providing unit 60 that provides content in response to the user's request, processing for issuing an instruction to the identified content providing unit 60, processing for acquiring data from the content providing unit 60, and processing for generating a response using the acquired data. In this example, the content providing unit 60 is a system external to the voice response system 1. The content providing unit 60 provides a service (for example, a music streaming service or internet radio) that outputs data for reproducing content such as music as sound (hereinafter "music data"), and is, for example, a server external to the voice response system 1.
The music analysis unit 513 analyzes the music data output from the content providing unit 60. Analyzing music data means processing for extracting the features of the music. The features of the music include at least one of tune, rhythm, chord progression, tempo, and arrangement. A known technique is used for feature extraction.
The lyrics extraction unit 514 extracts lyrics from the music data output from the content providing unit 60. In one example, the music data includes metadata in addition to sound data. The sound data represents the signal waveform of the music and includes, for example, uncompressed data such as PCM (Pulse Code Modulation) data or compressed data such as MP3 data. The metadata is data containing information related to the music, for example attributes of the music such as the title, performer name, composer name, lyricist name, album title, and genre, as well as information such as the lyrics. The lyrics extraction unit 514 extracts the lyrics from the metadata included in the music data. When the music data does not include metadata, the lyrics extraction unit 514 performs speech recognition on the sound data and extracts the lyrics from the text obtained by the speech recognition.
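The metadata-first, recognition-fallback behavior of the lyrics extraction unit 514 could be sketched as follows. The dictionary layout of the music data and the placeholder recognizer are assumptions made only for illustration.

```python
from typing import Optional

def recognize_lyrics_from_audio(sound_data: bytes) -> str:
    # Placeholder for a speech-recognition pass over the sound data;
    # a real system would call an ASR engine here.
    return "(lyrics recognized from the audio)"

def extract_lyrics(music_data: dict) -> Optional[str]:
    """Prefer the lyrics field of the metadata; fall back to speech recognition
    on the sound data when no metadata is available."""
    metadata = music_data.get("metadata") or {}
    if "lyrics" in metadata:
        return metadata["lyrics"]
    sound_data = music_data.get("sound_data")
    return recognize_lyrics_from_audio(sound_data) if sound_data else None

if __name__ == "__main__":
    print(extract_lyrics({"metadata": {"title": "Sakura Sakura", "lyrics": "sakura sakura ..."}}))
    print(extract_lyrics({"sound_data": b"\x00\x01"}))
```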
The emotion estimation unit 512 estimates the user's emotion. The emotion estimation unit 512 estimates the user's emotion from the input voice, using a known emotion estimation technique. The emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. The emotion estimation unit 512 may also estimate the user's emotion based on the input voice converted into text by the voice analysis unit 511 or on the analyzed user request.
The preference analysis unit 515 generates information indicating the user's preferences (hereinafter "preference information") using at least one of the playback history, analysis results, and lyrics of the music the user has requested, and the user's emotion at the time the playback of that music was requested. The preference analysis unit 515 updates the classification table 5161 stored in the storage unit 516 using the generated preference information. The classification table 5161 is a table (or database) in which the user's preferences are recorded; for example, it records, for each user and each emotion, the features of music (for example, timbre, tune, rhythm, chord progression, and tempo), the attributes of music (performer name, composer name, lyricist name, and genre), and lyrics. The storage unit 516 is an example of a reading unit that reads, from a table in which parameters used for singing synthesis are recorded in association with users, the parameters corresponding to the user who input the trigger. The parameters used for singing synthesis are data referred to during singing synthesis, and the classification table 5161 is a concept that encompasses timbre, tune, rhythm, chord progression, tempo, performer name, composer name, lyricist name, genre, and lyrics.
2-2. Operation
FIG. 6 is a flowchart showing an outline of the operation of the voice response system 1 related to the learning function 51. In step S11, the voice response system 1 analyzes the input voice. In step S12, the voice response system 1 performs the processing instructed by the input voice. In step S13, the voice response system 1 determines whether the input voice includes an item to be learned. When it is determined that the input voice includes an item to be learned (S13: YES), the voice response system 1 proceeds to step S14. When it is determined that the input voice does not include an item to be learned (S13: NO), the voice response system 1 proceeds to step S18. In step S14, the voice response system 1 estimates the user's emotion. In step S15, the voice response system 1 analyzes the music whose playback was requested. In step S16, the voice response system 1 acquires the lyrics of the music whose playback was requested. In step S17, the voice response system 1 updates the classification table using the information obtained in steps S14 to S16.
The processing from step S18 onward is not directly related to the learning function 51, that is, to updating the classification table, but includes processing that uses the classification table. In step S18, the voice response system 1 generates a response voice to the input voice, referring to the classification table as necessary. In step S19, the voice response system 1 outputs the response voice.
FIG. 7 is a sequence chart illustrating the operation of the voice response system 1 related to the learning function 51. The user performs user registration with the voice response system 1, for example when subscribing to the voice response system 1 or when starting it for the first time. User registration includes setting a user name (or login ID) and a password. At the start of the sequence in FIG. 7, the input / output device 10 is running and the user's login processing has been completed. That is, in the voice response system 1, the user who is using the input / output device 10 has been identified. The input / output device 10 is waiting for the user's voice input (utterance). Note that the method by which the voice response system 1 identifies the user is not limited to login processing. For example, the voice response system 1 may identify the user based on the input voice.
In step S101, the input / output device 10 accepts the input voice. The input / output device 10 converts the input voice into data and generates voice data. The voice data includes sound data representing the signal waveform of the input voice, and a header. The header contains information indicating the attributes of the input voice. The attributes of the input voice include, for example, an identifier specifying the input / output device 10, the user identifier (for example, user name or login ID) of the user who uttered the voice, and a time stamp indicating the time at which the voice was uttered. In step S102, the input / output device 10 outputs the voice data representing the input voice to the voice analysis unit 511.
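The structure of the voice data described in step S101 can be pictured with the following sketch; the field names are illustrative, not taken from the specification.

```python
from dataclasses import dataclass
import time

@dataclass
class VoiceDataHeader:
    device_id: str     # identifier specifying the input / output device 10
    user_id: str       # user name or login ID of the speaker
    timestamp: float   # time at which the voice was uttered

@dataclass
class VoiceData:
    header: VoiceDataHeader
    sound_data: bytes  # signal waveform of the input voice

if __name__ == "__main__":
    packet = VoiceData(
        header=VoiceDataHeader(device_id="io-device-10",
                               user_id="taro.yamada",
                               timestamp=time.time()),
        sound_data=b"...pcm samples...")
    print(packet.header)
```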
In step S103, the voice analysis unit 511 analyzes the input voice using the voice data. In this analysis, the voice analysis unit 511 determines whether the input voice includes an item to be learned. An item to be learned is an item that identifies a piece of music, specifically an instruction to play a piece of music.
In step S104, the processing unit 510 performs the processing instructed by the input voice. The processing performed by the processing unit 510 is, for example, streaming playback of music. In this case, the content providing unit 60 has a music database in which a plurality of pieces of music data are recorded. The processing unit 510 reads the music data of the requested piece from the music database and transmits the read music data to the input / output device 10 from which the input voice was sent. In another example, the processing performed by the processing unit 510 is playback of internet radio. In this case, the content providing unit 60 performs streaming broadcasting of radio audio. The processing unit 510 transmits the streaming data received from the content providing unit 60 to the input / output device 10 from which the input voice was sent.
When it is determined in step S103 that the input voice includes an item to be learned, the processing unit 510 further performs processing for updating the classification table (step S105). The processing for updating the classification table includes a request for emotion estimation to the emotion estimation unit 512 (step S1051), a request for music analysis to the music analysis unit 513 (step S1052), and a request for lyrics extraction to the lyrics extraction unit 514 (step S1053).
When emotion estimation is requested, the emotion estimation unit 512 estimates the user's emotion (step S106) and outputs information indicating the estimated emotion (hereinafter "emotion information") to the processing unit 510 that made the request (step S107). The emotion estimation unit 512 estimates the user's emotion using the input voice. For example, the emotion estimation unit 512 estimates the emotion based on the input voice converted into text. In one example, keywords indicating emotions are defined in advance, and when the text of the input voice contains such a keyword, the emotion estimation unit 512 determines that the user has that emotion (for example, if an expletive such as "Damn it" is included, it determines that the user's emotion is "anger"). In another example, the emotion estimation unit 512 estimates the emotion based on the pitch, volume, or speed of the input voice, or on their changes over time. In one example, when the average pitch of the input voice is lower than a threshold, the emotion estimation unit 512 determines that the user's emotion is "sad". In another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Specifically, when the pitch of the user's response is low even though the pitch of the voice output by the voice response system 1 is high, the emotion estimation unit 512 determines that the user's emotion is "sad". In yet another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the pitch at the end of the output voice and the pitch of the user's response to it. Alternatively, the emotion estimation unit 512 may estimate the user's emotion by considering a combination of these factors.
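The heuristics above can be combined as in the following sketch; the keyword list, the 150 Hz threshold, and the 1.5 pitch ratio are invented illustrative values.

```python
ANGER_KEYWORDS = {"damn"}   # example of keywords defined in advance

def estimate_emotion(text: str, user_pitch_hz: float, system_pitch_hz: float,
                     low_pitch_threshold_hz: float = 150.0) -> str:
    """Keyword match first, then the user's pitch, then the relation
    between the system's pitch and the user's pitch."""
    if any(keyword in text.lower() for keyword in ANGER_KEYWORDS):
        return "anger"
    if user_pitch_hz < low_pitch_threshold_hz:
        return "sad"
    if system_pitch_hz > user_pitch_hz * 1.5:
        return "sad"   # the system spoke high but the user answered low
    return "neutral"

if __name__ == "__main__":
    print(estimate_emotion("damn, not that song", user_pitch_hz=180, system_pitch_hz=200))
    print(estimate_emotion("play something", user_pitch_hz=120, system_pitch_hz=200))
```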
In another example, the emotion estimation unit 512 may estimate the user's emotion using inputs other than voice. As inputs other than voice, for example, video of the user's face captured by a camera, the user's body temperature detected by a temperature sensor, or a combination of these may be used. Specifically, the emotion estimation unit 512 determines from the user's facial expression whether the user's emotion is "happy", "angry", or "sad". The emotion estimation unit 512 may also determine the user's emotion based on changes in facial expression in video of the user's face. Alternatively, the emotion estimation unit 512 may determine "angry" when the user's body temperature is high and "sad" when it is low.
When music analysis is requested, the music analysis unit 513 analyzes the music played in response to the user's instruction (step S108) and outputs information indicating the analysis result (hereinafter "music information") to the processing unit 510 that made the request (step S109).
When lyrics extraction is requested, the lyrics extraction unit 514 acquires the lyrics of the music played in response to the user's instruction (step S110) and outputs information indicating the acquired lyrics (hereinafter "lyrics information") to the processing unit 510 that made the request (step S111).
In step S112, the processing unit 510 outputs the set of emotion information, music information, and lyrics information acquired from the emotion estimation unit 512, the music analysis unit 513, and the lyrics extraction unit 514 to the preference analysis unit 515.
In step S113, the preference analysis unit 515 analyzes multiple sets of this information to obtain information indicating the user's preferences. For this analysis, the preference analysis unit 515 records a plurality of such sets over a certain past period (for example, from the start of system operation to the present). In one example, the preference analysis unit 515 statistically processes the music information and calculates statistical representative values (for example, the mean, mode, or median). This statistical processing yields, for example, the average tempo and the most frequent timbre, tune, rhythm, chord progression, composer name, lyricist name, and performer name. The preference analysis unit 515 also decomposes the lyrics indicated by the lyrics information into words using a technique such as morphological analysis, identifies the part of speech of each word, creates a histogram for words of a specific part of speech (for example, nouns), and identifies words whose frequency of appearance falls within a predetermined range (for example, the top 5%). Furthermore, the preference analysis unit 515 extracts from the lyrics information word groups that contain an identified word and correspond to a predetermined syntactic unit (for example, a sentence, clause, or phrase). For example, when the word "like" appears frequently, word groups containing it, such as "I like you like that" and "because I like you so much", are extracted from the lyrics information. These averages, modes, and word groups are examples of information (parameters) indicating the user's preferences. Alternatively, the preference analysis unit 515 may analyze the sets of information according to a predetermined algorithm other than simple statistical processing to obtain information indicating the user's preferences. The preference analysis unit 515 may also receive feedback from the user and adjust the weights of these parameters according to the feedback. In step S114, the preference analysis unit 515 updates the classification table 5161 using the information obtained in step S113.
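A toy version of the statistical processing in step S113 is sketched below; the history format and the rough "top 5%" cut-off are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def summarize_preferences(history: list) -> dict:
    """Compute representative values (average tempo, most frequent chord progression
    and timbre) and the most frequent lyric nouns from a playback history."""
    tempos = [entry["tempo"] for entry in history]
    chords = Counter(entry["chords"] for entry in history)
    timbres = Counter(entry["timbre"] for entry in history)
    nouns = Counter(word for entry in history for word in entry["lyric_nouns"])
    top_count = max(1, len(nouns) // 20)          # roughly the top 5% of nouns
    return {
        "avg_tempo": mean(tempos),
        "top_chord_progression": chords.most_common(1)[0][0],
        "top_timbre": timbres.most_common(1)[0][0],
        "frequent_words": [word for word, _ in nouns.most_common(top_count)],
    }

if __name__ == "__main__":
    history = [
        {"tempo": 58, "chords": "I-V-VIm-IIIm-IV-I-IV-V", "timbre": "piano",
         "lyric_nouns": ["love", "spring"]},
        {"tempo": 62, "chords": "I-V-VIm-IIIm-IV-I-IV-V", "timbre": "piano",
         "lyric_nouns": ["love", "night"]},
    ]
    print(summarize_preferences(history))
```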
FIG. 8 is a diagram illustrating the classification table 5161. This figure shows the classification table 5161 for the user whose user name is "Taro Yamada". In the classification table 5161, the features, attributes, and lyrics of music are recorded in association with the user's emotions. Referring to the classification table 5161 shows, for example, that when the user "Taro Yamada" feels "happy", he prefers music whose lyrics include the words "恋" (romantic love), "愛" (love), and "love", whose tempo is about 60, which has the chord progression "I → V → VIm → IIIm → IV → I → IV → V", and in which the piano timbre is dominant. According to this embodiment, information indicating the user's preferences can be obtained automatically. The preference information recorded in the classification table 5161 accumulates as learning progresses, that is, as the cumulative usage time of the voice response system 1 increases, and comes to reflect the user's preferences more closely. According to this example, information reflecting the user's preferences can be obtained automatically.
Note that the preference analysis unit 515 may set the initial values of the classification table 5161 at a predetermined timing, such as at user registration or at the first login. In this case, the voice response system 1 may have the user select a character representing the user on the system (for example, a so-called avatar) and set a classification table 5161 having initial values corresponding to the selected character as the classification table for that user.
The data recorded in the classification table 5161 described in this embodiment is only an example. For example, the classification table 5161 need not record the user's emotions as long as it records at least the lyrics. Alternatively, the classification table 5161 need not record the lyrics as long as it records at least the user's emotions and the results of the music analysis.
3. Singing synthesis function
3-1. Configuration
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52. As functional elements related to the singing synthesis function 52, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a storage unit 516, a detection unit 521, a singing generation unit 522, an accompaniment generation unit 523, and a synthesis unit 524. The singing generation unit 522 includes a melody generation unit 5221 and a lyrics generation unit 5222. Description of the elements shared with the learning function 51 is omitted below.
Regarding the singing synthesis function 52, the storage unit 516 stores a segment database 5162. The segment database records the speech segment data used in singing synthesis. Speech segment data is data representing one or more phonemes. A phoneme corresponds to the smallest unit of linguistic meaning distinction (for example, a vowel or a consonant) and is the smallest phonological unit of a language, defined in consideration of the actual articulation of that language and its entire phonological system. A speech segment is a section corresponding to a desired phoneme or phoneme chain cut out from input speech uttered by a specific speaker. The speech segment data in this embodiment is data representing the frequency spectrum of the speech segment. In the following description, the term "speech segment" covers both a single phoneme (for example, a monophone) and a phoneme chain (for example, a diphone or a triphone).
The storage unit 516 may store a plurality of segment databases 5162. The plurality of segment databases 5162 may include, for example, databases that record phonemes pronounced by different singers (or speakers). Alternatively, they may include databases that record phonemes pronounced by a single singer (or speaker) in different singing styles or voice colors.
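As a rough illustration of how multiple segment databases per singer or voice color might be organized, the sketch below models each database as a mapping from a phoneme or diphone label to a frequency-spectrum array. The key format, array shape, and function names are assumptions made for this sketch only.

```python
import numpy as np

# A speech segment is modeled here as a frequency-spectrum matrix (frames x bins);
# the actual encoding used by the system may differ.
SegmentDB = dict[str, np.ndarray]

def make_db(keys: list[str], bins: int = 128, seed: int = 0) -> SegmentDB:
    """Build a dummy database with random spectra, for illustration only."""
    rng = np.random.default_rng(seed)
    return {k: rng.random((10, bins)) for k in keys}

units = ["a", "k-a", "a-i", "s-a"]           # monophones and diphones
segment_databases: dict[str, SegmentDB] = {  # one database per singer / voice color
    "singer_A_normal": make_db(units, seed=1),
    "singer_A_sweet": make_db(units, seed=2),
    "singer_B": make_db(units, seed=3),
}

def lookup(db_name: str, unit: str) -> np.ndarray:
    """Fetch the spectrum of a phoneme or diphone from the named database."""
    return segment_databases[db_name][unit]

print(lookup("singer_A_sweet", "k-a").shape)   # (10, 128)
```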
The singing generation unit 522 generates a singing voice, that is, performs singing synthesis. A singing voice is a voice in which given lyrics are uttered according to a given melody. The melody generation unit 5221 generates the melody used for singing synthesis, and the lyrics generation unit 5222 generates the lyrics used for singing synthesis. The melody generation unit 5221 and the lyrics generation unit 5222 may generate the melody and lyrics using information recorded in the classification table 5161. The singing generation unit 522 generates a singing voice using the melody generated by the melody generation unit 5221 and the lyrics generated by the lyrics generation unit 5222. The accompaniment generation unit 523 generates an accompaniment for the singing voice. The synthesis unit 524 synthesizes the output singing voice using the singing voice generated by the singing generation unit 522, the accompaniment generated by the accompaniment generation unit 523, and the speech segments recorded in the segment database 5162.
3-2. Operation
FIG. 10 is a flowchart outlining the operation (singing synthesis method) of the voice response system 1 related to the singing synthesis function 52. In step S21, the voice response system 1 determines (detects) whether an event that triggers singing synthesis has occurred. Events that trigger singing synthesis include, for example, at least one of the following: a voice input from the user, an event registered in a calendar (for example, an alarm or the user's birthday), an event in which an instruction for singing synthesis is input by a means other than voice (for example, an operation on a smartphone (not shown) wirelessly connected to the input/output device 10), and a randomly occurring event. When it is determined that an event triggering singing synthesis has occurred (S21: YES), the voice response system 1 proceeds to step S22. When it is determined that no such event has occurred (S21: NO), the voice response system 1 waits until an event triggering singing synthesis occurs.
In step S22, the voice response system 1 reads the singing synthesis parameters. In step S23, the voice response system 1 generates lyrics. In step S24, the voice response system 1 generates a melody. In step S25, the voice response system 1 modifies one of the generated lyrics and melody to match the other. In step S26, the voice response system 1 selects the segment database to be used (an example of a selection unit). In step S27, the voice response system 1 performs singing synthesis using the melody, lyrics, and segment database obtained in steps S23 to S26. In step S28, the voice response system 1 generates an accompaniment. In step S29, the voice response system 1 combines the singing voice and the accompaniment. The processing of steps S23 to S29 is part of the processing of step S18 in the flow of FIG. 6. The operation of the voice response system 1 related to the singing synthesis function 52 is described in more detail below.
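Read as a whole, steps S22 to S29 form a pipeline. The toy sketch below mirrors that pipeline with trivial stand-ins for each step, purely to show the order and data flow; none of the stand-in logic reflects the actual implementation.

```python
def synthesize_singing_response(request_text: str) -> dict:
    """Toy sketch of steps S22-S29; every step is reduced to a trivial stand-in."""
    params = {"tempo": 60}                                   # S22: read singing synthesis parameters
    lyrics = request_text.split() or ["la"]                  # S23: "generate" lyrics (one word per note)
    melody = [60, 62, 64, 65]                                # S24: "generate" a melody (MIDI note numbers)
    while len(melody) < len(lyrics):                         # S25: reconcile note count with lyric count
        melody.append(melody[-1])
    melody = melody[:len(lyrics)]
    segment_db = "default_singer"                            # S26: select a segment database
    vocal = list(zip(lyrics, melody))                        # S27: "render" the singing as (word, pitch) pairs
    accompaniment = [note - 12 for note in melody]           # S28: generate a trivial accompaniment
    return {"vocal": vocal, "accompaniment": accompaniment,  # S29: combine singing and accompaniment
            "segment_db": segment_db, "params": params}

print(synthesize_singing_response("what a sunny day it is"))
```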
FIG. 11 is a sequence chart illustrating the operation of the voice response system 1 related to the singing synthesis function 52. When it detects an event that triggers singing synthesis, the detection unit 521 requests singing synthesis from the singing generation unit 522 (step S201). The singing synthesis request includes the user's identifier. When singing synthesis is requested, the singing generation unit 522 queries the storage unit 516 for the user's preferences (step S202); this query includes the user identifier. Upon receiving the query, the storage unit 516 reads from the classification table 5161 the preference information corresponding to the user identifier included in the query and outputs it to the singing generation unit 522 (step S203). The singing generation unit 522 further queries the emotion estimation unit 512 for the user's emotion (step S204); this query also includes the user identifier. Upon receiving the query, the emotion estimation unit 512 outputs that user's emotion information to the singing generation unit 522 (step S205).
In step S206, the singing generation unit 522 selects the source of the lyrics. The source of the lyrics is determined according to the input voice and is, broadly speaking, either the processing unit 510 or the classification table 5161. A singing synthesis request output from the processing unit 510 to the singing generation unit 522 may or may not include lyrics (or lyric material). Lyric material is a character string that cannot form lyrics by itself but forms lyrics when combined with other lyric material. A case in which the singing synthesis request includes lyrics is, for example, a case in which the AI's response itself ("Tomorrow's weather will be fine", etc.) is given a melody and output as the response voice. Since the singing synthesis request is generated by the processing unit 510, the source of the lyrics can also be said to be the processing unit 510. Furthermore, since the processing unit 510 may acquire content from the content providing unit 60, the source of the lyrics can also be said to be the content providing unit 60. The content providing unit 60 is, for example, a server that provides news, a server that provides weather information, or a server that has a database recording the lyrics of existing songs. Although only one content providing unit 60 is shown in the figure, a plurality of content providing units 60 may exist. When the singing synthesis request includes lyrics, the singing generation unit 522 selects the request as the source of the lyrics. When the request does not include lyrics (for example, when the instruction given by the input voice does not specify the content of the lyrics, such as "sing something"), the singing generation unit 522 selects the classification table 5161 as the source of the lyrics.
In step S207, the singing generation unit 522 requests the selected source to provide lyric material. The example here assumes that the classification table 5161, that is, the storage unit 516, has been selected as the source. In this case, the request includes the user identifier and that user's emotion information. Upon receiving the request for lyric material, the storage unit 516 extracts from the classification table 5161 the lyric material corresponding to the user identifier and emotion information included in the request (step S208) and outputs the extracted lyric material to the singing generation unit 522 (step S209).
Having acquired the lyric material, the singing generation unit 522 requests the lyrics generation unit 5222 to generate lyrics (step S210). This request includes the lyric material acquired from the source. When lyrics generation is requested, the lyrics generation unit 5222 generates lyrics using the lyric material (step S211), for example by combining several pieces of lyric material. Alternatively, each source may store the lyrics of an entire song, in which case the lyrics generation unit 5222 may select, from the lyrics stored by the source, the lyrics of one song to be used for singing synthesis. The lyrics generation unit 5222 outputs the generated lyrics to the singing generation unit 522 (step S212).
In step S213, the singing generation unit 522 requests the melody generation unit 5221 to generate a melody. This request includes the user's preference information and information specifying the number of sounds in the lyrics, namely the number of characters, moras, or syllables of the generated lyrics. When melody generation is requested, the melody generation unit 5221 generates a melody according to the preference information included in the request (step S214). Specifically, for example, this proceeds as follows. The melody generation unit 5221 can access a database (hereinafter "melody database", not shown) of melody material, for example note sequences about two or four bars long, or information sequences obtained by subdividing note sequences into musical elements such as changes in rhythm and pitch. The melody database is stored, for example, in the storage unit 516 and records attributes of each melody, such as music information indicating a suitable mood or lyrics and the composer's name. The melody generation unit 5221 selects, from the material recorded in the melody database, one or more pieces of material that match the preference information included in the request, and combines the selected material to obtain a melody of the desired length. The melody generation unit 5221 outputs information specifying the generated melody (for example, sequence data such as MIDI) to the singing generation unit 522 (step S215).
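Step S214 can be pictured as filtering the melody database by the preference attributes and chaining matching fragments until the required length is reached. The tag-based matching and the material format below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class MelodyMaterial:
    notes: list[int]   # MIDI note numbers for a short (e.g. 2- or 4-bar) fragment
    mood: str          # attribute: mood / tune character the fragment suits
    composer: str      # attribute: composer name

MELODY_DB = [
    MelodyMaterial([60, 62, 64, 65], mood="happy", composer="A"),
    MelodyMaterial([67, 65, 64, 62], mood="happy", composer="B"),
    MelodyMaterial([57, 59, 60, 59], mood="sad", composer="A"),
]

def generate_melody(preferred_mood: str, min_notes: int) -> list[int]:
    """Pick materials matching the preference and chain them until the melody is long enough."""
    candidates = [m for m in MELODY_DB if m.mood == preferred_mood] or MELODY_DB
    melody: list[int] = []
    i = 0
    while len(melody) < min_notes:
        melody.extend(candidates[i % len(candidates)].notes)
        i += 1
    return melody[:min_notes]

print(generate_melody("happy", 10))
```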
In step S216, the singing generation unit 522 requests the melody generation unit 5221 to modify the melody, or requests the lyrics generation unit 5222 to modify the lyrics. One purpose of this modification is to make the number of sounds in the lyrics (for example, the number of moras) match the number of notes in the melody. For example, when the lyrics have fewer moras than the melody has notes (too few syllables), the singing generation unit 522 requests the lyrics generation unit 5222 to increase the number of characters in the lyrics. Conversely, when the lyrics have more moras than the melody has notes (too many syllables), the singing generation unit 522 requests the melody generation unit 5221 to increase the number of notes in the melody. This figure describes an example in which the lyrics are modified: in step S217, the lyrics generation unit 5222 modifies the lyrics in response to the request. When the melody is modified instead, the melody generation unit 5221 modifies it, for example, by dividing notes to increase the number of notes. The lyrics generation unit 5222 or the melody generation unit 5221 may also adjust the lyrics and melody so that the phrase boundaries of the lyrics coincide with the phrase boundaries of the melody. The lyrics generation unit 5222 outputs the modified lyrics to the singing generation unit 522 (step S218).
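The reconciliation of steps S216/S217 amounts to equalizing the mora (or syllable) count of the lyrics with the note count of the melody. The sketch below shows a melody-side adjustment that duplicates notes (a stand-in for splitting them) and truncates any excess; the system's actual counting and splitting rules are not specified, so this is only an assumption.

```python
def match_melody_to_lyrics(melody: list[int], mora_count: int) -> list[int]:
    """Adjust the note count so it equals the mora count of the lyrics."""
    melody = list(melody)
    i = 0
    while len(melody) < mora_count:      # too few notes: duplicate (stand-in for note splitting)
        melody.insert(i + 1, melody[i])
        i = (i + 2) % len(melody)
    return melody[:mora_count]           # too many notes: truncate (simplification)

print(match_melody_to_lyrics([60, 64, 67], 5))           # -> 5 notes
print(match_melody_to_lyrics([60, 62, 64, 65, 67], 3))   # -> 3 notes
```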
Upon receiving the lyrics, the singing generation unit 522 selects the segment database 5162 to be used for singing synthesis (step S219). The segment database 5162 is selected, for example, according to the attributes of the user related to the event that triggered the singing synthesis. Alternatively, it may be selected according to the content of that event, or according to the user's preference information recorded in the classification table 5161. The singing generation unit 522 synthesizes speech segments extracted from the selected segment database 5162 according to the lyrics and melody obtained so far, yielding synthesized singing data (step S220). Note that the classification table 5161 may also record information indicating the user's preferences regarding singing techniques such as changes of voice color, timing delays ("tame"), pitch scoops ("shakuri"), and vibrato, and the singing generation unit 522 may refer to this information to synthesize singing that reflects techniques matching the user's preferences. The singing generation unit 522 outputs the generated synthesized singing data to the synthesis unit 524 (step S221).
The singing generation unit 522 further requests the accompaniment generation unit 523 to generate an accompaniment (step S222). This request includes information indicating the melody used in the singing synthesis. The accompaniment generation unit 523 generates an accompaniment according to the melody included in the request (step S223), using a well-known technique for automatically attaching an accompaniment to a melody. When data indicating the chord progression of the melody (hereinafter "chord progression data") is recorded in the melody database, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data. Alternatively, when chord progression data intended for accompanying the melody is recorded in the melody database, the accompaniment generation unit 523 may use that data. Further alternatively, the accompaniment generation unit 523 may store several pieces of accompaniment audio data in advance and read out one that matches the chord progression of the melody. The accompaniment generation unit 523 may also refer to the classification table 5161, for example to determine the mood of the accompaniment, and generate an accompaniment that matches the user's preferences (an example of a determination unit). The accompaniment generation unit 523 outputs the generated accompaniment data to the synthesis unit 524 (step S224).
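When chord progression data is available, one simple way to realize step S223 is to emit one block chord per chord symbol. The degree-to-pitch mapping and the event format below are illustrative assumptions.

```python
# Map chord symbols (degrees in C major) to MIDI pitch sets; an illustrative voicing choice.
CHORD_TONES = {
    "I": [48, 52, 55], "IV": [53, 57, 60], "V": [55, 59, 62],
    "VIm": [57, 60, 64], "IIIm": [52, 55, 59],
}

def generate_accompaniment(chord_progression: list[str], beats_per_chord: int = 4):
    """Return a list of (pitches, duration_in_beats) events, one block chord per symbol."""
    return [(CHORD_TONES[c], beats_per_chord) for c in chord_progression]

print(generate_accompaniment(["I", "V", "VIm", "IIIm", "IV", "I", "IV", "V"]))
```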
Upon receiving the synthesized singing data and the accompaniment data, the synthesis unit 524 combines the synthesized singing and the accompaniment (step S225). In combining them, the singing and the accompaniment are synchronized by aligning the playback start position and the tempo. Synthesized singing data with accompaniment is thus obtained, and the synthesis unit 524 outputs it.
The example described here first generates the lyrics and then generates a melody to match them. However, the voice response system 1 may instead generate the melody first and then generate lyrics to match the melody. Also, although the example outputs the singing after it has been combined with an accompaniment, no accompaniment need be generated and the singing alone may be output (that is, a cappella). Furthermore, although the example first synthesizes the singing and then generates an accompaniment to match it, the accompaniment may be generated first and the singing synthesized to match the accompaniment.
4. Response function
FIG. 12 illustrates the functional configuration of the voice response system 1 related to the response function 53. As functional elements related to the response function 53, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, and a content decomposition unit 531. Descriptions of elements shared with the learning function 51 and the singing synthesis function 52 are omitted below. The content decomposition unit 531 decomposes a piece of content into a plurality of partial contents. Content here means the information output as the response voice, specifically, for example, music, news, recipes, or teaching material (sports instruction, instrument instruction, learning drills, quizzes).
FIG. 13 is a flowchart illustrating the operation of the voice response system 1 related to the response function 53. In step S31, the voice analysis unit 511 identifies the content to be played back. The content is identified, for example, according to the user's input voice: the voice analysis unit 511 analyzes the input voice and identifies the content whose playback it instructs. In one example, when the input voice "Tell me a hamburger recipe" is given, the voice analysis unit 511 instructs the processing unit 510 to provide a "hamburger recipe". The processing unit 510 accesses the content providing unit 60 and acquires text data describing the "hamburger recipe". The data acquired in this way is identified as the content to be played back, and the processing unit 510 notifies the content decomposition unit 531 of the identified content.
In step S32, the content decomposition unit 531 decomposes the content into a plurality of partial contents. In one example, a "hamburger recipe" consists of several steps (cutting the ingredients, mixing the ingredients, shaping, frying, and so on), and the content decomposition unit 531 decomposes the text of the "hamburger recipe" into four partial contents: a "cutting the ingredients" step, a "mixing the ingredients" step, a "shaping" step, and a "frying" step. The positions at which the content is split may be determined automatically, for example by AI. Alternatively, markers indicating boundaries may be embedded in the content in advance, and the content may be split at the positions of those markers.
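The marker-based variant of step S32 is straightforward to sketch: the content text is split at pre-embedded boundary markers. The marker character chosen below is arbitrary.

```python
def split_content(text: str, marker: str = "|") -> list[str]:
    """Split a content text into partial contents at pre-embedded boundary markers."""
    return [part.strip() for part in text.split(marker) if part.strip()]

recipe = "Cut the ingredients | mix the ingredients | shape the patties | fry them"
print(split_content(recipe))
# ['Cut the ingredients', 'mix the ingredients', 'shape the patties', 'fry them']
```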
In step S33, the content decomposition unit 531 identifies, from the plurality of partial contents, the one partial content to be processed (an example of a specifying unit). The target partial content is the partial content to be played back and is determined according to its position within the original content. In the "hamburger recipe" example, the content decomposition unit 531 first identifies the "cutting the ingredients" step as the target partial content; the next time step S33 is performed, it identifies the "mixing the ingredients" step. The content decomposition unit 531 notifies the content modification unit 532 of the identified partial content.
In step S34, the content modification unit 532 modifies the target partial content. The specific modification method is defined according to the content. For example, the content modification unit 532 does not modify content such as news, weather information, or recipes. For teaching material or quiz content, on the other hand, it replaces the portion to be hidden as a question with another sound (for example, humming, "la la la", or a beep). In doing so, the content modification unit 532 uses a replacement character string that has the same number of moras or syllables as the character string being replaced. The content modification unit 532 outputs the modified partial content to the singing generation unit 522.
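For quiz-style content, the replacement must preserve the mora or syllable count so the masked lyrics still fit the melody. The sketch below uses a crude vowel-group count as a stand-in syllable counter; it is not the counting method used by the system.

```python
import re

def count_syllables(text: str) -> int:
    """Very rough syllable count: one per vowel group (illustrative only)."""
    return max(1, len(re.findall(r"[aeiouAEIOU]+", text)))

def mask_answer(sentence: str, answer: str, filler_syllable: str = "la") -> str:
    """Replace the answer with 'la la ...' having the same (rough) syllable count."""
    filler = " ".join([filler_syllable] * count_syllables(answer))
    return sentence.replace(answer, filler)

print(mask_answer("The capital of France is Paris", "Paris"))
# The capital of France is la la
```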
In step S35, the singing generation unit 522 synthesizes singing from the modified partial content. The singing voice generated by the singing generation unit 522 is ultimately output from the input/output device 10 as the response voice. After outputting the response voice, the voice response system 1 waits for the user's response (step S36). In step S36, the voice response system 1 may output singing or speech prompting the user to respond (for example, "Are you done?"). The voice analysis unit 511 determines the next processing according to the user's response. When a response prompting playback of the next partial content is input (S36: next), the voice analysis unit 511 moves the processing to step S33. A response prompting playback of the next partial content is, for example, an utterance such as "next step", "done", or "finished". When any other response is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop outputting the voice.
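The wait-and-branch behaviour of steps S35/S36 reduces to a loop over partial contents that advances on a "next"-type reply and stops on anything else. The trigger phrases below come from the examples in the text; everything else is a simplified stand-in.

```python
NEXT_PHRASES = {"next step", "done", "finished"}   # replies that advance to the next part

def play_content(partial_contents: list[str], replies: list[str]) -> list[str]:
    """Sing each part, then check the user's (simulated) reply before moving on."""
    sung = []
    for part, reply in zip(partial_contents, replies):
        sung.append(f"[sung] {part}")              # stand-in for singing synthesis and output
        if reply.lower() not in NEXT_PHRASES:      # any other reply stops playback (S36: end)
            break
    return sung

steps = ["cut the ingredients", "mix the ingredients", "shape the patties", "fry them"]
print(play_content(steps, ["done", "next step", "stop the song"]))
```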
In step S37, the processing unit 510 stops, at least temporarily, the output of the synthesized voice of the partial content. In step S38, the processing unit 510 performs processing according to the user's input voice. The processing in step S38 includes, for example, stopping playback of the current content, performing a keyword search instructed by the user, and starting playback of other content. For example, when a response such as "Please stop the song", "That's enough", or "The end" is input, the processing unit 510 stops playback of the current content. When a question-type response such as "How do I cut into strips?" or "What is aglio e olio?" is input, the processing unit 510 acquires information for answering the user's question from the content providing unit 60 and outputs the answer as voice; this answer may be spoken voice rather than singing. When a response instructing playback of other content, such as "Play a song by XX", is input, the processing unit 510 acquires the instructed content from the content providing unit 60 and plays it back.
The example above decomposes content into a plurality of partial contents and determines the next processing for each partial content according to the user's reaction. However, content need not be decomposed into partial contents; it may be output as it is, either as spoken voice or as a singing voice that uses the content as its lyrics. The voice response system 1 may decide, according to the user's input voice or the content to be output, whether to decompose the content into partial contents or to output it as it is.
5. Operation examples
Several specific operation examples are described below. Although not explicitly stated in each case, each operation example is based on at least one of the learning function, the singing synthesis function, and the response function described above. The following operation examples all use Japanese, but the language is not limited to Japanese and may be any language.
5-1. Operation example 1
FIG. 14 illustrates operation example 1 of the voice response system 1. The user requests playback of a song with the input voice "Play 'Sakura Sakura' (song title) by Kazutaro Sato (performer name)". The voice response system 1 searches the music database according to this input voice and plays back the requested song. At this time, the voice response system 1 updates the classification table using the user's emotion at the time the input voice was given and the analysis result of the song. The classification table is updated every time playback of a song is requested. As the number of times the user requests the voice response system 1 to play music increases (that is, as the cumulative usage time of the voice response system 1 increases), the classification table comes to reflect that user's preferences more closely.
5-2. Operation example 2
FIG. 15 illustrates operation example 2 of the voice response system 1. The user requests singing synthesis with the input voice "Sing me something fun". The voice response system 1 performs singing synthesis according to this input voice, referring to the classification table and generating the lyrics and melody from the information recorded there. A song reflecting the user's preferences can therefore be created automatically.
5-3. Operation example 3
FIG. 16 illustrates operation example 3 of the voice response system 1. The user requests weather information with the input voice "What's the weather today?". In this case, as the answer to the request, the processing unit 510 accesses the weather information server of the content providing unit 60 and acquires text indicating today's weather (for example, "Sunny all day today"). The processing unit 510 outputs a singing synthesis request including the acquired text to the singing generation unit 522, which performs singing synthesis using the text included in the request as the lyrics. As the answer to the input voice, the voice response system 1 outputs a singing voice in which "Sunny all day today" is given a melody and an accompaniment.
5-4. Operation example 4
FIG. 17 illustrates operation example 4 of the voice response system 1. Before the illustrated exchange begins, the user has been using the voice response system 1 for two weeks and has often played love songs, so the classification table records information indicating that this user likes love songs. The voice response system 1 asks the user questions to obtain hints for generating lyrics, such as "Where would be a good place to meet?" and "Which season would be good?", and generates lyrics using the user's answers. Because the usage period is still only two weeks, the classification table does not yet sufficiently reflect the user's preferences, and its association with emotions is also insufficient. As a result, even though the user actually prefers ballads, the system may generate a rock-style song instead.
5-5. Operation example 5
FIG. 18 illustrates operation example 5 of the voice response system 1. This example continues from operation example 4, with the cumulative usage period now at one and a half months. Compared with operation example 4, the classification table reflects the user's preferences better, and the synthesized singing matches those preferences. The user can thus experience the responses of the voice response system 1, which were incomplete at first, gradually changing to suit his or her tastes.
5-6. Operation example 6
FIG. 19 illustrates operation example 6 of the voice response system 1. The user requests the "recipe" content for "hamburger" with the input voice "Can you tell me a hamburger recipe?". Given that "recipe" content should proceed to the next step only after the current step is finished, the voice response system 1 decides to decompose the content into partial contents and play it back in a manner in which the next processing is determined according to the user's reaction.
The hamburger recipe is decomposed step by step, and each time the singing for one step has been output, the voice response system 1 outputs voice prompting the user's response, such as "Are you done?" or "Finished?". When the user gives an input voice instructing the singing of the next step, such as "Done" or "What's next?", the voice response system 1 outputs the singing of the next step in response. When the user asks a question such as "How do I chop an onion?", the voice response system 1 outputs, in response, singing about "chopping an onion". When that singing is finished, the voice response system 1 resumes singing from where it left off in the hamburger recipe.
The voice response system 1 may output the singing voice of another piece of content between the singing voice of a first partial content and the singing voice of the second partial content that follows it. For example, the voice response system 1 outputs, between the singing voices of the first and second partial contents, a singing voice synthesized so as to have a time length corresponding to a matter indicated by a character string included in the first partial content. Specifically, when the first partial content indicates that a 20-minute wait will occur, such as "Now simmer the ingredients for 20 minutes", the voice response system 1 synthesizes and outputs a 20-minute song to be played while the ingredients are simmering.
The voice response system 1 may also output a singing voice synthesized using a second character string corresponding to a matter indicated by a first character string included in the first partial content, at a timing corresponding to the time length indicated by the first character string, after outputting the singing voice of the first partial content. Specifically, when the first partial content indicates that a 20-minute wait will occur, such as "Now simmer the ingredients for 20 minutes", the voice response system 1 may output the singing voice "Simmering is finished" (an example of the second character string) 20 minutes after outputting the first partial content. Alternatively, in the same example, when half of the waiting time (10 minutes) has elapsed, the system may sing, in a rap style, something like "10 minutes left until simmering is done".
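Both waiting-time behaviours described above (filling the wait with a song of matching length, and singing a reminder at the midpoint or end of the wait) start by extracting a duration from the first character string and scheduling output relative to it. The duration pattern handled below and the message texts are illustrative assumptions.

```python
import re
from typing import List, Optional, Tuple

def extract_wait_minutes(text: str) -> Optional[int]:
    """Pull a waiting time such as '20 minutes' out of a partial content."""
    m = re.search(r"(\d+)\s*minutes?", text)
    return int(m.group(1)) if m else None

def schedule_follow_ups(partial_content: str) -> List[Tuple[float, str]]:
    """Return (delay_in_minutes, text_to_sing) pairs relative to the end of the first part."""
    minutes = extract_wait_minutes(partial_content)
    if minutes is None:
        return []
    return [
        (minutes / 2, f"{minutes // 2} minutes left until simmering is done"),
        (float(minutes), "Simmering is finished"),
    ]

print(schedule_follow_ups("Now simmer the ingredients for 20 minutes"))
```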
5-7. Operation example 7
FIG. 21 illustrates operation example 7 of the voice response system 1. The user requests the content of a "procedure manual" with the input voice "Can you read out the procedure manual for the process at the factory?". Given that "procedure manual" content serves to check the user's memory, the voice response system 1 decides to decompose the content into partial contents and play it back in a manner in which the next processing is determined according to the user's reaction.
For example, the voice response system 1 splits the procedure manual at random positions into a plurality of partial contents. After outputting the singing of one partial content, it waits for the user's reaction. For example, for a procedure whose content is "After pressing switch A, press switch B when the reading of meter B drops to 10 or below", the voice response system 1 sings the part "After pressing switch A" and waits for the user's reaction. When the user says something, the voice response system 1 outputs the singing of the next partial content. At this point, the system may also change the singing speed of the next partial content depending on whether the user was able to say it correctly: if the user said the next partial content correctly, the voice response system 1 increases its singing speed; if not, it decreases the speed.
5-8. Operation example 8
FIG. 22 illustrates operation example 8 of the voice response system 1. Operation example 8 is an example of dementia prevention for elderly users; that the user is elderly is set in advance, for example through user registration. The voice response system 1 starts singing an existing song, for example in response to a user instruction, and pauses the singing at a random position or at a predetermined position (for example, just before the chorus). At that point it issues a message such as "Hmm, I can't remember" or "I've forgotten", behaving as if it had forgotten the lyrics, and waits for the user's response. When the user says something, the voice response system 1 treats (part of) what the user said as the correct lyrics and resumes singing from the continuation of those words. When the user says something, the voice response system 1 may also output a response such as "Thank you". When a predetermined time elapses while waiting for the user's response, the voice response system 1 may output speech such as "I remembered" and resume singing from where it paused.
5-9. Operation example 9
FIG. 23 illustrates operation example 9 of the voice response system 1. The user requests singing synthesis with the input voice "Sing me something fun", and the voice response system 1 performs singing synthesis according to this input voice. The segment database used for singing synthesis is selected, for example, according to the character chosen at user registration (for example, when a male character has been chosen, a segment database of a male singer is used). During the song, the user gives an input voice instructing a change of segment database, such as "Change to a female voice". The voice response system 1 switches the segment database used for singing synthesis according to the user's input voice. The switch may be made while the voice response system 1 is outputting the singing voice, or while it is waiting for the user's response as in operation examples 7 and 8.
The voice response system 1 may have a plurality of segment databases that record phonemes pronounced by a single singer (or speaker) in different singing styles or voice colors. For a given phoneme, the voice response system 1 may combine, that is, add, segments extracted from these databases at a certain ratio (usage ratio), and it may determine this usage ratio according to the user's reaction. Specifically, when two segment databases are recorded for a singer, one with a normal voice and one with a sweet voice, the system raises the usage ratio of the sweet-voice segment database when the user says "with a sweeter voice", and raises it further when the user says "with a much sweeter voice".
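Combining segments from a "normal" and a "sweet" database at a usage ratio that shifts with the user's requests can be sketched as a weighted sum of the two spectra. The step by which each request raises the ratio is an arbitrary assumption.

```python
import numpy as np

def blend_segments(normal: np.ndarray, sweet: np.ndarray, sweet_ratio: float) -> np.ndarray:
    """Weighted combination of the same phoneme taken from two segment databases."""
    sweet_ratio = min(max(sweet_ratio, 0.0), 1.0)
    return (1.0 - sweet_ratio) * normal + sweet_ratio * sweet

ratio = 0.3                                     # initial usage ratio of the "sweet voice" database
for request in ["with a sweeter voice", "with a much sweeter voice"]:
    ratio = min(1.0, ratio + 0.2)               # each request nudges the ratio up (assumed step size)
    print(request, "->", ratio)

normal_spec = np.zeros((4, 8))
sweet_spec = np.ones((4, 8))
print(blend_segments(normal_spec, sweet_spec, ratio)[0, 0])   # 0.7 after the two requests
```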
6. Modifications
The present invention is not limited to the embodiments described above, and various modifications are possible. Some modifications are described below; two or more of the following modifications may be used in combination.
In this description, a singing voice means a voice that includes singing in at least a part of it; it may include accompaniment-only portions without singing, or spoken-voice-only portions. For example, when content is decomposed into a plurality of partial contents, at least one partial content need not include singing. Singing may also include rap or the recitation of poetry.
The embodiments describe an example in which the learning function 51, the singing synthesis function 52, and the response function 53 are related to one another, but each of these functions may be provided independently. For example, the classification table obtained by the learning function 51 may be used to learn the user's preferences in a music distribution system that distributes music. Alternatively, the singing synthesis function 52 may perform singing synthesis using a classification table entered manually by the user. At least some of the functional elements of the voice response system 1 may also be omitted; for example, the voice response system 1 need not have the emotion estimation unit 512.
The assignment of functions to the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be changed; for example, the voice analysis unit 511 and the emotion estimation unit 512 may be implemented in the input/output device. The relative arrangement of the input/output device 10, the response engine 20, and the singing synthesis engine 30 may also be changed; for example, the singing synthesis engine 30 may be placed between the input/output device 10 and the response engine 20 and perform singing synthesis for those responses output from the response engine 20 that are judged to require it. The content used in the voice response system 1 may also be stored in a local device, such as the input/output device 10 or a device capable of communicating with the input/output device 10.
The hardware realizing the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be, for example, a smartphone or a tablet terminal. User input to the voice response system 1 is not limited to voice and may be given via a touch screen, a keyboard, or a pointing device. The input/output device 10 may also have a human presence sensor, and the voice response system 1 may use this sensor to control its operation depending on whether the user is nearby. For example, when it is determined that the user is not near the input/output device 10, the voice response system 1 may refrain from outputting voice (not return a reply). However, depending on the content of the voice, the voice response system 1 may output it regardless of whether the user is near the input/output device 10; for example, the voice announcing the remaining waiting time described in the second half of operation example 6 may be output regardless of whether the user is nearby. To detect whether the user is near the input/output device 10, sensors other than a human presence sensor, such as a camera or a temperature sensor, may be used, and a plurality of sensors may be used in combination.
The flowcharts and sequence charts illustrated in the embodiments are examples. In them, the order of processing may be changed, some processing may be omitted, and new processing may be added.
The programs executed in the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be provided stored on a recording medium such as a CD-ROM or semiconductor memory, or may be provided by download via a network such as the Internet.
This application is based on Japanese Patent Application No. 2017-116831 filed on June 14, 2017, which is incorporated herein by reference.
The present invention is useful because it can output a singing voice in accordance with interaction with the user.
DESCRIPTION OF SYMBOLS: 1…voice response system, 10…input/output device, 20…response engine, 30…singing synthesis engine, 51…learning function, 52…singing synthesis function, 53…response function, 60…content providing unit, 101…microphone, 102…input signal processing unit, 103…output signal processing unit, 104…speaker, 105…CPU, 106…sensor, 107…motor, 108…network IF, 201…CPU, 202…memory, 203…storage, 204…communication IF, 301…CPU, 302…memory, 303…storage, 304…communication IF, 510…processing unit, 511…voice analysis unit, 512…emotion estimation unit, 513…music analysis unit, 514…lyrics extraction unit, 515…preference analysis unit, 516…storage unit, 521…detection unit, 522…singing generation unit, 523…accompaniment generation unit, 524…synthesis unit, 5221…melody generation unit, 5222…lyrics generation unit, 531…content decomposition unit, 532…content modification unit

Claims (22)

1. A method for outputting a singing voice, comprising:
decomposing content into a plurality of partial contents;
identifying a first partial content from the plurality of partial contents;
synthesizing a first singing voice using a character string included in the first partial content;
outputting the first singing voice;
accepting a user's reaction to the first singing voice;
identifying, in response to the user's reaction, a second partial content related to the first partial content;
synthesizing a second singing voice using a character string included in the second partial content; and
outputting the second singing voice.
2. The method for outputting a singing voice according to claim 1, further comprising:
determining, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
3. The method for outputting a singing voice according to claim 2, wherein the element includes a parameter, melody, or tempo of the singing synthesis, or an arrangement of an accompaniment in the singing voice.
4. The method for outputting a singing voice according to any one of claims 1 to 3, wherein the first singing voice and the second singing voice are synthesized using segments recorded in at least one database selected from a plurality of databases, the method further comprising:
selecting, in response to the user's reaction, a database to be used for singing synthesis using the character string included in the second partial content.
5. The method for outputting a singing voice according to claim 4, wherein the first singing voice and the second singing voice are synthesized using segments recorded in a plurality of databases selected from the plurality of databases, and a plurality of databases are selected in the step of selecting the database, the method further comprising:
determining a usage ratio of the plurality of selected databases in accordance with the user's reaction.
6. The method for outputting a singing voice according to any one of claims 1 to 5, further comprising:
replacing a part of the character string included in the first partial content with another character string,
wherein in the step of synthesizing the first singing voice, the first singing voice is synthesized using the character string included in the first partial content, a part of which has been replaced with the other character string.
  7.  The method for outputting a singing voice according to claim 6, wherein the other character string and the character string to be replaced have the same number of syllables or the same number of morae.
  8.  The method for outputting a singing voice according to any one of claims 1 to 7, further comprising:
     replacing, in response to the user's reaction, a part of the second partial content with another character string,
     wherein, in the step of synthesizing the second singing voice, the second singing voice is synthesized using the character string included in the second partial content, a part of which has been replaced with the other character string.
  9.  The method for outputting a singing voice according to any one of claims 1 to 8, further comprising:
     synthesizing a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content; and
     outputting the third singing voice between the first singing voice and the second singing voice.
  10.  The method for outputting a singing voice according to any one of claims 1 to 9, further comprising:
     synthesizing a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content; and
     outputting, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length according to the matter indicated by the first character string.
  11.  The method for outputting a singing voice according to any one of claims 1 to 10, wherein the content includes a character string.
  12.  A voice response system comprising:
     a decomposition unit that decomposes content into a plurality of partial contents;
     a specifying unit that specifies a first partial content from the plurality of partial contents;
     a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content;
     an output unit that outputs the first singing voice; and
     an accepting unit that accepts a user's reaction to the first singing voice,
     wherein the specifying unit specifies, in response to the user's reaction, a second partial content related to the first partial content,
     the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and
     the output unit outputs the second singing voice.
  13.  The voice response system according to claim 12, further comprising a determination unit that determines, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
  14.  The voice response system according to claim 13, wherein the element includes a parameter, a melody, or a tempo of the singing synthesis, or an arrangement of an accompaniment in the singing voice.
  15.  The voice response system according to any one of claims 12 to 14, wherein the first singing voice and the second singing voice are synthesized using segments recorded in at least one database selected from a plurality of databases, the system further comprising a selection unit that selects, in response to the user's reaction, a database to be used for singing synthesis using the character string included in the second partial content.
  16.  The voice response system according to claim 15, wherein the first singing voice and the second singing voice are synthesized using segments recorded in a plurality of databases selected from the plurality of databases, the selection unit selects a plurality of databases, and the determination unit determines a usage ratio of the plurality of databases according to the user's reaction.
  17.  The voice response system according to any one of claims 12 to 16, further comprising a replacement unit that replaces a part of the character string included in the first partial content with another character string, wherein the synthesis unit synthesizes the first singing voice using the character string included in the first partial content, a part of which has been replaced with the other character string.
  18.  The voice response system according to claim 17, wherein the other character string and the character string to be replaced have the same number of syllables or the same number of morae.
  19.  The voice response system according to any one of claims 12 to 18, wherein, in response to the user's reaction, a part of the second partial content is replaced with another character string, and the synthesis unit synthesizes the second singing voice using the character string included in the second partial content, a part of which has been replaced with the other character string.
  20.  The voice response system according to any one of claims 12 to 19, wherein the synthesis unit synthesizes a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content, and the third singing voice is output between the first singing voice and the second singing voice.
  21.  The voice response system according to any one of claims 12 to 20, wherein the synthesis unit synthesizes a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content, and the output unit outputs, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length according to the matter indicated by the first character string.
  22.  The voice response system according to any one of claims 12 to 21, wherein the content includes a character string.
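The following non-limiting sketches illustrate, in Python, how some of the steps recited in the claims above might be realized. They are assumptions made for explanation only; every function name, reaction category, and value in them is hypothetical and is not defined by this application. The first sketch walks through the overall flow of claim 1: decompose content, sing a first part, accept the user's reaction, then select and sing a related second part.

```python
from typing import List

def decompose(content: str) -> List[str]:
    # Decomposing step: split the content into partial contents
    # (here, naively, into sentence-like chunks).
    return [p.strip() for p in content.replace("\n", " ").split(". ") if p.strip()]

def synthesize_singing(text: str) -> bytes:
    # Placeholder for a singing-synthesis engine call; returns audio data.
    return f"<sung:{text}>".encode("utf-8")

def output_audio(audio: bytes) -> None:
    # Placeholder for playback through a speaker.
    print("playing", audio.decode("utf-8"))

def accept_user_reaction() -> str:
    # Placeholder: the reaction could come from speech recognition,
    # a button press, or a sensor; here it is simply typed.
    return input("reaction (more/stop)? ").strip().lower()

def respond_with_singing(content: str) -> None:
    parts = decompose(content)                    # decompose content into partial contents
    first = parts[0]                              # identify the first partial content
    output_audio(synthesize_singing(first))       # synthesize and output the first singing voice
    reaction = accept_user_reaction()             # accept the user's reaction
    if reaction == "more" and len(parts) > 1:
        second = parts[1]                         # identify a related second partial content
        output_audio(synthesize_singing(second))  # synthesize and output the second singing voice
```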
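For claims 2 and 3, an element such as a synthesis parameter, melody, tempo, or accompaniment arrangement is determined from the user's reaction. The mapping below is purely illustrative; the reaction categories and preset values are assumptions, not something the application prescribes.

```python
# Hypothetical mapping from a classified user reaction to singing-synthesis
# elements (tempo, melody key, accompaniment arrangement).
ELEMENT_PRESETS = {
    "positive": {"tempo_bpm": 132, "melody_key": "C major", "accompaniment": "upbeat pop"},
    "neutral":  {"tempo_bpm": 108, "melody_key": "G major", "accompaniment": "light acoustic"},
    "negative": {"tempo_bpm": 84,  "melody_key": "A minor", "accompaniment": "sparse piano"},
}

def determine_elements(reaction: str) -> dict:
    # Fall back to the neutral preset for unrecognized reactions.
    return ELEMENT_PRESETS.get(reaction, ELEMENT_PRESETS["neutral"])
```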
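For claims 4 and 5, segments are drawn from several singing-voice databases, and the usage ratio between the selected databases follows the user's reaction. The sketch below picks a database per syllable according to such a ratio; the database names and ratio values are assumptions.

```python
import random

# Hypothetical per-syllable database selection following a usage ratio
# derived from the user's reaction (e.g. 70% voice_a, 30% voice_b).
def choose_usage_ratio(reaction: str) -> dict:
    if reaction == "positive":
        return {"voice_a": 0.7, "voice_b": 0.3}
    return {"voice_a": 0.3, "voice_b": 0.7}

def pick_segment_database(ratios: dict) -> str:
    names = list(ratios)
    return random.choices(names, weights=[ratios[n] for n in names], k=1)[0]

# Roughly 70% of ten syllables would use segments from voice_a for a positive reaction.
ratios = choose_usage_ratio("positive")
plan = [pick_segment_database(ratios) for _ in range(10)]
```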
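Claims 6 and 7 allow part of a character string to be replaced with another string only when the two have the same number of syllables or morae, which keeps the substituted lyric singable to the same melody. The mora-counting rule below (contracted small kana do not add a mora) is a common approximation assumed for this sketch, not a definition given by the application.

```python
# Hypothetical mora-count check for lyric substitution (kana input assumed).
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")  # characters that do not add a mora

def count_morae(kana: str) -> int:
    return sum(1 for ch in kana if ch not in SMALL_KANA)

def can_substitute(original: str, candidate: str) -> bool:
    # Permit the substitution only when the mora counts match.
    return count_morae(original) == count_morae(candidate)

# Example: can_substitute("きょう", "あした") is False (2 morae vs 3 morae).
```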
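Claims 9 and 10 cover singing whose duration or timing follows a matter indicated in the first partial content, for example singing a filler for the length of a recipe step and then a follow-up phrase once that time has elapsed. The duration parsing and scheduling below are illustrative assumptions.

```python
import re
import threading
from typing import Callable, Optional

# Hypothetical scheduling of a follow-up singing voice after a duration
# mentioned in the first character string (e.g. "boil for 3 minutes").
def extract_minutes(text: str) -> Optional[float]:
    match = re.search(r"(\d+)\s*minute", text)
    return float(match.group(1)) if match else None

def schedule_followup_singing(first_text: str, sing: Callable[[str], None]) -> None:
    minutes = extract_minutes(first_text)
    if minutes is not None:
        # Output the fourth singing voice once the indicated time has passed.
        threading.Timer(minutes * 60, sing, args=("Time is up!",)).start()

# Example: schedule_followup_singing("Boil the pasta for 3 minutes", print)
```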
PCT/JP2018/022816 2017-06-14 2018-06-14 Method for outputting singing voice, and voice response system WO2018230670A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017116831A JP6977323B2 (en) 2017-06-14 2017-06-14 Singing voice output method, voice response system, and program
JP2017-116831 2017-06-14

Publications (1)

Publication Number Publication Date
WO2018230670A1 true WO2018230670A1 (en) 2018-12-20

Family

ID=64660282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/022816 WO2018230670A1 (en) 2017-06-14 2018-06-14 Method for outputting singing voice, and voice response system

Country Status (2)

Country Link
JP (2) JP6977323B2 (en)
WO (1) WO2018230670A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6594577B1 (en) * 2019-03-27 2019-10-23 株式会社博報堂Dyホールディングス Evaluation system, evaluation method, and computer program.
JP2020177534A (en) * 2019-04-19 2020-10-29 京セラドキュメントソリューションズ株式会社 Transmission type wearable terminal


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3307283B2 (en) * 1997-06-24 2002-07-24 ヤマハ株式会社 Singing sound synthesizer
JPH11175082A (en) * 1997-12-10 1999-07-02 Toshiba Corp Voice interaction device and voice synthesizing method for voice interaction
JP2001043126A (en) 1999-07-27 2001-02-16 Tadamitsu Ryu Robot system
JP2002221978A (en) 2001-01-26 2002-08-09 Yamaha Corp Vocal data forming device, vocal data forming method and singing tone synthesizer
JP2002258872A (en) 2001-02-27 2002-09-11 Casio Comput Co Ltd Voice information service system and voice information service method
KR20090046003A (en) * 2007-11-05 2009-05-11 주식회사 마이크로로봇 Robot toy apparatus
WO2013190963A1 (en) 2012-06-18 2013-12-27 エイディシーテクノロジー株式会社 Voice response device
JP6166889B2 (en) 2012-11-15 2017-07-19 株式会社Nttドコモ Dialog support apparatus, dialog system, dialog support method and program
JP6596843B2 (en) 2015-03-02 2019-10-30 ヤマハ株式会社 Music generation apparatus and music generation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1097529A (en) * 1996-05-29 1998-04-14 Yamaha Corp Versification supporting device, method therefor and storage medium
JPH11219195A (en) * 1998-02-04 1999-08-10 Atr Chino Eizo Tsushin Kenkyusho:Kk Interactive mode poem reading aloud system
JP2003131548A (en) * 2001-10-29 2003-05-09 Mk Denshi Kk Language learning device
JP2006227589A (en) * 2005-01-20 2006-08-31 Matsushita Electric Ind Co Ltd Device and method for speech synthesis
JP2015022293A (en) * 2013-07-24 2015-02-02 カシオ計算機株式会社 Voice output controller, electronic device, and voice output control program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI749447B (en) * 2020-01-16 2021-12-11 國立中正大學 Synchronous speech generating device and its generating method
WO2022113914A1 (en) * 2020-11-25 2022-06-02 ヤマハ株式会社 Acoustic processing method, acoustic processing system, electronic musical instrument, and program
CN113488007A (en) * 2021-07-07 2021-10-08 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113488007B (en) * 2021-07-07 2024-06-11 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP2022017561A (en) 2022-01-25
JP2019003000A (en) 2019-01-10
JP7424359B2 (en) 2024-01-30
JP6977323B2 (en) 2021-12-08

Similar Documents

Publication Publication Date Title
JP7424359B2 (en) Information processing device, singing voice output method, and program
JP7363954B2 (en) Singing synthesis system and singing synthesis method
US10629179B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US11468870B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US11854518B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
TWI497484B (en) Performance evaluation device, karaoke device, server device, performance evaluation system, performance evaluation method and program
EP3675122B1 (en) Text-to-speech from media content item snippets
KR101274961B1 (en) music contents production system using client device.
EP3759706B1 (en) Method, computer program and system for combining audio signals
US6737572B1 (en) Voice controlled electronic musical instrument
JP2005342862A (en) Robot
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP2007264569A (en) Retrieval device, control method, and program
Lesaffre et al. The MAMI Query-By-Voice Experiment: Collecting and annotating vocal queries for music information retrieval
JP4808641B2 (en) Caricature output device and karaoke device
JP2016071187A (en) Voice synthesis device and voice synthesis system
JP2022065554A (en) Method for synthesizing voice and program
Bresin et al. Rule-based emotional coloring of music performance
JP2017062313A (en) Karaoke device, karaoke system and program
JPH1165594A (en) Musical sound generating device and computer-readable record medium recorded with musical sound generating and processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18817438

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18817438

Country of ref document: EP

Kind code of ref document: A1