WO2018230670A1 - Method for outputting singing voice, and voice response system - Google Patents

Method for outputting singing voice, and voice response system

Info

Publication number
WO2018230670A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
character string
singing
singing voice
user
Prior art date
Application number
PCT/JP2018/022816
Other languages
French (fr)
Japanese (ja)
Inventor
大樹 倉光
頌子 奈良
強 宮木
浩雅 椎原
健一 山内
晋 山中
Original Assignee
ヤマハ株式会社
Priority date
Filing date
Publication date
Application filed by ヤマハ株式会社
Publication of WO2018230670A1


Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The present invention relates to a technique for responding to user input with a voice that includes singing.
  • Patent Document 1 discloses a technique for changing the atmosphere of music according to the user's situation and preferences.
  • Patent Document 2 discloses a technique for making music selections that do not become monotonous in a device that outputs musical sounds according to the state of a moving body.
  • However, neither Patent Document 1 nor Patent Document 2 outputs a singing voice according to the interaction with the user.
  • In view of this, the present invention provides a technique for outputting a singing voice according to the interaction with the user.
  • To this end, the present invention provides a singing voice output method including: a step of decomposing content into a plurality of partial contents; a step of specifying a first partial content from the plurality of partial contents; a step of synthesizing a first singing voice using a character string included in the first partial content; a step of outputting the first singing voice; a step of receiving a user's reaction to the first singing voice; a step of specifying, in response to the user's reaction, a second partial content related to the first partial content; a step of synthesizing a second singing voice using a character string included in the second partial content; and a step of outputting the second singing voice.
  • the content includes, for example, a character string.
  • The singing voice output method may include a step of determining, in response to the user's reaction, an element used when singing synthesis is performed using the character string included in the second partial content.
  • The element may include a parameter of the singing synthesis, a melody, a tempo, or an arrangement of the accompaniment in the singing voice.
  • The synthesis of the first singing voice and the second singing voice may be performed using segments recorded in at least one database selected from a plurality of databases, and the singing voice output method may include a step of selecting, in response to the user's reaction, the database used when singing synthesis is performed using the character string included in the second partial content.
  • The synthesis of the first singing voice and the second singing voice may be performed using segments recorded in two or more databases selected from the plurality of databases, and the singing voice output method may include a step of determining a usage ratio of those databases according to the user's reaction.
  • The singing voice output method may include a step of replacing a part of the character string included in the first partial content with another character string, and in the step of synthesizing the first singing voice, the first singing voice may be synthesized using the character string included in the first partial content in which the part has been replaced with the other character string.
  • The other character string and the character string to be replaced may have the same number of syllables or the same number of morae.
  • The singing voice output method may include a step of replacing, in response to the user's reaction, a part of the character string included in the second partial content with another character string, and in the step of synthesizing the second singing voice, the second singing voice may be synthesized using the character string included in the second partial content in which the part has been replaced with the other character string.
  • The singing voice output method may include a step of synthesizing a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content, and a step of outputting the third singing voice between the first singing voice and the second singing voice.
  • The singing voice output method may include a step of synthesizing a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content, and a step of outputting the fourth singing voice after the first singing voice at a timing corresponding to the time length indicated by the first character string.
  • The present invention also provides an information processing system comprising: a decomposition unit that decomposes content into a plurality of partial contents; a specifying unit that specifies a first partial content from the plurality of partial contents; a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content; an output unit that outputs the first singing voice; and a reception unit that receives a user's reaction to the first singing voice, wherein the specifying unit specifies a second partial content related to the first partial content in response to the user's reaction, the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and the output unit outputs the second singing voice.
  • According to the present invention, a singing voice can be output in accordance with the interaction with the user.
  • FIG. 1 is a diagram illustrating an outline of a voice response system 1 according to an embodiment.
  • FIG. 2 is a diagram illustrating an outline of functions of the voice response system 1.
  • FIG. 3 is a diagram illustrating a hardware configuration of the input / output device 10.
  • FIG. 4 is a diagram illustrating a hardware configuration of the response engine 20 and the song synthesis engine 30.
  • FIG. 5 is a diagram illustrating a functional configuration related to the learning function 51.
  • FIG. 6 is a flowchart showing an outline of an operation related to the learning function 51.
  • FIG. 7 is a sequence chart illustrating an operation related to the learning function 51.
  • FIG. 8 is a diagram illustrating a classification table 5161.
  • FIG. 9 is a diagram illustrating a functional configuration related to the song synthesis function 52.
  • FIG. 10 is a flowchart showing an outline of the operation related to the song synthesis function 52.
  • FIG. 11 is a sequence chart illustrating an operation related to the song synthesis function 52.
  • FIG. 12 is a diagram illustrating a functional configuration related to the response function 53.
  • FIG. 13 is a flowchart illustrating an operation related to the response function 53.
  • FIG. 14 is a diagram showing an operation example 1 of the voice response system 1.
  • FIG. 15 is a diagram illustrating an operation example 2 of the voice response system 1.
  • FIG. 16 is a diagram showing an operation example 3 of the voice response system 1.
  • FIG. 17 is a diagram showing an operation example 4 of the voice response system 1.
  • FIG. 18 is a diagram illustrating an operation example 5 of the voice response system 1.
  • FIG. 19 is a diagram illustrating an operation example 6 of the voice response system 1.
  • FIG. 20 is a diagram illustrating an operation example 7 of the voice response system 1.
  • FIG. 21 is a diagram showing an operation example 8 of the voice response system 1.
  • FIG. 22 is a diagram showing an operation example 9 of the voice response system 1.
  • FIG. 1 is a diagram illustrating an overview of a voice response system 1 according to an embodiment.
  • the voice response system 1 is a so-called AI (Artificial Intelligence) voice assistant that automatically outputs a voice response in response to an input (or instruction) by a user.
  • voice input from the user to the voice response system 1 is referred to as “input voice”
  • voice output from the voice response system 1 in response to the input voice is referred to as “response voice”.
  • The response voice includes singing.
  • the voice response system 1 is an example of a song synthesis system. For example, when the user speaks “Sing something” to the voice response system 1, the voice response system 1 automatically synthesizes the song and outputs the synthesized song.
  • the voice response system 1 includes an input / output device 10, a response engine 20, and a song synthesis engine 30.
  • the input / output device 10 is a device that provides a man-machine interface, and is a device that receives an input voice from a user and outputs a response voice in response to the input voice.
  • the response engine 20 analyzes the input voice received by the input / output device 10 and generates a response voice. At least a part of the response voice includes singing voice.
  • the singing voice synthesis engine 30 synthesizes the singing voice used for the response voice.
  • FIG. 2 is a diagram illustrating an outline of functions of the voice response system 1.
  • the voice response system 1 has a learning function 51, a song synthesis function 52, and a response function 53.
  • the response function 53 is a function of analyzing a user input voice and providing a response voice based on the analysis result, and is provided by the input / output device 10 and the response engine 20.
  • the learning function 51 is a function for learning the user's preference from the user's input voice, and is provided by the singing synthesis engine 30.
  • the singing voice synthesizing function 52 is a function for synthesizing the singing voice used for the response voice, and is provided by the singing voice synthesis engine 30.
  • the learning function 51 learns the user's preference using the analysis result obtained by the response function 53.
  • the singing voice synthesis function 52 synthesizes a singing voice based on learning performed by the learning function 51.
  • the response function 53 makes a response using the singing voice synthesized by the singing voice synthesis function 52.
  • FIG. 3 is a diagram illustrating a hardware configuration of the input / output device 10.
  • the input / output device 10 includes a microphone 101, an input signal processing unit 102, an output signal processing unit 103, a speaker 104, a CPU (Central Processing Unit) 105, a sensor 106, a motor 107, and a network IF 108.
  • the microphone 101 converts the user's voice into an electric signal (input sound signal).
  • the input signal processing unit 102 performs processing such as analog / digital conversion on the input sound signal, and outputs data indicating the input sound (hereinafter referred to as “input sound data”).
  • the output signal processing unit 103 performs processing such as digital / analog conversion on data indicating response sound (hereinafter referred to as “response sound data”), and outputs an output sound signal.
  • the speaker 104 converts the output sound signal into sound (outputs sound based on the output sound signal).
  • the CPU 105 controls other elements of the input / output device 10 and reads and executes a program from a memory (not shown).
  • the sensor 106 detects the position of the user (the direction of the user viewed from the input / output device 10), and is an infrared sensor or an ultrasonic sensor, for example.
  • the motor 107 changes the direction of at least one of the microphone 101 and the speaker 104 so as to face the direction in which the user is present.
  • the microphone 101 may be configured by a microphone array, and the CPU 105 may detect the direction in which the user is present based on the sound collected by the microphone array.
  • The network IF 108 is an interface for performing communication via a network (for example, the Internet), and includes, for example, an antenna and a chip set for performing communication in accordance with a predetermined wireless communication standard (for example, WiFi (registered trademark)).
  • FIG. 4 is a diagram illustrating a hardware configuration of the response engine 20 and the song synthesis engine 30.
  • the response engine 20 includes a CPU 201, a memory 202, a storage 203, and a communication IF 204.
  • the CPU 201 performs various calculations according to the program and controls other elements of the computer apparatus.
  • the memory 202 is a main storage device that functions as a work area when the CPU 201 executes a program, and includes, for example, a RAM (Random Access Memory).
  • the storage 203 is a nonvolatile auxiliary storage device that stores various programs and data, and includes, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive).
  • the communication IF 204 includes a connector and a chip set for performing communication according to a predetermined communication standard (for example, Ethernet).
  • the storage 203 stores a program for causing the computer device to function as the response engine 20 in the voice response system 1 (hereinafter referred to as “response program”).
  • the computer device functions as the response engine 20 by the CPU 201 executing the response program.
  • the response engine 20 is, for example, a so-called AI.
  • the song synthesis engine 30 includes a CPU 301, a memory 302, a storage 303, and a communication IF 304. Details of each element are the same as those of the response engine 20.
  • the storage 303 stores a program for causing the computer device to function as the song synthesis engine 30 in the voice response system 1 (hereinafter referred to as “song synthesis program”).
  • The computer device functions as the song synthesis engine 30 by the CPU 301 executing the song synthesis program.
  • the response engine 20 and the song synthesis engine 30 are provided as cloud services on the Internet. Note that the response engine 20 and the song synthesis engine 30 may be services that do not depend on cloud computing.
  • FIG. 5 is a diagram illustrating a functional configuration related to the learning function 51.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a music analysis unit 513, a lyrics extraction unit 514, a preference analysis unit 515, a storage unit 516, and a processing unit 510.
  • the input / output device 10 functions as a receiving unit that receives user input voice and an output unit that outputs response voice.
  • The voice analysis unit 511 analyzes the input voice. This analysis is a process of acquiring, from the input voice, information used for generating a response voice, and specifically includes a process of converting the input voice into text (that is, into a character string), a process of determining the request for content, a process of specifying the content providing unit 60 that provides the content in response to the user's request, a process of instructing the specified content providing unit 60, a process of acquiring data from the content providing unit 60, and a process of generating a response using the acquired data.
  • the content providing unit 60 is an external system of the voice response system 1.
  • The content providing unit 60 is, for example, an external server that provides a service (for example, a music streaming service or internet radio) that outputs data for reproducing content such as music as sound (hereinafter referred to as “music data”).
  • the music analysis unit 513 analyzes the music data output from the content providing unit 60.
  • the analysis of music data refers to a process of extracting music characteristics.
  • the music features include at least one of tune, rhythm, chord progression, tempo, and arrangement. A known technique is used for feature extraction.
  • the lyrics extracting unit 514 extracts lyrics from the music data output from the content providing unit 60.
  • the music data includes metadata in addition to sound data.
  • the sound data is data indicating a signal waveform of music, and includes, for example, uncompressed data such as PCM (Pulse Code Modulation) data or compressed data such as MP3 data.
  • The metadata is data including information related to the music, and includes, for example, the music title, performer name, composer name, songwriter name, album title, genre and other music attributes, and lyrics information.
  • the lyrics extraction unit 514 extracts lyrics from metadata included in the music data. When the music data does not include metadata, the lyrics extraction unit 514 performs speech recognition processing on the sound data, and extracts lyrics from text obtained by the speech recognition.
  • the emotion estimation unit 512 estimates the user's emotion.
  • the emotion estimation unit 512 estimates the user's emotion from the input voice.
  • a known technique is used for emotion estimation.
  • the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch in the voice output by the voice response system 1 and the pitch of the user's response to the pitch.
  • the emotion estimation unit 512 may estimate the user's emotion based on the input voice converted into text by the voice analysis unit 511 or the analyzed user request.
  • the preference analysis unit 515 indicates the user's preference using at least one of the reproduction history of the music that the user has instructed to reproduce, the analysis result, the lyrics, and the user's emotion when the reproduction of the music is instructed.
  • Information (hereinafter referred to as “preference information”) is generated.
  • the preference analysis unit 515 updates the classification table 5161 stored in the storage unit 516 using the generated preference information.
  • The classification table 5161 is a table (or database) in which the user's preferences are recorded. For example, for each user and for each emotion, the features of music (for example, timbre, tune, rhythm, chord progression, and tempo), the attributes of music (performer name, composer name, songwriter name, and genre), and lyrics are recorded.
  • The storage unit 516 is an example of a reading unit that reads, from a table in which parameters used for singing synthesis are recorded in association with users, the parameters corresponding to the user who input the trigger.
  • The parameters used for singing synthesis are data referred to at the time of singing synthesis; in the classification table 5161, this is a concept that includes timbre, tune, rhythm, chord progression, tempo, performer name, composer name, songwriter name, genre, and lyrics.
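
As a non-normative illustration only, the classification table 5161 and the parameter read-out performed by the storage unit 516 might be modeled as in the following Python sketch; the field names and values are hypothetical and are not taken from the patent.

    # Hypothetical in-memory model of classification table 5161: one record of
    # singing-synthesis parameters per (user, emotion) pair.
    classification_table = {
        ("Taro Yamada", "happy"): {
            "lyric_words": ["love"],
            "tempo": 60,
            "chord_progression": "I-V-VIm-IIIm-IV-I-IV-V",
            "timbre": "piano",
            "genre": None,
            "performer": None,
        },
    }

    def read_parameters(user, emotion, table=classification_table):
        """Read the singing-synthesis parameters recorded for the user who
        input the trigger (the role described for storage unit 516)."""
        return table.get((user, emotion), {})

    if __name__ == "__main__":
        params = read_parameters("Taro Yamada", "happy")
        print(params["tempo"], params["timbre"])  # 60 piano
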
  • FIG. 6 is a flowchart showing an outline of the operation of the voice response system 1 according to the learning function 51.
  • the voice response system 1 analyzes the input voice.
  • the voice response system 1 performs processing instructed by the input voice.
  • the voice response system 1 determines whether the input voice includes an item to be learned. When it is determined that the input voice includes an item to be learned (S13: YES), the voice response system 1 moves the process to step S14. When it is determined that the input voice does not include items to be learned (S13: NO), the voice response system 1 moves the process to step S18.
  • the voice response system 1 estimates the user's emotion.
  • In step S15, the voice response system 1 analyzes the music for which playback has been instructed.
  • In step S16, the voice response system 1 acquires the lyrics of the music for which playback has been instructed.
  • In step S17, the voice response system 1 updates the classification table using the information obtained in steps S14 to S16.
  • The processing from step S18 onward is not directly related to the learning function 51, that is, to updating the classification table, but it includes processing that uses the classification table.
  • In step S18, the voice response system 1 generates a response voice for the input voice.
  • The classification table is referred to as necessary.
  • In step S19, the voice response system 1 outputs the response voice.
  • FIG. 7 is a sequence chart illustrating the operation of the voice response system 1 according to the learning function 51.
  • the user performs user registration with the voice response system 1 when, for example, the voice response system 1 is subscribed or activated for the first time.
  • User registration includes setting of a user name (or login ID) and a password.
  • the input / output device 10 is activated at the start of the sequence in FIG. 7, and the login process of the user is completed. That is, in the voice response system 1, a user who uses the input / output device 10 is specified.
  • the input / output device 10 is in a state of waiting for a user's voice input (speech).
  • the method by which the voice response system 1 identifies the user is not limited to the login process.
  • the voice response system 1 may specify the user based on the input voice.
  • the input / output device 10 receives an input voice.
  • the input / output device 10 converts the input voice into data and generates voice data.
  • The voice data includes sound data indicating the signal waveform of the input voice, and a header.
  • the header includes information indicating the attribute of the input voice.
  • the attributes of the input voice include, for example, an identifier for specifying the input / output device 10, a user identifier (for example, a user name or a login ID) of a user who has issued the voice, and a time stamp indicating the time at which the voice was emitted.
  • the input / output device 10 outputs voice data indicating the input voice to the voice analysis unit 511.
  • In step S103, the voice analysis unit 511 analyzes the input voice using the voice data.
  • the voice analysis unit 511 determines whether the input voice includes items to be learned.
  • the item to be learned is a matter for specifying a song, specifically, a music playback instruction.
  • In step S104, the processing unit 510 performs the processing instructed by the input voice.
  • the processing performed by the processing unit 510 is, for example, streaming playback of music.
  • the content providing unit 60 has a music database in which a plurality of music data is recorded.
  • the processing unit 510 reads the music data of the instructed music from the music database.
  • the processing unit 510 transmits the read music data to the input / output device 10 that is the transmission source of the input sound.
  • In another example, the processing performed by the processing unit 510 is playback of internet radio.
  • the content providing unit 60 performs streaming broadcasting of radio sound.
  • the processing unit 510 transmits the streaming data received from the content providing unit 60 to the input / output device 10 that is the transmission source of the input audio.
  • the processing unit 510 further performs processing for updating the classification table (step S105).
  • The processing for updating the classification table includes a request for emotion estimation to the emotion estimation unit 512 (step S1051), a request for music analysis to the music analysis unit 513 (step S1052), and a request for lyrics extraction to the lyrics extraction unit 514 (step S1053).
  • Upon receiving the request, the emotion estimation unit 512 estimates the user's emotion (step S106) and outputs information indicating the estimated emotion (hereinafter referred to as “emotion information”) to the processing unit 510 that is the request source (step S107).
  • the emotion estimation unit 512 estimates the user's emotion using the input voice.
  • The emotion estimation unit 512 estimates the emotion based on, for example, the input voice that has been converted into text. In one example, keywords indicating emotions are defined in advance, and when the input voice converted into text includes such a keyword, the emotion estimation unit 512 determines that the user has the corresponding emotion (for example, if a keyword expressing anger is included, the user's emotion is determined to be “anger”).
  • In another example, the emotion estimation unit 512 estimates the emotion based on the pitch, volume, speed, or temporal change of the input voice. In one example, when the average pitch of the input voice is lower than a threshold value, the emotion estimation unit 512 determines that the user's emotion is “sad”. In another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Specifically, when the pitch of the user's response is low even though the pitch of the voice output by the voice response system 1 is high, the emotion estimation unit 512 determines that the user's emotion is “sad”.
  • the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the pitch of the ending in the voice and the pitch of the user's response thereto. Alternatively, the emotion estimation unit 512 may estimate the user's emotion in consideration of these multiple factors.
  • the emotion estimation unit 512 may estimate the user's emotion using an input other than voice.
  • the input other than the voice for example, an image of a user's face taken by a camera, a user's body temperature detected by a temperature sensor, or a combination thereof is used.
  • the emotion estimation unit 512 determines whether the user's emotion is “fun”, “anger”, or “sad” from the facial expression of the user.
  • the emotion estimation unit 512 may determine the user's emotion based on the change in facial expression in the user's facial video.
  • the emotion estimation unit 512 may determine “anger” when the user's body temperature is high and “sad” when the user's body temperature is low.
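
The keyword-based and pitch-based estimation described in the preceding items can be sketched as follows; this is a non-normative illustration, and the keyword lists, the threshold value, and the function name are assumptions rather than anything specified in the patent.

    # Hypothetical keyword lists per emotion (the patent only states that such
    # keywords are defined in advance).
    EMOTION_KEYWORDS = {
        "anger": ["angry", "damn"],
        "fun": ["great", "yay"],
        "sad": ["tired", "lonely"],
    }
    SAD_PITCH_THRESHOLD_HZ = 150.0  # assumed threshold for a "low" average pitch

    def estimate_emotion(text, average_pitch_hz):
        """Estimate the user's emotion from the transcribed input voice and its
        average pitch, roughly following the rules described for the emotion
        estimation unit 512."""
        lowered = text.lower()
        # Rule 1: a predefined keyword in the text decides the emotion.
        for emotion, keywords in EMOTION_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                return emotion
        # Rule 2: a low average pitch is interpreted as "sad".
        if average_pitch_hz < SAD_PITCH_THRESHOLD_HZ:
            return "sad"
        return "neutral"

    print(estimate_emotion("I feel so lonely today", 180.0))  # -> sad (keyword)
    print(estimate_emotion("play something", 120.0))          # -> sad (low pitch)
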
  • Upon receiving the request, the music analysis unit 513 analyzes the music played back in accordance with the user's instruction (step S108) and outputs information indicating the analysis result (hereinafter referred to as “music information”) to the processing unit 510 that is the request source (step S109).
  • Upon receiving the request, the lyrics extraction unit 514 acquires the lyrics of the music played back in accordance with the user's instruction (step S110) and outputs information indicating the acquired lyrics (hereinafter referred to as “lyrics information”) to the processing unit 510 that is the request source (step S111).
  • In step S112, the processing unit 510 outputs the set of emotion information, music information, and lyrics information acquired from the emotion estimation unit 512, the music analysis unit 513, and the lyrics extraction unit 514 to the preference analysis unit 515.
  • The preference analysis unit 515 analyzes a plurality of such sets of information to obtain information indicating the user's preference. For this analysis, the preference analysis unit 515 records a plurality of sets of this information over a past period (for example, the period from the start of system operation to the present). In one example, the preference analysis unit 515 statistically processes the music information and calculates statistical representative values (for example, an average value, a mode value, or a median value). By this statistical processing, for example, the average value of the tempo and the mode values of the timbre, tune, rhythm, chord progression, composer name, songwriter name, and performer name are obtained.
  • In one example, the preference analysis unit 515 decomposes the lyrics indicated by the lyrics information into words using a technique such as morphological analysis, identifies the part of speech of each word, creates a histogram for words of a specific part of speech (for example, nouns), and identifies words whose appearance frequency is within a predetermined range (for example, the top 5%). Furthermore, the preference analysis unit 515 extracts from the lyrics information word groups that include an identified word and correspond to a predetermined syntactic unit (for example, a sentence, clause, or phrase). For example, when the word “like” appears frequently, word groups such as “I like you” and “I like it very much” are extracted from the lyrics information.
  • the preference analysis unit 515 may analyze a plurality of sets of information according to a predetermined algorithm different from simple statistical processing to obtain information indicating the user's preference.
  • the preference analysis unit 515 may receive feedback from the user and adjust the weights of these parameters according to the feedback.
  • the preference analysis unit 515 updates the classification table 5161 using the information obtained in step S113.
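
As a rough, non-normative sketch of the statistical processing and lyric-word extraction attributed to the preference analysis unit 515: a real implementation of the morphological analysis would use a dedicated analyzer (for example, MeCab for Japanese lyrics); here a plain whitespace split stands in for it, and the data layout is hypothetical.

    from collections import Counter
    from statistics import mean, mode

    def analyze_preference(music_records, lyrics_texts, top_ratio=0.05):
        """music_records: list of dicts with "tempo" and "timbre" keys.
        lyrics_texts: list of lyric strings of the played-back songs.
        Returns representative values and frequently appearing words."""
        # Statistical representative values of the music information.
        average_tempo = mean(record["tempo"] for record in music_records)
        typical_timbre = mode(record["timbre"] for record in music_records)

        # Word histogram over the lyrics; keep the most frequent words.
        words = [word for text in lyrics_texts for word in text.split()]
        counts = Counter(words)
        keep = max(1, int(len(counts) * top_ratio))
        frequent_words = [word for word, _ in counts.most_common(keep)]

        return {"tempo": average_tempo, "timbre": typical_timbre,
                "lyric_words": frequent_words}

    records = [{"tempo": 58, "timbre": "piano"}, {"tempo": 62, "timbre": "piano"}]
    lyrics = ["i like you", "i like it very much"]
    print(analyze_preference(records, lyrics))
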
  • FIG. 8 is a diagram illustrating a classification table 5161.
  • This figure shows a classification table 5161 for users whose user name is “Taro Yamada”.
  • the features, attributes, and lyrics of the music are recorded in association with the user's emotions.
  • In the classification table 5161 of this example, it is shown that when the user “Taro Yamada” has the emotion “happy”, the user prefers music whose lyrics include words related to “love”, whose tempo is around 60, whose chord progression is “I ⇒ V ⇒ VIm ⇒ IIIm ⇒ IV ⇒ I ⇒ IV ⇒ V”, and whose main timbre is piano.
  • The preference information recorded in the classification table 5161 accumulates as learning progresses, that is, as the cumulative usage time of the voice response system 1 increases, and comes to reflect the user's preference more closely. According to this example, information reflecting the user's preference can therefore be obtained automatically.
  • the preference analysis unit 515 may set the initial value of the classification table 5161 at a predetermined timing such as user registration or first login.
  • For example, the voice response system 1 may have the user select a character (for example, a so-called avatar) representing the user on the system, and may set a classification table 5161 having initial values corresponding to the selected character as the classification table corresponding to that user.
  • the data recorded in the classification table 5161 described in this embodiment is an example.
  • For example, the user's emotion need not be recorded in the classification table 5161, as long as at least lyrics are recorded.
  • Alternatively, lyrics need not be recorded in the classification table 5161, as long as at least the user's emotion and the result of the music analysis are recorded.
  • FIG. 9 is a diagram illustrating a functional configuration related to the song synthesis function 52.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a storage unit 516, a detection unit 521, a song generation unit 522, an accompaniment generation unit 523, and a synthesis unit 524.
  • the song generation unit 522 includes a melody generation unit 5221 and a lyrics generation unit 5222. In the following, description of elements common to the learning function 51 is omitted.
  • the storage unit 516 stores a segment database 5162.
  • the segment database is a database that records speech segment data used in singing synthesis.
  • the speech segment data is obtained by converting one or more phonemes into data.
  • A phoneme is the smallest unit of sound that serves to distinguish meaning in a language (for example, a vowel or a consonant), set in consideration of the actual articulation of the language and its phonological system as a whole.
  • the speech segment is obtained by cutting out a section corresponding to a desired phoneme or phoneme chain from the input speech uttered by a specific speaker.
  • the speech segment data in the present embodiment is data indicating the frequency spectrum of the speech segment.
  • the term “speech segment” includes a single phoneme (for example, a monophone) or a phoneme chain (for example, a diphone or a triphone).
  • the storage unit 516 may store a plurality of unit databases 5162.
  • The plurality of segment databases 5162 may include, for example, databases in which phonemes pronounced by different singers (or speakers) are recorded. Alternatively, the plurality of segment databases 5162 may include databases in which phonemes pronounced by a single singer (or speaker) with different singing styles or voice colors are recorded.
  • the song generation unit 522 generates a song voice, that is, synthesizes a song.
  • the singing voice is a voice uttered according to a given melody with given lyrics.
  • the melody generation unit 5221 generates a melody used for song synthesis.
  • the lyrics generation unit 5222 generates lyrics used for singing synthesis.
  • the melody generation unit 5221 and the lyric generation unit 5222 may generate melody and lyrics using information recorded in the classification table 5161.
  • the song generation unit 522 generates a song voice using the melody generated by the melody generation unit 5221 and the lyrics generated by the lyrics generation unit 5222.
  • The accompaniment generation unit 523 generates an accompaniment for the singing voice generated by the song generation unit 522.
  • The synthesis unit 524 synthesizes the singing voice using the singing generated by the song generation unit 522, the accompaniment generated by the accompaniment generation unit 523, and the speech segments recorded in the segment database 5162.
  • FIG. 10 is a flowchart showing an outline of the operation (song synthesis method) of the voice response system 1 according to the song synthesis function 52.
  • In step S21, the voice response system 1 determines (detects) whether an event that triggers singing synthesis has occurred.
  • The event that triggers singing synthesis is, for example, at least one of the following: an event in which a voice input is made by the user; an event registered in a calendar (for example, an alarm or the user's birthday); an event in which a singing synthesis instruction is input by a means other than voice (for example, an operation on a smartphone (not shown) wirelessly connected to the input / output device 10); and an event that occurs randomly.
  • When such an event is detected, the voice response system 1 moves the process to step S22.
  • Otherwise, the voice response system 1 waits until an event that triggers singing synthesis occurs.
  • In step S22, the voice response system 1 reads the singing synthesis parameters.
  • In step S23, the voice response system 1 generates lyrics.
  • In step S24, the voice response system 1 generates a melody.
  • In step S25, the voice response system 1 corrects one of the generated lyrics and melody to match the other.
  • In step S26, the voice response system 1 selects the segment database to be used (an example of a selection unit).
  • In step S27, the voice response system 1 performs singing synthesis using the lyrics, the melody, and the segment database obtained in steps S23 to S26.
  • In step S28, the voice response system 1 generates an accompaniment.
  • In step S29, the voice response system 1 synthesizes the singing voice and the accompaniment.
  • The processing of steps S23 to S29 is a part of the processing of step S18 in the flow of FIG. 6.
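
The flow of steps S22 to S29 can be summarized as the pipeline sketched below. Every function here is a deliberately trivial stand-in for the corresponding unit (the lyrics generation unit 5222, melody generation unit 5221, accompaniment generation unit 523, and so on); the data formats are assumptions made only to keep the sketch runnable.

    def read_parameters(user):                    # S22: read singing synthesis parameters
        return {"tempo": 60, "timbre": "piano"}

    def generate_lyrics(parameters):              # S23: generate lyrics (as a mora list)
        return ["to", "mo", "da", "chi"]

    def generate_melody(parameters, n_sounds):    # S24: generate a melody (note names)
        return ["C4", "D4", "E4", "D4", "C4"]

    def match_lyrics_to_melody(morae, notes):     # S25: correct one side to match the other
        morae = list(morae)
        while len(morae) < len(notes):
            morae.append("la")                    # pad the lyrics with a filler mora
        return morae, notes[:len(morae)]

    def select_segment_database(user):            # S26: select the segment database
        return "segment_db_female_pop"

    def synthesize_singing(morae, notes, database):   # S27: singing synthesis (placeholder)
        return list(zip(morae, notes))

    def generate_accompaniment(notes, parameters):    # S28: accompaniment (placeholder chords)
        return ["I", "V", "VIm", "IV"]

    def mix(singing, accompaniment):                  # S29: combine singing and accompaniment
        return {"singing": singing, "accompaniment": accompaniment}

    parameters = read_parameters("Taro Yamada")
    morae = generate_lyrics(parameters)
    notes = generate_melody(parameters, len(morae))
    morae, notes = match_lyrics_to_melody(morae, notes)
    database = select_segment_database("Taro Yamada")
    print(mix(synthesize_singing(morae, notes, database),
              generate_accompaniment(notes, parameters)))
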
  • Next, the song synthesis function 52 will be described in detail.
  • FIG. 11 is a sequence chart illustrating the operation of the voice response system 1 according to the song synthesis function 52.
  • the detection unit 521 requests the song generation unit 522 to perform song synthesis (step S201).
  • the request for song composition includes the user's identifier.
  • the song generation unit 522 inquires of the storage unit 516 about the user's preference (step S202). This query includes a user identifier.
  • the storage unit 516 reads the preference information corresponding to the user identifier included in the inquiry from the classification table 5161, and outputs the read preference information to the song generation unit 522 (step S203).
  • the song generation unit 522 inquires of the emotion estimation unit 512 about the user's emotion (step S204). This query includes a user identifier. When receiving the inquiry, the emotion estimation unit 512 outputs the emotion information of the user to the song generation unit 522 (step S205).
  • the song generation unit 522 selects a lyrics source.
  • the source of the lyrics is determined according to the input sound.
  • the source of the lyrics is roughly either the processing unit 510 or the classification table 5161.
  • The request for singing synthesis output from the processing unit 510 to the song generation unit 522 may or may not include lyrics (or lyrics material).
  • the lyric material is a character string that cannot form lyrics by itself but forms lyrics by combining with other lyric materials.
  • the case where the request for singing synthesis includes lyrics means, for example, a case where a response voice is output with a melody attached to the response itself by AI (“Tomorrow's weather is fine”, etc.).
  • Since the singing synthesis request is generated by the processing unit 510, it can be said that the source of the lyrics is the processing unit 510. Furthermore, since the processing unit 510 may acquire content from the content providing unit 60, it can also be said that the source of the lyrics is the content providing unit 60.
  • The content providing unit 60 is, for example, a server that provides news or a server that provides weather information. Alternatively, the content providing unit 60 may be a server that has a database in which the lyrics of existing music are recorded. Although only one content providing unit 60 is shown in the figure, a plurality of content providing units 60 may exist.
  • When the lyrics are included in the singing synthesis request, the song generation unit 522 selects the singing synthesis request as the source of the lyrics.
  • When the lyrics are not included in the singing synthesis request (for example, when the instruction given by the input voice does not specify the content of the lyrics, such as “sing something”), the song generation unit 522 selects the classification table 5161 as the source of the lyrics.
  • In step S207, the song generation unit 522 requests the selected source to provide lyrics material.
  • Here, the classification table 5161, that is, the storage unit 516, is selected as the source.
  • the request includes a user identifier and emotion information of the user.
  • the storage unit 516 extracts the lyrics material corresponding to the user identifier and emotion information included in the request from the classification table 5161 (step S208).
  • the storage unit 516 outputs the extracted lyric material to the song generation unit 522 (step S209).
  • the song generation unit 522 requests the lyrics generation unit 5222 to generate lyrics (step S210). This request includes the lyrics material obtained from the source.
  • the lyrics generation unit 5222 generates lyrics using the lyrics material (step S211).
  • the lyrics generation unit 5222 generates lyrics by combining a plurality of lyrics materials, for example.
  • Alternatively, each source may store lyrics for entire songs, and in that case the lyrics generation unit 5222 may select the lyrics of one song to be used for singing synthesis from the lyrics stored in the source.
  • the lyrics generation unit 5222 outputs the generated lyrics to the song generation unit 522 (step S212).
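
A minimal sketch of generating lyrics by combining lyric materials, as described for the lyrics generation unit 5222; the example materials and the target length are invented for illustration.

    import random

    def generate_lyrics(materials, target_lines=2, seed=0):
        """Combine short lyric materials (character strings that cannot form
        lyrics by themselves) into lyrics with the requested number of lines."""
        rng = random.Random(seed)
        return [rng.choice(materials) for _ in range(target_lines)]

    materials = ["I like you", "I like it very much", "under the spring sky"]
    print(generate_lyrics(materials))
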
  • the song generation unit 522 requests the melody generation unit 5221 to generate a melody.
  • This request includes the user's preference information and information specifying the number of sounds in the lyrics.
  • the information for specifying the number of sounds in the lyrics is the number of characters in the generated lyrics, the number of mora, or the number of syllables.
  • the melody generation unit 5221 generates a melody according to the preference information included in the request (step S214). Specifically, for example, as follows.
  • The melody generation unit 5221 generates the melody using a database in which melody materials (for example, note strings of about two or four bars in length, or information strings obtained by decomposing note strings into musical elements such as changes in rhythm and pitch) are recorded (hereinafter referred to as the “melody database”).
  • The melody database is stored in the storage unit 516, for example.
  • In the melody database, attributes of the melodies are recorded.
  • The attributes of a melody include, for example, music information such as a suitable tune or suitable lyrics, and a composer name.
  • The melody generation unit 5221 selects, from the materials recorded in the melody database, one or more materials that match the preference information included in the request, and combines the selected materials to obtain a melody of the desired length.
  • The melody generation unit 5221 outputs information specifying the generated melody (for example, sequence data such as MIDI) to the song generation unit 522 (step S215).
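
The material selection attributed to the melody generation unit 5221 might look like the following sketch; the melody database entries, the attribute names, and the matching rule are assumptions rather than the patent's actual data.

    # Hypothetical melody database: each material is a short note string
    # (roughly two or four bars) tagged with attributes.
    MELODY_DATABASE = [
        {"notes": ["C4", "E4", "G4", "E4"], "tune": "bright", "composer": "A"},
        {"notes": ["A3", "C4", "E4", "C4"], "tune": "dark", "composer": "B"},
        {"notes": ["G4", "F4", "E4", "D4"], "tune": "bright", "composer": "A"},
    ]

    def generate_melody(preference, n_sounds):
        """Select materials matching the preference information and combine
        them until the melody has at least n_sounds notes."""
        matching = [m for m in MELODY_DATABASE if m["tune"] == preference.get("tune")]
        candidates = matching or MELODY_DATABASE  # fall back when nothing matches
        melody = []
        index = 0
        while len(melody) < n_sounds:
            melody.extend(candidates[index % len(candidates)]["notes"])
            index += 1
        return melody[:n_sounds]

    print(generate_melody({"tune": "bright"}, 6))
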
  • The song generation unit 522 requests the melody generation unit 5221 to correct the melody, or requests the lyrics generation unit 5222 to correct the lyrics.
  • One purpose of this correction is to make the number of sounds in the lyrics (for example, the number of morae) match the number of notes in the melody. For example, when the number of morae in the lyrics is less than the number of notes in the melody (when there are not enough characters), the song generation unit 522 requests the lyrics generation unit 5222 to increase the number of characters in the lyrics. Alternatively, when the number of morae in the lyrics is greater than the number of notes in the melody (when there are characters left over), the song generation unit 522 requests the melody generation unit 5221 to increase the number of notes in the melody. In this figure, an example of correcting the lyrics is described.
  • the lyrics generation unit 5222 corrects the lyrics in response to the request for correction.
  • When the melody is to be corrected, the melody generation unit 5221 corrects the melody by, for example, dividing a note to increase the number of notes.
  • the lyric generation unit 5222 or the melody generation unit 5221 may adjust the lyric phrase delimiter to match the melody phrase delimiter.
  • the lyrics generation unit 5222 outputs the corrected lyrics to the song generation unit 522 (step S218).
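
The correction that makes the number of morae in the lyrics and the number of notes in the melody agree can be sketched as follows; padding with a filler mora and splitting the last note are toy rules assumed for illustration, not the patent's actual correction algorithm.

    def match_counts(morae, notes, filler="la"):
        """morae: list of lyric morae; notes: list of (pitch, duration) tuples.
        Returns a corrected (morae, notes) pair of equal length."""
        morae, notes = list(morae), list(notes)
        # Not enough characters: the lyrics side is extended
        # (here simply padded with a filler mora).
        while len(morae) < len(notes):
            morae.append(filler)
        # Characters left over: the melody side is extended
        # (here the last note is split into two shorter notes).
        while len(notes) < len(morae):
            pitch, duration = notes[-1]
            notes[-1] = (pitch, duration / 2)
            notes.append((pitch, duration / 2))
        return morae, notes

    morae = ["ko", "n", "ni", "chi", "wa"]
    notes = [("C4", 1.0), ("D4", 1.0), ("E4", 1.0)]
    print(match_counts(morae, notes))
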
  • the song generation unit 522 selects the segment database 5162 used for song synthesis (step S219).
  • the segment database 5162 is selected according to the user's attribute regarding the event that triggered the singing synthesis, for example.
  • the segment database 5162 may be selected according to the content of the event that triggered the song synthesis. Further alternatively, the segment database 5162 may be selected according to user preference information recorded in the classification table 5161.
  • the song generation unit 522 synthesizes the speech unit extracted from the selected unit database 5162 according to the lyrics and the melody obtained in the process so far, and obtains synthesized song data (step S220).
  • The classification table 5161 may also record information indicating the user's preferences regarding singing expression, such as changes in voice color, holding back the timing (“tame”), scooping up to the pitch (“shakuri”), and vibrato, and the song generation unit 522 may refer to this information to synthesize a singing that reflects expression matching the user's preference.
  • The song generation unit 522 outputs the generated synthesized singing data to the synthesis unit 524 (step S221).
  • the song generation unit 522 requests the accompaniment generation unit 523 to generate an accompaniment (S222).
  • This request includes information indicating a melody in singing synthesis.
  • the accompaniment generation unit 523 generates an accompaniment according to the melody included in the request (step S223).
  • a well-known technique is used as a technique for automatically adding an accompaniment to a melody.
  • When data indicating the chord progression of the melody (chord progression data) is generated together with the melody, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data.
  • Alternatively, when accompaniment chord progression data for a melody is recorded in the melody database, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data.
  • The accompaniment generation unit 523 may also store a plurality of pieces of accompaniment audio data in advance and read out the one that matches the chord progression of the melody.
  • Upon receiving the synthesized singing data and the accompaniment data, the synthesis unit 524 synthesizes the synthesized singing and the accompaniment (step S225). In this synthesis, the singing and the accompaniment are synchronized by matching their performance start positions and tempos. In this way, synthesized singing data with accompaniment is obtained. The synthesis unit 524 outputs the synthesized singing data.
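
The final step of the synthesis unit 524, synchronizing and combining the synthesized singing with the accompaniment, is sketched below using plain lists of samples; aligning the start positions by an offset and summing the samples is one simple assumed way of combining the two signals.

    def mix_singing_and_accompaniment(singing, accompaniment, start_offset=0):
        """singing, accompaniment: lists of float samples at the same rate.
        start_offset: sample index at which the singing starts, so that the
        performance start positions are aligned."""
        length = max(len(singing) + start_offset, len(accompaniment))
        mixed = [0.0] * length
        for i, sample in enumerate(singing):
            mixed[i + start_offset] += sample
        for i, sample in enumerate(accompaniment):
            mixed[i] += sample
        return mixed

    singing = [0.2, 0.3, 0.1]
    accompaniment = [0.1, 0.1, 0.1, 0.1, 0.1]
    print(mix_singing_and_accompaniment(singing, accompaniment, start_offset=1))
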
  • the voice response system 1 may generate a melody first, and then generate lyrics according to the melody.
  • In another example, the accompaniment need not be generated after the singing; the accompaniment may be generated first, and the singing may be synthesized in accordance with the accompaniment.
  • FIG. 12 is a diagram illustrating a functional configuration of the voice response system 1 according to the response function 53.
  • the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, and a content decomposition unit 531.
  • the content decomposition unit 531 decomposes one content into a plurality of partial contents.
  • the content refers to the content of information output as response voice, and specifically refers to, for example, music, news, recipes, or teaching materials (sports learning, instrument learning, learning drill, quiz).
  • FIG. 13 is a flowchart illustrating the operation of the voice response system 1 according to the response function 53.
  • the voice analysis unit 511 specifies content to be played back.
  • the content to be reproduced is specified according to, for example, the user input voice.
  • the voice analysis unit 511 analyzes the input voice and specifies the content instructed to be played by the input voice.
  • For example, the voice analysis unit 511 instructs the processing unit 510 to provide a “hamburger recipe”.
  • the processing unit 510 accesses the content providing unit 60 and acquires text data describing the “hamburger recipe”. The data acquired in this way is specified as the content to be played back.
  • the processing unit 510 notifies the content decomposition unit 531 of the identified content.
  • In step S32, the content decomposition unit 531 decomposes the content into a plurality of partial contents.
  • For example, a “hamburger recipe” is composed of a plurality of steps (cutting the ingredients, mixing the ingredients, molding, baking, and so on), so the content decomposition unit 531 decomposes the text of the “hamburger recipe” into four partial contents: “cutting the ingredients”, “mixing the ingredients”, “molding”, and “baking”.
  • the content decomposition position is automatically determined by, for example, AI.
  • a marker indicating a delimiter may be embedded in the content in advance, and the content may be decomposed at the position of the marker.
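
Decomposition of content into partial contents, either at markers embedded in advance or at simple boundaries, might look like the following sketch; the marker string and the sample recipe text are invented for illustration.

    MARKER = "<<step>>"  # hypothetical delimiter embedded in the content

    def decompose(content):
        """Split the content at embedded markers; fall back to splitting on
        blank lines when no marker is present."""
        parts = content.split(MARKER) if MARKER in content else content.split("\n\n")
        return [part.strip() for part in parts if part.strip()]

    recipe = ("Cut the ingredients.<<step>>Mix the ingredients."
              "<<step>>Mold the patties.<<step>>Bake them.")
    for number, partial in enumerate(decompose(recipe), 1):
        print(number, partial)
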
  • In step S33, the content decomposition unit 531 specifies one target partial content from among the plurality of partial contents (an example of a specifying unit).
  • the target partial content is the partial content to be played back, and is determined according to the positional relationship of the partial content in the original content.
  • In the example of the “hamburger recipe”, the content decomposition unit 531 first specifies “cutting the ingredients” as the target partial content.
  • The next time, the content decomposition unit 531 specifies “mixing the ingredients” as the target partial content.
  • the content decomposition unit 531 notifies the content modification unit 532 of the identified partial content.
  • the content correction unit 532 corrects the target partial content.
  • a specific correction method is defined according to the content. For example, the content correction unit 532 does not correct content such as news, weather information, and recipes.
  • For content such as teaching materials, the content correction unit 532 replaces the portion to be hidden as a question with another sound (for example, humming, “la la”, or a beep sound).
  • At this time, the content correction unit 532 performs the replacement using a character string having the same number of morae or syllables as the character string before the replacement.
  • the content correction unit 532 outputs the corrected partial content to the song generation unit 522.
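
Replacing the portion to be hidden with filler syllables of the same mora count, so that the replaced lyrics still fit the melody, can be sketched as follows; treating each list element as one mora is a naive stand-in for real mora counting.

    def hide_with_filler(morae, start, end, filler="la"):
        """Replace morae[start:end] with filler syllables of the same mora
        count, as described for the content correction unit 532."""
        hidden_length = end - start
        return morae[:start] + [filler] * hidden_length + morae[end:]

    # Hide the first four morae of a line while keeping the mora count.
    line = ["su", "i", "cchi", "A", "wo", "o", "su"]
    print(hide_with_filler(line, 0, 4))  # -> ['la', 'la', 'la', 'la', 'wo', 'o', 'su']
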
  • In step S35, the song generation unit 522 synthesizes a singing voice of the corrected partial content.
  • the singing voice generated by the singing generation unit 522 is finally output as a response voice from the input / output device 10.
  • the voice response system 1 waits for a user response (step S36).
  • In step S36, the voice response system 1 may output a singing or a voice that prompts the user to respond (for example, “Are you done?”).
  • the voice analysis unit 511 determines the next process according to the user response. When a response for prompting the reproduction of the next partial content is input (S36: next), the voice analysis unit 511 moves the process to step S33.
  • the response that prompts the reproduction of the next partial content is, for example, a voice such as “next step”, “completed”, “finished”, or the like.
  • When a response other than one prompting the reproduction of the next partial content is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop outputting the voice.
  • In step S37, the processing unit 510 stops the output of the synthesized voice of the partial content, at least temporarily.
  • In step S38, the processing unit 510 performs processing according to the user's input voice.
  • The processing in step S38 includes, for example, stopping the playback of the current content, performing a keyword search instructed by the user, and starting the playback of other content. For example, when a response such as “I want you to stop singing”, “End the song”, or “End” is input, the processing unit 510 stops the playback of the current content. For example, when a question-type response such as “How do I cut into strips?” or “What is aglio e olio?” is input, the processing unit 510 acquires content for answering the user's question from the content providing unit 60.
  • the processing unit 510 outputs a sound of an answer to the user's question. This answer may be spoken voice instead of singing.
  • When a response instructing the playback of other content, such as “Play XXX”, is input, the processing unit 510 acquires the instructed content from the content providing unit 60 and plays it back.
  • Note that the voice response system 1 may determine, according to the user's input voice or according to the content to be output, whether to decompose the content into partial contents or to output it as it is without decomposition.
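
The overall loop of FIG. 13 (sing one partial content, wait for the user's reaction, then continue, answer, or stop) can be outlined as below; the reaction keywords and helper behaviour are placeholders, and in the patent's flow the system may also resume the content after answering a question, which the sketch omits for brevity.

    NEXT_WORDS = {"next step", "completed", "finished"}
    STOP_WORDS = {"end", "stop singing"}

    def respond_with_song(partial_contents, reactions):
        """partial_contents: list of text chunks to be sung in order.
        reactions: iterable of user reactions (stands in for step S36)."""
        reactions = iter(reactions)
        for partial in partial_contents:
            print("SING:", partial)              # S35: output the partial content as singing
            reaction = next(reactions, "end")    # S36: wait for the user's reaction
            if reaction in NEXT_WORDS:
                continue                         # play back the next partial content
            if reaction in STOP_WORDS:
                print("STOP")                    # S37: stop the output
                return
            print("ANSWER:", reaction)           # S38: e.g. answer the user's question
            return

    steps = ["Cut the ingredients", "Mix the ingredients", "Mold", "Bake"]
    respond_with_song(steps, ["completed", "How do I cut into strips?"])
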
  • FIG. 14 is a diagram illustrating an operation example 1 of the voice response system 1.
  • the user requests the reproduction of the musical piece by the input voice of “Kazutaro Sato (performer name)“ Sakura Sakura ”(music name)”.
  • the voice response system 1 searches the music database according to the input voice and reproduces the requested music.
  • the voice response system 1 updates the classification table using the emotion of the user when the input voice is input and the analysis result of the music.
  • the classification table is updated every time music playback is requested.
  • the classification table more reflects the user's preference as the number of times the user requests the voice response system 1 to play a song increases (that is, as the cumulative usage time of the voice response system 1 increases). Go.
  • FIG. 15 is a diagram illustrating an operation example 2 of the voice response system 1.
  • the user requests singing synthesis with an input voice of "Sing something fun”.
  • the voice response system 1 performs singing synthesis according to the input voice.
  • the voice response system 1 refers to the classification table. Lyrics and melodies are generated using information recorded in the classification table. Therefore, it is possible to automatically create music that reflects the user's preferences.
  • FIG. 16 is a diagram illustrating an operation example 3 of the voice response system 1.
  • the user requests the provision of weather information by an input voice “What is the weather today?”.
  • the processing unit 510 accesses a server that provides weather information in the content providing unit 60 and acquires text indicating today's weather (for example, “Today is sunny all day”).
  • the processing unit 510 outputs a song synthesis request including the acquired text to the song generation unit 522.
  • the song generation unit 522 performs song synthesis using the text included in the request as lyrics.
  • As the answer to the input voice, the voice response system 1 outputs a singing voice in which a melody and an accompaniment are added to “Today is sunny all day”.
  • FIG. 17 is a diagram illustrating an operation example 4 of the voice response system 1.
  • the voice response system 1 asks the user a question in order to obtain information that can be used as a hint for generating lyrics, such as “Where is the meeting place?” And “When is the season?”.
  • The voice response system 1 generates lyrics using the user's answers to these questions. Since the usage period is still as short as two weeks, the classification table of the voice response system 1 does not yet sufficiently reflect the user's preference, and the association with emotions is also insufficient. Therefore, although the user actually prefers ballad-like music, the voice response system 1 may generate rock-like music that differs from that preference.
  • FIG. 18 is a diagram illustrating an operation example 5 of the voice response system 1.
  • This example shows an example in which the use of the voice response system 1 is further continued from the operation example 3, and the cumulative use period becomes one and a half months.
  • the classification table now reflects the user's preferences more closely, and the synthesized singing matches those preferences. The user can thus experience the responses of the voice response system 1, which were unsatisfactory at first, gradually changing to suit his or her taste.
  • FIG. 19 is a diagram illustrating an operation example 6 of the voice response system 1.
  • the user requests the provision of the “recipe” content for “hamburger” with the input voice “Tell me a recipe for hamburger”.
  • based on the fact that a “recipe” is content in which the user should proceed to the next step only after a given step is completed, the voice response system 1 decides to break the content down into partial contents and to play them in a manner in which the next step is determined according to the user's reaction.
  • the “recipe” for “hamburger” is decomposed step by step, and every time the singing for one step has been output, the voice response system 1 outputs a voice prompting the user's response, such as “Is it done?”.
  • the voice response system 1 outputs the singing of the next step in response.
  • the voice response system 1 outputs a singing of “chopped onion” in response.
  • the voice response system 1 then resumes the singing from the continuation of the “recipe” for “hamburger”.
  • the voice response system 1 may output the singing voice of another content between the singing voice of the first partial content and the singing voice of the second partial content that follows.
  • the voice response system 1 outputs, for example, a singing voice synthesized so as to have a time length corresponding to the matter indicated by the character string included in the first partial content, between the singing voice of the first partial content and that of the second partial content. Specifically, when the first partial content indicates that a waiting time of 20 minutes will occur, such as “Let's boil the ingredients for 20 minutes”, the voice response system 1 synthesizes a 20-minute song to be played while the ingredients are boiling.
  • alternatively, the voice response system 1 may synthesize a singing voice using a second character string corresponding to the matter indicated by the first character string included in the first partial content, and output it, after the singing voice of the first partial content, at a timing corresponding to the time length indicated by the first character string.
  • for example, this singing voice may be output 20 minutes after the first partial content is output. Alternatively, in the example where the first partial content is “Let's boil the ingredients for 20 minutes here”, when half of the waiting time (10 minutes) has passed, the system may sing something like “10 minutes left until the boiling is done”.
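A rough sketch of how the waiting time could be derived from the first partial content and used to plan the filler song and the half-way announcement follows. The regular expression and the returned dictionary keys are assumptions made only for illustration.

```python
import re
from typing import Optional

def parse_wait_minutes(step_text: str) -> Optional[int]:
    """Extract a waiting time such as '20 minutes' from the partial content text."""
    match = re.search(r"(\d+)\s*minutes?", step_text)
    return int(match.group(1)) if match else None

def plan_waiting_period(step_text: str) -> Optional[dict]:
    """Plan a filler song as long as the wait, plus a half-way reminder."""
    wait = parse_wait_minutes(step_text)
    if wait is None:
        return None
    return {"filler_song_minutes": wait, "halfway_reminder_at_minutes": wait // 2}

if __name__ == "__main__":
    print(plan_waiting_period("Let's boil the ingredients for 20 minutes"))
    # -> {'filler_song_minutes': 20, 'halfway_reminder_at_minutes': 10}
```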
  • FIG. 20 is a diagram illustrating an operation example 7 of the voice response system 1.
  • the user requests the provision of the “procedure manual” content with an input voice such as “Will you read the procedure manual for the process in the factory?”.
  • based on the fact that a “procedure manual” is content used to check the user's memory, the voice response system 1 decides to decompose the content into partial contents and to play them in a manner in which the next step is determined according to the user's reaction.
  • the voice response system 1 divides the procedure manual at random positions and breaks it down into a plurality of partial contents.
  • the voice response system 1 then waits for the user's reaction. For example, for the procedure “after pressing switch A, press switch B when the value of meter B becomes 10 or less”, the voice response system 1 sings only the part “after pressing switch A” and waits for the user's response.
  • when the user responds, the voice response system 1 outputs the singing of the next partial content.
  • the singing speed of the next partial content may be changed depending on whether or not the user was able to say the next partial content correctly. Specifically, when the user says the next partial content correctly, the voice response system 1 increases the speed at which the next partial content is sung; when the user cannot say it correctly, the voice response system 1 reduces that speed.
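The recall check and the tempo adjustment described above could look roughly like the sketch below. The random split, the similarity threshold, and the speed factors (1.2 and 0.8) are arbitrary illustrative values.

```python
import difflib
import random

def split_randomly(procedure: str):
    """Divide one procedure sentence at a random word boundary into two partial contents."""
    words = procedure.split()
    cut = random.randint(1, len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])

def next_singing_speed(expected: str, user_answer: str, base_speed: float = 1.0) -> float:
    """Sing faster when the user recalls the continuation correctly, slower otherwise."""
    similarity = difflib.SequenceMatcher(None, expected.lower(), user_answer.lower()).ratio()
    return base_speed * (1.2 if similarity > 0.8 else 0.8)

if __name__ == "__main__":
    sung_part, hidden_part = split_randomly(
        "after pressing switch A press switch B when the value of meter B becomes 10 or less")
    print("sung part:", sung_part)
    print("speed for next part:", next_singing_speed(hidden_part, "press switch B"))
```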
  • FIG. 21 is a diagram illustrating an operation example 8 of the voice response system 1.
  • operation example 8 is an operation example of a dementia countermeasure for elderly users. The fact that the user is an elderly person is set in advance by user registration or the like.
  • the voice response system 1 starts singing an existing song in accordance with, for example, a user instruction.
  • the voice response system 1 pauses the singing at a random position or at a predetermined position (for example, just before the chorus).
  • the voice response system 1 then outputs a remark such as “I don't know” or “I forgot”, behaving as if it had forgotten the lyrics.
  • the voice response system 1 waits for a user's response in this state.
  • when the user utters some words, the voice response system 1 treats the words uttered by the user as the correct lyrics and resumes the singing from the part that follows them. When the user utters something, the voice response system 1 may also output a response such as “Thank you”. When a predetermined time has elapsed while waiting for the user's response, the voice response system 1 may output a remark such as “I remembered” and resume singing from the continuation of the paused portion.
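The pause-and-resume behavior of this operation example is sketched below, with console input and output standing in for singing and speech. The sample lyric lines and prompts are placeholders, and the elapsed-time case is simplified to an empty reply.

```python
SAMPLE_LYRICS = ["haru ga kita", "haru ga kita", "doko ni kita",
                 "yama ni kita", "sato ni kita", "no ni mo kita"]

def lyrics_quiz(lines, pause_at=2, ask=input):
    """Sing up to the pause position, pretend to have forgotten the lyrics,
    wait for the user, then resume after whatever the user supplied."""
    for line in lines[:pause_at]:
        print("[singing]", line)
    print("[speaking] I forgot the next part...")

    answer = ask("Next line? ").strip()
    if answer:
        print("[speaking] Thank you!")
        resume = pause_at + 1   # treat the user's words as the correct paused line
    else:
        print("[speaking] Ah, I remembered!")
        resume = pause_at       # resume from the paused position

    for line in lines[resume:]:
        print("[singing]", line)

if __name__ == "__main__":
    lyrics_quiz(SAMPLE_LYRICS)
```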
  • FIG. 22 is a diagram illustrating an operation example 9 of the voice response system 1.
  • the user requests singing synthesis with an input voice of "Sing something fun".
  • the voice response system 1 performs singing synthesis according to the input voice.
  • the segment database used for singing synthesis is selected, for example, according to the character chosen at the time of user registration (for example, when a male character is selected, a segment database recorded by a male singer is used).
  • during the song, the user utters an input voice instructing a change of segment database, such as “Change to a female voice”.
  • the voice response system 1 switches the segment database used for singing synthesis according to a user's input voice.
  • the segment database may be switched while the voice response system 1 is outputting a singing voice, or while the voice response system 1 is waiting for a response from the user as in operation examples 7 and 8.
  • the voice response system 1 may have a plurality of segment databases that record phonemes that are pronounced by a single singer (or speaker) with different singing styles or voice colors.
  • the voice response system 1 may use segments extracted from a plurality of segment databases in combination at a certain ratio (usage ratio), that is, it may add the corresponding phonemes together at that ratio.
  • the voice response system 1 may determine the usage ratio according to the user's reaction. Specifically, when two segment databases, one with a normal voice and one with a sweet voice, are recorded for a single singer, the voice response system 1 increases the usage ratio of the sweet-voice segment database when the user utters the input voice “In a sweeter voice”, and increases that usage ratio further when the user utters “In a much sweeter voice”.
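One way to realize such a usage-ratio adjustment is sketched below; the 0.2 step size and the reaction keywords are illustrative assumptions.

```python
def adjust_usage_ratio(ratios: dict, reaction: str, step: float = 0.2) -> dict:
    """Shift the blend toward the sweet-voice segment database according to the reaction."""
    sweet = ratios.get("sweet", 0.0)
    text = reaction.lower()
    if "much sweeter" in text:
        sweet += 2 * step
    elif "sweeter" in text:
        sweet += step
    sweet = min(sweet, 1.0)
    return {"sweet": round(sweet, 2), "normal": round(1.0 - sweet, 2)}

if __name__ == "__main__":
    ratios = {"normal": 1.0, "sweet": 0.0}
    for utterance in ["In a sweeter voice", "In a much sweeter voice"]:
        ratios = adjust_usage_ratio(ratios, utterance)
        print(utterance, "->", ratios)
```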
  • the singing voice refers to a voice that includes singing in at least a part of it, and may include sections consisting only of accompaniment without singing, or sections consisting only of speech.
  • at least one partial content may not include a song.
  • Singing may also include raps or poetry readings.
  • an example in which the learning function 51, the singing synthesis function 52, and the response function 53 are related to each other has been described, but these functions may be provided independently of one another.
  • the classification table obtained by the learning function 51 may be used to know the user's preference in a music distribution system that distributes music, for example.
  • the singing synthesis function 52 may perform singing synthesis using a classification table manually input by the user.
  • at least some of the functional elements of the voice response system 1 may be omitted.
  • the voice response system 1 may not have the emotion estimation unit 512.
  • the voice analysis unit 511 and the emotion estimation unit 512 may be implemented in the input / output device.
  • as for the relative arrangement of the input / output device 10, the response engine 20, and the singing synthesis engine 30, the singing synthesis engine 30 may, for example, be arranged between the input / output device 10 and the response engine 20, and singing synthesis may be performed on those responses output from the response engine 20 that are determined to require it.
  • the content used in the voice response system 1 may be stored in a local device such as the input / output device 10 or a device capable of communicating with the input / output device 10.
  • the input / output device 10, the response engine 20, and the singing synthesis engine 30 may be implemented on, for example, a smartphone or a tablet terminal.
  • the user input to the voice response system 1 is not limited to voice input, and may be input via a touch screen, a keyboard, or a pointing device.
  • the input / output device 10 may have a human sensor.
  • the voice response system 1 may use the human sensor to control its operation depending on whether or not the user is nearby. For example, when it is determined that the user is not near the input / output device 10, the voice response system 1 may operate so as not to output voice (not to respond to speech).
  • alternatively, the voice response system 1 may output certain voices regardless of whether the user is near the input / output device 10. For example, as described in the second half of operation example 6, the voice response system 1 may output the voice announcing the remaining waiting time whether or not the user is near the input / output device 10.
  • a sensor other than a human sensor such as a camera or a temperature sensor may be used, or a plurality of sensors may be used in combination.
  • the programs executed in the input / output device 10, the response engine 20, and the singing synthesis engine 30 may be provided stored in a recording medium such as a CD-ROM or semiconductor memory, or may be provided by download via a network such as the Internet.
  • a singing voice can be output according to the interaction with the user, which is useful.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Pathology (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Psychology (AREA)
  • Educational Technology (AREA)
  • Developmental Disabilities (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

This method for outputting a singing voice has: a step (S31) of specifying a first partial content from among a plurality of partial contents obtained by analyzing content; a step (S35) of outputting a singing voice synthesized using a character string included in the first partial content; a step (S36) of accepting a reaction from a user to the singing voice; and a step (S35) of outputting, in accordance with the reaction, a singing voice synthesized using a character string included in a second partial content that continues from the first partial content.

Description

Singing voice output method and voice response system
The present invention relates to a technique for responding to user input using a voice that includes singing.
There are techniques for outputting music in response to user instructions. Patent Document 1 describes a technique for changing the atmosphere of music according to the user's situation and preferences. Patent Document 2 describes a technique for making distinctive music selections that do not become tiresome in a device that outputs musical sounds according to the state of a moving body.
[Patent Document 1] Japanese Unexamined Patent Application Publication No. 2006-85045
[Patent Document 2] Japanese Patent No. 4496993
Neither Patent Document 1 nor Patent Document 2, however, outputs a singing voice in accordance with the interaction with the user.
In view of this, the present invention provides a technique for outputting a singing voice in accordance with the interaction with the user.
The present invention provides a method for outputting a singing voice, the method including: a step of decomposing content into a plurality of partial contents; a step of specifying a first partial content from among the plurality of partial contents; a step of synthesizing a first singing voice using a character string included in the first partial content; a step of outputting the first singing voice; a step of accepting a user's reaction to the first singing voice; a step of specifying, in response to the user's reaction, a second partial content related to the first partial content; a step of synthesizing a second singing voice using a character string included in the second partial content; and a step of outputting the second singing voice. The content includes, for example, a character string.
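Purely as an illustration of this flow (decompose, sing the first partial content, accept the user's reaction, then sing the related second partial content), a minimal Python sketch follows. The decomposition by line breaks and the "stop" keyword are assumptions, not part of the claimed method.

```python
from typing import Callable, List

def decompose(content: str) -> List[str]:
    # Decomposition step: here, simply one partial content per line of text.
    return [line.strip() for line in content.splitlines() if line.strip()]

def synthesize_singing(text: str) -> str:
    # Stand-in for synthesizing a singing voice from a character string.
    return f"<sung: {text}>"

def output_as_song(content: str, get_reaction: Callable[[], str]) -> None:
    """Output the first partial content as singing, accept the user's reaction,
    and output the following partial content according to that reaction."""
    parts = decompose(content)
    index = 0
    while index < len(parts):
        print(synthesize_singing(parts[index]))  # output the current singing voice
        reaction = get_reaction()                # accept the user's reaction
        if reaction.lower().startswith("stop"):
            break                                # the reaction may also end the flow
        index += 1                               # specify the related next partial content

if __name__ == "__main__":
    recipe = "Chop the onion\nFry the onion\nMix with the ground meat"
    output_as_song(recipe, get_reaction=lambda: "done")
```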
The singing voice output method may include a step of determining, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
The element may include a parameter, melody, or tempo of the singing synthesis, or an arrangement of the accompaniment in the singing voice.
The first singing voice and the second singing voice may be synthesized using segments recorded in at least one database selected from among a plurality of databases, and the singing voice output method may include a step of selecting, in response to the user's reaction, the database used when synthesizing singing with the character string included in the second partial content.
The first singing voice and the second singing voice may be synthesized using segments recorded in a plurality of databases selected from among the plurality of databases; in the step of selecting a database, a plurality of databases may be selected, and the singing voice output method may include a step of determining the usage ratio of the plurality of databases according to the user's reaction.
The singing voice output method may include a step of replacing a part of the character string included in the first partial content with another character string, and in the step of synthesizing the first singing voice, the first singing voice may be synthesized using the character string of the first partial content in which that part has been replaced with the other character string.
The other character string and the character string to be replaced may have the same number of syllables or the same number of morae.
The singing voice output method may include a step of replacing, in response to the user's reaction, a part of the second partial content with another character string, and in the step of synthesizing the second singing voice, the second singing voice may be synthesized using the character string of the second partial content in which that part has been replaced with the other character string.
The singing voice output method may include a step of synthesizing a third singing voice so as to have a time length corresponding to the matter indicated by the character string included in the first partial content, and a step of outputting the third singing voice between the first singing voice and the second singing voice.
The singing voice output method may include a step of synthesizing a fourth singing voice using a second character string corresponding to the matter indicated by a first character string included in the first partial content, and a step of outputting, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length corresponding to the matter indicated by the first character string.
The present invention also provides an information processing system including: a decomposition unit that decomposes content into a plurality of partial contents; a specifying unit that specifies a first partial content from among the plurality of partial contents; a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content; an output unit that outputs the first singing voice; and a reception unit that accepts a user's reaction to the first singing voice, wherein, in response to the user's reaction, the specifying unit specifies a second partial content related to the first partial content, the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and the output unit outputs the second singing voice.
According to the present invention, a singing voice can be output in accordance with the interaction with the user.
FIG. 1 is a diagram illustrating an outline of a voice response system 1 according to an embodiment.
FIG. 2 is a diagram illustrating an outline of the functions of the voice response system 1.
FIG. 3 is a diagram illustrating the hardware configuration of the input / output device 10.
FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30.
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51.
FIG. 6 is a flowchart showing an outline of the operation related to the learning function 51.
FIG. 7 is a sequence chart illustrating the operation related to the learning function 51.
FIG. 8 is a diagram illustrating a classification table 5161.
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52.
FIG. 10 is a flowchart showing an outline of the operation related to the singing synthesis function 52.
FIG. 11 is a sequence chart illustrating the operation related to the singing synthesis function 52.
FIG. 12 is a diagram illustrating the functional configuration related to the response function 53.
FIG. 13 is a flowchart illustrating the operation related to the response function 53.
FIG. 14 is a diagram showing an operation example 1 of the voice response system 1.
FIG. 15 is a diagram showing an operation example 2 of the voice response system 1.
FIG. 16 is a diagram showing an operation example 3 of the voice response system 1.
FIG. 17 is a diagram showing an operation example 4 of the voice response system 1.
FIG. 18 is a diagram showing an operation example 5 of the voice response system 1.
FIG. 19 is a diagram showing an operation example 6 of the voice response system 1.
FIG. 20 is a diagram showing an operation example 7 of the voice response system 1.
FIG. 21 is a diagram showing an operation example 8 of the voice response system 1.
FIG. 22 is a diagram showing an operation example 9 of the voice response system 1.
1. System overview
FIG. 1 is a diagram illustrating an overview of a voice response system 1 according to an embodiment. The voice response system 1 is a system that automatically outputs a voice response when the user gives an input (or instruction) by voice, that is, a so-called AI (Artificial Intelligence) voice assistant. Hereinafter, the voice input from the user to the voice response system 1 is referred to as the "input voice", and the voice output from the voice response system 1 in response to the input voice is referred to as the "response voice". The voice response includes singing. The voice response system 1 is an example of a singing synthesis system. For example, when the user says "Sing something" to the voice response system 1, the voice response system 1 automatically synthesizes a song and outputs the synthesized singing.
The voice response system 1 includes an input / output device 10, a response engine 20, and a singing synthesis engine 30. The input / output device 10 provides the man-machine interface: it accepts the input voice from the user and outputs the response voice to that input voice. The response engine 20 analyzes the input voice accepted by the input / output device 10 and generates the response voice. At least a part of the response voice includes a singing voice. The singing synthesis engine 30 synthesizes the singing voice used in the response voice.
FIG. 2 is a diagram illustrating an overview of the functions of the voice response system 1. The voice response system 1 has a learning function 51, a singing synthesis function 52, and a response function 53. The response function 53 analyzes the user's input voice and provides a response voice based on the analysis result, and is provided by the input / output device 10 and the response engine 20. The learning function 51 learns the user's preferences from the user's input voice and is provided by the singing synthesis engine 30. The singing synthesis function 52 synthesizes the singing voice used in the response voice and is provided by the singing synthesis engine 30. The learning function 51 learns the user's preferences using the analysis results obtained by the response function 53. The singing synthesis function 52 synthesizes singing voices based on the learning performed by the learning function 51. The response function 53 responds using the singing voice synthesized by the singing synthesis function 52.
FIG. 3 is a diagram illustrating the hardware configuration of the input / output device 10. The input / output device 10 includes a microphone 101, an input signal processing unit 102, an output signal processing unit 103, a speaker 104, a CPU (Central Processing Unit) 105, a sensor 106, a motor 107, and a network IF 108. The microphone 101 converts the user's voice into an electric signal (input sound signal). The input signal processing unit 102 performs processing such as analog-to-digital conversion on the input sound signal and outputs data representing the input voice (hereinafter "input voice data"). The output signal processing unit 103 performs processing such as digital-to-analog conversion on data representing the response voice (hereinafter "response voice data") and outputs an output sound signal. The speaker 104 converts the output sound signal into sound (outputs sound based on the output sound signal). The CPU 105 controls the other elements of the input / output device 10 and reads and executes programs from a memory (not shown). The sensor 106 detects the position of the user (the direction of the user as seen from the input / output device 10) and is, for example, an infrared sensor or an ultrasonic sensor. The motor 107 changes the orientation of at least one of the microphone 101 and the speaker 104 so that it faces the direction in which the user is present. The microphone 101 may be configured as a microphone array, and the CPU 105 may detect the direction of the user based on the sound picked up by the microphone array. The network IF 108 is an interface for communicating via a network (for example, the Internet) and includes, for example, an antenna and a chipset for communication in accordance with a predetermined wireless communication standard (for example, WiFi (registered trademark)).
FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30. The response engine 20 includes a CPU 201, a memory 202, a storage 203, and a communication IF 204. The CPU 201 performs various operations according to programs and controls the other elements of the computer device. The memory 202 is a main storage device that functions as a work area when the CPU 201 executes programs and includes, for example, a RAM (Random Access Memory). The storage 203 is a nonvolatile auxiliary storage device that stores various programs and data and includes, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The communication IF 204 includes a connector and a chipset for communication in accordance with a predetermined communication standard (for example, Ethernet). The storage 203 stores a program for causing the computer device to function as the response engine 20 of the voice response system 1 (hereinafter the "response program"). When the CPU 201 executes the response program, the computer device functions as the response engine 20. The response engine 20 is, for example, a so-called AI.
The singing synthesis engine 30 includes a CPU 301, a memory 302, a storage 303, and a communication IF 304. The details of each element are the same as those of the response engine 20. The storage 303 stores a program for causing the computer device to function as the singing synthesis engine 30 of the voice response system 1 (hereinafter the "singing synthesis program"). When the CPU 301 executes the singing synthesis program, the computer device functions as the singing synthesis engine 30.
The response engine 20 and the singing synthesis engine 30 are provided as cloud services on the Internet. Note that the response engine 20 and the singing synthesis engine 30 may also be provided as services that do not rely on cloud computing.
2. Learning function
2-1. Configuration
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51. As functional elements related to the learning function 51, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a music analysis unit 513, a lyrics extraction unit 514, a preference analysis unit 515, a storage unit 516, and a processing unit 510. The input / output device 10 functions as a reception unit that accepts the user's input voice and as an output unit that outputs the response voice.
The voice analysis unit 511 analyzes the input voice. This analysis is processing for obtaining, from the input voice, information used to generate the response voice. Specifically, it includes processing for converting the input voice into text (that is, into a character string), processing for determining the user's request from the obtained text, processing for identifying the content providing unit 60 that provides content in response to the user's request, processing for issuing an instruction to the identified content providing unit 60, processing for acquiring data from the content providing unit 60, and processing for generating a response using the acquired data. In this example, the content providing unit 60 is a system external to the voice response system 1. The content providing unit 60 provides a service (for example, a music streaming service or internet radio) that outputs data for reproducing content such as music as sound (hereinafter "music data"), and is, for example, a server external to the voice response system 1.
The music analysis unit 513 analyzes the music data output from the content providing unit 60. Analyzing music data means processing for extracting the features of the music. The features of the music include at least one of tune, rhythm, chord progression, tempo, and arrangement. A known technique is used for feature extraction.
The lyrics extraction unit 514 extracts lyrics from the music data output from the content providing unit 60. In one example, the music data includes metadata in addition to sound data. The sound data represents the signal waveform of the music and includes, for example, uncompressed data such as PCM (Pulse Code Modulation) data or compressed data such as MP3 data. The metadata is data containing information related to the music, for example attributes of the music such as the title, performer name, composer name, lyricist name, album title, and genre, as well as information such as the lyrics. The lyrics extraction unit 514 extracts the lyrics from the metadata included in the music data. When the music data does not include metadata, the lyrics extraction unit 514 performs speech recognition on the sound data and extracts the lyrics from the text obtained by the speech recognition.
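The metadata-first, recognition-fallback behavior of the lyrics extraction unit 514 could be sketched as follows. The dictionary layout of the music data and the placeholder recognizer are assumptions made only for illustration.

```python
from typing import Optional

def recognize_lyrics_from_audio(sound_data: bytes) -> str:
    # Placeholder for a speech-recognition pass over the sound data;
    # a real system would call an ASR engine here.
    return "(lyrics recognized from the audio)"

def extract_lyrics(music_data: dict) -> Optional[str]:
    """Prefer the lyrics field of the metadata; fall back to speech recognition
    on the sound data when no metadata is available."""
    metadata = music_data.get("metadata") or {}
    if "lyrics" in metadata:
        return metadata["lyrics"]
    sound_data = music_data.get("sound_data")
    return recognize_lyrics_from_audio(sound_data) if sound_data else None

if __name__ == "__main__":
    print(extract_lyrics({"metadata": {"title": "Sakura Sakura", "lyrics": "sakura sakura ..."}}))
    print(extract_lyrics({"sound_data": b"\x00\x01"}))
```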
The emotion estimation unit 512 estimates the user's emotion. The emotion estimation unit 512 estimates the user's emotion from the input voice, using a known emotion estimation technique. The emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. The emotion estimation unit 512 may also estimate the user's emotion based on the input voice converted into text by the voice analysis unit 511 or on the analyzed user request.
The preference analysis unit 515 generates information indicating the user's preferences (hereinafter "preference information") using at least one of the playback history, analysis results, and lyrics of the music the user has requested, and the user's emotion at the time the playback of that music was requested. The preference analysis unit 515 updates the classification table 5161 stored in the storage unit 516 using the generated preference information. The classification table 5161 is a table (or database) in which the user's preferences are recorded; for example, it records, for each user and each emotion, the features of music (for example, timbre, tune, rhythm, chord progression, and tempo), the attributes of music (performer name, composer name, lyricist name, and genre), and lyrics. The storage unit 516 is an example of a reading unit that reads, from a table in which parameters used for singing synthesis are recorded in association with users, the parameters corresponding to the user who input the trigger. The parameters used for singing synthesis are data referred to during singing synthesis, and the classification table 5161 is a concept that encompasses timbre, tune, rhythm, chord progression, tempo, performer name, composer name, lyricist name, genre, and lyrics.
2-2. Operation
FIG. 6 is a flowchart showing an outline of the operation of the voice response system 1 related to the learning function 51. In step S11, the voice response system 1 analyzes the input voice. In step S12, the voice response system 1 performs the processing instructed by the input voice. In step S13, the voice response system 1 determines whether the input voice includes an item to be learned. When it is determined that the input voice includes an item to be learned (S13: YES), the voice response system 1 proceeds to step S14. When it is determined that the input voice does not include an item to be learned (S13: NO), the voice response system 1 proceeds to step S18. In step S14, the voice response system 1 estimates the user's emotion. In step S15, the voice response system 1 analyzes the music whose playback was requested. In step S16, the voice response system 1 acquires the lyrics of the music whose playback was requested. In step S17, the voice response system 1 updates the classification table using the information obtained in steps S14 to S16.
The processing from step S18 onward is not directly related to the learning function 51, that is, to updating the classification table, but includes processing that uses the classification table. In step S18, the voice response system 1 generates a response voice to the input voice, referring to the classification table as necessary. In step S19, the voice response system 1 outputs the response voice.
FIG. 7 is a sequence chart illustrating the operation of the voice response system 1 related to the learning function 51. The user performs user registration with the voice response system 1, for example when subscribing to the voice response system 1 or when starting it for the first time. User registration includes setting a user name (or login ID) and a password. At the start of the sequence in FIG. 7, the input / output device 10 is running and the user's login processing has been completed. That is, in the voice response system 1, the user who is using the input / output device 10 has been identified. The input / output device 10 is waiting for the user's voice input (utterance). Note that the method by which the voice response system 1 identifies the user is not limited to login processing. For example, the voice response system 1 may identify the user based on the input voice.
In step S101, the input / output device 10 accepts the input voice. The input / output device 10 converts the input voice into data and generates voice data. The voice data includes sound data representing the signal waveform of the input voice, and a header. The header contains information indicating the attributes of the input voice. The attributes of the input voice include, for example, an identifier specifying the input / output device 10, the user identifier (for example, user name or login ID) of the user who uttered the voice, and a time stamp indicating the time at which the voice was uttered. In step S102, the input / output device 10 outputs the voice data representing the input voice to the voice analysis unit 511.
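The structure of the voice data described in step S101 can be pictured with the following sketch; the field names are illustrative, not taken from the specification.

```python
from dataclasses import dataclass
import time

@dataclass
class VoiceDataHeader:
    device_id: str     # identifier specifying the input / output device 10
    user_id: str       # user name or login ID of the speaker
    timestamp: float   # time at which the voice was uttered

@dataclass
class VoiceData:
    header: VoiceDataHeader
    sound_data: bytes  # signal waveform of the input voice

if __name__ == "__main__":
    packet = VoiceData(
        header=VoiceDataHeader(device_id="io-device-10",
                               user_id="taro.yamada",
                               timestamp=time.time()),
        sound_data=b"...pcm samples...")
    print(packet.header)
```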
In step S103, the voice analysis unit 511 analyzes the input voice using the voice data. In this analysis, the voice analysis unit 511 determines whether the input voice includes an item to be learned. An item to be learned is an item that identifies a piece of music, specifically an instruction to play a piece of music.
In step S104, the processing unit 510 performs the processing instructed by the input voice. The processing performed by the processing unit 510 is, for example, streaming playback of music. In this case, the content providing unit 60 has a music database in which a plurality of pieces of music data are recorded. The processing unit 510 reads the music data of the requested piece from the music database and transmits the read music data to the input / output device 10 from which the input voice was sent. In another example, the processing performed by the processing unit 510 is playback of internet radio. In this case, the content providing unit 60 performs streaming broadcasting of radio audio. The processing unit 510 transmits the streaming data received from the content providing unit 60 to the input / output device 10 from which the input voice was sent.
When it is determined in step S103 that the input voice includes an item to be learned, the processing unit 510 further performs processing for updating the classification table (step S105). The processing for updating the classification table includes a request for emotion estimation to the emotion estimation unit 512 (step S1051), a request for music analysis to the music analysis unit 513 (step S1052), and a request for lyrics extraction to the lyrics extraction unit 514 (step S1053).
When emotion estimation is requested, the emotion estimation unit 512 estimates the user's emotion (step S106) and outputs information indicating the estimated emotion (hereinafter "emotion information") to the processing unit 510 that made the request (step S107). The emotion estimation unit 512 estimates the user's emotion using the input voice. For example, the emotion estimation unit 512 estimates the emotion based on the input voice converted into text. In one example, keywords indicating emotions are defined in advance, and when the text of the input voice contains such a keyword, the emotion estimation unit 512 determines that the user has that emotion (for example, if an expletive such as "Damn it" is included, it determines that the user's emotion is "anger"). In another example, the emotion estimation unit 512 estimates the emotion based on the pitch, volume, or speed of the input voice, or on their changes over time. In one example, when the average pitch of the input voice is lower than a threshold, the emotion estimation unit 512 determines that the user's emotion is "sad". In another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Specifically, when the pitch of the user's response is low even though the pitch of the voice output by the voice response system 1 is high, the emotion estimation unit 512 determines that the user's emotion is "sad". In yet another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the pitch at the end of the output voice and the pitch of the user's response to it. Alternatively, the emotion estimation unit 512 may estimate the user's emotion by considering a combination of these factors.
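The heuristics above can be combined as in the following sketch; the keyword list, the 150 Hz threshold, and the 1.5 pitch ratio are invented illustrative values.

```python
ANGER_KEYWORDS = {"damn"}   # example of keywords defined in advance

def estimate_emotion(text: str, user_pitch_hz: float, system_pitch_hz: float,
                     low_pitch_threshold_hz: float = 150.0) -> str:
    """Keyword match first, then the user's pitch, then the relation
    between the system's pitch and the user's pitch."""
    if any(keyword in text.lower() for keyword in ANGER_KEYWORDS):
        return "anger"
    if user_pitch_hz < low_pitch_threshold_hz:
        return "sad"
    if system_pitch_hz > user_pitch_hz * 1.5:
        return "sad"   # the system spoke high but the user answered low
    return "neutral"

if __name__ == "__main__":
    print(estimate_emotion("damn, not that song", user_pitch_hz=180, system_pitch_hz=200))
    print(estimate_emotion("play something", user_pitch_hz=120, system_pitch_hz=200))
```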
In another example, the emotion estimation unit 512 may estimate the user's emotion using inputs other than voice. As inputs other than voice, for example, video of the user's face captured by a camera, the user's body temperature detected by a temperature sensor, or a combination of these may be used. Specifically, the emotion estimation unit 512 determines from the user's facial expression whether the user's emotion is "happy", "angry", or "sad". The emotion estimation unit 512 may also determine the user's emotion based on changes in facial expression in video of the user's face. Alternatively, the emotion estimation unit 512 may determine "angry" when the user's body temperature is high and "sad" when it is low.
When music analysis is requested, the music analysis unit 513 analyzes the music played in response to the user's instruction (step S108) and outputs information indicating the analysis result (hereinafter "music information") to the processing unit 510 that made the request (step S109).
When lyrics extraction is requested, the lyrics extraction unit 514 acquires the lyrics of the music played in response to the user's instruction (step S110) and outputs information indicating the acquired lyrics (hereinafter "lyrics information") to the processing unit 510 that made the request (step S111).
In step S112, the processing unit 510 outputs the set of emotion information, music information, and lyrics information acquired from the emotion estimation unit 512, the music analysis unit 513, and the lyrics extraction unit 514 to the preference analysis unit 515.
In step S113, the preference analysis unit 515 analyzes multiple sets of this information to obtain information indicating the user's preferences. For this analysis, the preference analysis unit 515 records a plurality of such sets over a certain past period (for example, from the start of system operation to the present). In one example, the preference analysis unit 515 statistically processes the music information and calculates statistical representative values (for example, the mean, mode, or median). This statistical processing yields, for example, the average tempo and the most frequent timbre, tune, rhythm, chord progression, composer name, lyricist name, and performer name. The preference analysis unit 515 also decomposes the lyrics indicated by the lyrics information into words using a technique such as morphological analysis, identifies the part of speech of each word, creates a histogram for words of a specific part of speech (for example, nouns), and identifies words whose frequency of appearance falls within a predetermined range (for example, the top 5%). Furthermore, the preference analysis unit 515 extracts from the lyrics information word groups that contain an identified word and correspond to a predetermined syntactic unit (for example, a sentence, clause, or phrase). For example, when the word "like" appears frequently, word groups containing it, such as "I like you like that" and "because I like you so much", are extracted from the lyrics information. These averages, modes, and word groups are examples of information (parameters) indicating the user's preferences. Alternatively, the preference analysis unit 515 may analyze the sets of information according to a predetermined algorithm other than simple statistical processing to obtain information indicating the user's preferences. The preference analysis unit 515 may also receive feedback from the user and adjust the weights of these parameters according to the feedback. In step S114, the preference analysis unit 515 updates the classification table 5161 using the information obtained in step S113.
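A toy version of the statistical processing in step S113 is sketched below; the history format and the rough "top 5%" cut-off are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def summarize_preferences(history: list) -> dict:
    """Compute representative values (average tempo, most frequent chord progression
    and timbre) and the most frequent lyric nouns from a playback history."""
    tempos = [entry["tempo"] for entry in history]
    chords = Counter(entry["chords"] for entry in history)
    timbres = Counter(entry["timbre"] for entry in history)
    nouns = Counter(word for entry in history for word in entry["lyric_nouns"])
    top_count = max(1, len(nouns) // 20)          # roughly the top 5% of nouns
    return {
        "avg_tempo": mean(tempos),
        "top_chord_progression": chords.most_common(1)[0][0],
        "top_timbre": timbres.most_common(1)[0][0],
        "frequent_words": [word for word, _ in nouns.most_common(top_count)],
    }

if __name__ == "__main__":
    history = [
        {"tempo": 58, "chords": "I-V-VIm-IIIm-IV-I-IV-V", "timbre": "piano",
         "lyric_nouns": ["love", "spring"]},
        {"tempo": 62, "chords": "I-V-VIm-IIIm-IV-I-IV-V", "timbre": "piano",
         "lyric_nouns": ["love", "night"]},
    ]
    print(summarize_preferences(history))
```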
FIG. 8 is a diagram illustrating the classification table 5161. This figure shows the classification table 5161 for the user whose user name is "Taro Yamada". In the classification table 5161, the features, attributes, and lyrics of music are recorded in association with the user's emotions. Referring to the classification table 5161 shows, for example, that when the user "Taro Yamada" feels "happy", he prefers music whose lyrics include the words "恋" (romantic love), "愛" (love), and "love", whose tempo is about 60, which has the chord progression "I → V → VIm → IIIm → IV → I → IV → V", and in which the piano timbre is dominant. According to this embodiment, information indicating the user's preferences can be obtained automatically. The preference information recorded in the classification table 5161 accumulates as learning progresses, that is, as the cumulative usage time of the voice response system 1 increases, and comes to reflect the user's preferences more closely. According to this example, information reflecting the user's preferences can be obtained automatically.
Note that the preference analysis unit 515 may set the initial values of the classification table 5161 at a predetermined timing, such as at user registration or at the first login. In this case, the voice response system 1 may have the user select a character representing the user on the system (for example, a so-called avatar) and set a classification table 5161 having initial values corresponding to the selected character as the classification table for that user.
The data recorded in the classification table 5161 described in this embodiment is only an example. For example, the classification table 5161 need not record the user's emotions as long as it records at least the lyrics. Alternatively, the classification table 5161 need not record the lyrics as long as it records at least the user's emotions and the results of the music analysis.
3. Singing synthesis function
3-1. Configuration
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52. As functional elements related to the singing synthesis function 52, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, a storage unit 516, a detection unit 521, a singing generation unit 522, an accompaniment generation unit 523, and a synthesis unit 524. The singing generation unit 522 includes a melody generation unit 5221 and a lyrics generation unit 5222. Description of the elements shared with the learning function 51 is omitted below.
Regarding the singing synthesis function 52, the storage unit 516 stores a segment database 5162. The segment database records the speech segment data used in singing synthesis. Speech segment data is data representing one or more phonemes. A phoneme corresponds to the smallest unit of linguistic meaning distinction (for example, a vowel or a consonant) and is the smallest phonological unit of a language, defined in consideration of the actual articulation of that language and its entire phonological system. A speech segment is a section corresponding to a desired phoneme or phoneme chain cut out from input speech uttered by a specific speaker. The speech segment data in this embodiment is data representing the frequency spectrum of the speech segment. In the following description, the term "speech segment" covers both a single phoneme (for example, a monophone) and a phoneme chain (for example, a diphone or a triphone).
The storage unit 516 may store a plurality of segment databases 5162. The plurality of segment databases 5162 may include, for example, databases that record phonemes pronounced by different singers (or speakers). Alternatively, they may include databases that record phonemes pronounced by a single singer (or speaker) in different singing styles or voice colors.
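As a rough illustration of how multiple segment databases per singer or voice color might be organized, the sketch below models each database as a mapping from a phoneme or diphone label to a frequency-spectrum array. The key format, array shape, and function names are assumptions made for this sketch only.

```python
import numpy as np

# A speech segment is modeled here as a frequency-spectrum matrix (frames x bins);
# the actual encoding used by the system may differ.
SegmentDB = dict[str, np.ndarray]

def make_db(keys: list[str], bins: int = 128, seed: int = 0) -> SegmentDB:
    """Build a dummy database with random spectra, for illustration only."""
    rng = np.random.default_rng(seed)
    return {k: rng.random((10, bins)) for k in keys}

units = ["a", "k-a", "a-i", "s-a"]           # monophones and diphones
segment_databases: dict[str, SegmentDB] = {  # one database per singer / voice color
    "singer_A_normal": make_db(units, seed=1),
    "singer_A_sweet": make_db(units, seed=2),
    "singer_B": make_db(units, seed=3),
}

def lookup(db_name: str, unit: str) -> np.ndarray:
    """Fetch the spectrum of a phoneme or diphone from the named database."""
    return segment_databases[db_name][unit]

print(lookup("singer_A_sweet", "k-a").shape)   # (10, 128)
```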
The singing generation unit 522 generates a singing voice, that is, performs singing synthesis. A singing voice is a voice in which given lyrics are uttered according to a given melody. The melody generation unit 5221 generates the melody used for singing synthesis, and the lyrics generation unit 5222 generates the lyrics used for singing synthesis. The melody generation unit 5221 and the lyrics generation unit 5222 may generate the melody and lyrics using information recorded in the classification table 5161. The singing generation unit 522 generates a singing voice using the melody generated by the melody generation unit 5221 and the lyrics generated by the lyrics generation unit 5222. The accompaniment generation unit 523 generates an accompaniment for the singing voice. The synthesis unit 524 synthesizes the output singing voice using the singing voice generated by the singing generation unit 522, the accompaniment generated by the accompaniment generation unit 523, and the speech segments recorded in the segment database 5162.
3-2. Operation
FIG. 10 is a flowchart outlining the operation (singing synthesis method) of the voice response system 1 related to the singing synthesis function 52. In step S21, the voice response system 1 determines (detects) whether an event that triggers singing synthesis has occurred. Events that trigger singing synthesis include, for example, at least one of the following: a voice input from the user, an event registered in a calendar (for example, an alarm or the user's birthday), an event in which an instruction for singing synthesis is input by a means other than voice (for example, an operation on a smartphone (not shown) wirelessly connected to the input/output device 10), and a randomly occurring event. When it is determined that an event triggering singing synthesis has occurred (S21: YES), the voice response system 1 proceeds to step S22. When it is determined that no such event has occurred (S21: NO), the voice response system 1 waits until an event triggering singing synthesis occurs.
In step S22, the voice response system 1 reads the singing synthesis parameters. In step S23, the voice response system 1 generates lyrics. In step S24, the voice response system 1 generates a melody. In step S25, the voice response system 1 modifies one of the generated lyrics and melody to match the other. In step S26, the voice response system 1 selects the segment database to be used (an example of a selection unit). In step S27, the voice response system 1 performs singing synthesis using the melody, lyrics, and segment database obtained in steps S23 to S26. In step S28, the voice response system 1 generates an accompaniment. In step S29, the voice response system 1 combines the singing voice and the accompaniment. The processing of steps S23 to S29 is part of the processing of step S18 in the flow of FIG. 6. The operation of the voice response system 1 related to the singing synthesis function 52 is described in more detail below.
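Read as a whole, steps S22 to S29 form a pipeline. The toy sketch below mirrors that pipeline with trivial stand-ins for each step, purely to show the order and data flow; none of the stand-in logic reflects the actual implementation.

```python
def synthesize_singing_response(request_text: str) -> dict:
    """Toy sketch of steps S22-S29; every step is reduced to a trivial stand-in."""
    params = {"tempo": 60}                                   # S22: read singing synthesis parameters
    lyrics = request_text.split() or ["la"]                  # S23: "generate" lyrics (one word per note)
    melody = [60, 62, 64, 65]                                # S24: "generate" a melody (MIDI note numbers)
    while len(melody) < len(lyrics):                         # S25: reconcile note count with lyric count
        melody.append(melody[-1])
    melody = melody[:len(lyrics)]
    segment_db = "default_singer"                            # S26: select a segment database
    vocal = list(zip(lyrics, melody))                        # S27: "render" the singing as (word, pitch) pairs
    accompaniment = [note - 12 for note in melody]           # S28: generate a trivial accompaniment
    return {"vocal": vocal, "accompaniment": accompaniment,  # S29: combine singing and accompaniment
            "segment_db": segment_db, "params": params}

print(synthesize_singing_response("what a sunny day it is"))
```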
FIG. 11 is a sequence chart illustrating the operation of the voice response system 1 related to the singing synthesis function 52. When it detects an event that triggers singing synthesis, the detection unit 521 requests singing synthesis from the singing generation unit 522 (step S201). The singing synthesis request includes the user's identifier. When singing synthesis is requested, the singing generation unit 522 queries the storage unit 516 for the user's preferences (step S202); this query includes the user identifier. Upon receiving the query, the storage unit 516 reads from the classification table 5161 the preference information corresponding to the user identifier included in the query and outputs it to the singing generation unit 522 (step S203). The singing generation unit 522 further queries the emotion estimation unit 512 for the user's emotion (step S204); this query also includes the user identifier. Upon receiving the query, the emotion estimation unit 512 outputs that user's emotion information to the singing generation unit 522 (step S205).
In step S206, the singing generation unit 522 selects the source of the lyrics. The source of the lyrics is determined according to the input voice and is, broadly speaking, either the processing unit 510 or the classification table 5161. A singing synthesis request output from the processing unit 510 to the singing generation unit 522 may or may not include lyrics (or lyric material). Lyric material is a character string that cannot form lyrics by itself but forms lyrics when combined with other lyric material. A case in which the singing synthesis request includes lyrics is, for example, a case in which the AI's response itself ("Tomorrow's weather will be fine", etc.) is given a melody and output as the response voice. Since the singing synthesis request is generated by the processing unit 510, the source of the lyrics can also be said to be the processing unit 510. Furthermore, since the processing unit 510 may acquire content from the content providing unit 60, the source of the lyrics can also be said to be the content providing unit 60. The content providing unit 60 is, for example, a server that provides news, a server that provides weather information, or a server that has a database recording the lyrics of existing songs. Although only one content providing unit 60 is shown in the figure, a plurality of content providing units 60 may exist. When the singing synthesis request includes lyrics, the singing generation unit 522 selects the request as the source of the lyrics. When the request does not include lyrics (for example, when the instruction given by the input voice does not specify the content of the lyrics, such as "sing something"), the singing generation unit 522 selects the classification table 5161 as the source of the lyrics.
In step S207, the singing generation unit 522 requests the selected source to provide lyric material. The example here assumes that the classification table 5161, that is, the storage unit 516, has been selected as the source. In this case, the request includes the user identifier and that user's emotion information. Upon receiving the request for lyric material, the storage unit 516 extracts from the classification table 5161 the lyric material corresponding to the user identifier and emotion information included in the request (step S208) and outputs the extracted lyric material to the singing generation unit 522 (step S209).
Having acquired the lyric material, the singing generation unit 522 requests the lyrics generation unit 5222 to generate lyrics (step S210). This request includes the lyric material acquired from the source. When lyrics generation is requested, the lyrics generation unit 5222 generates lyrics using the lyric material (step S211), for example by combining several pieces of lyric material. Alternatively, each source may store the lyrics of an entire song, in which case the lyrics generation unit 5222 may select, from the lyrics stored by the source, the lyrics of one song to be used for singing synthesis. The lyrics generation unit 5222 outputs the generated lyrics to the singing generation unit 522 (step S212).
In step S213, the singing generation unit 522 requests the melody generation unit 5221 to generate a melody. This request includes the user's preference information and information specifying the number of sounds in the lyrics, namely the number of characters, moras, or syllables of the generated lyrics. When melody generation is requested, the melody generation unit 5221 generates a melody according to the preference information included in the request (step S214). Specifically, for example, this proceeds as follows. The melody generation unit 5221 can access a database (hereinafter "melody database", not shown) of melody material, for example note sequences about two or four bars long, or information sequences obtained by subdividing note sequences into musical elements such as changes in rhythm and pitch. The melody database is stored, for example, in the storage unit 516 and records attributes of each melody, such as music information indicating a suitable mood or lyrics and the composer's name. The melody generation unit 5221 selects, from the material recorded in the melody database, one or more pieces of material that match the preference information included in the request, and combines the selected material to obtain a melody of the desired length. The melody generation unit 5221 outputs information specifying the generated melody (for example, sequence data such as MIDI) to the singing generation unit 522 (step S215).
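Step S214 can be pictured as filtering the melody database by the preference attributes and chaining matching fragments until the required length is reached. The tag-based matching and the material format below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class MelodyMaterial:
    notes: list[int]   # MIDI note numbers for a short (e.g. 2- or 4-bar) fragment
    mood: str          # attribute: mood / tune character the fragment suits
    composer: str      # attribute: composer name

MELODY_DB = [
    MelodyMaterial([60, 62, 64, 65], mood="happy", composer="A"),
    MelodyMaterial([67, 65, 64, 62], mood="happy", composer="B"),
    MelodyMaterial([57, 59, 60, 59], mood="sad", composer="A"),
]

def generate_melody(preferred_mood: str, min_notes: int) -> list[int]:
    """Pick materials matching the preference and chain them until the melody is long enough."""
    candidates = [m for m in MELODY_DB if m.mood == preferred_mood] or MELODY_DB
    melody: list[int] = []
    i = 0
    while len(melody) < min_notes:
        melody.extend(candidates[i % len(candidates)].notes)
        i += 1
    return melody[:min_notes]

print(generate_melody("happy", 10))
```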
In step S216, the singing generation unit 522 requests the melody generation unit 5221 to modify the melody, or requests the lyrics generation unit 5222 to modify the lyrics. One purpose of this modification is to make the number of sounds in the lyrics (for example, the number of moras) match the number of notes in the melody. For example, when the lyrics have fewer moras than the melody has notes (too few syllables), the singing generation unit 522 requests the lyrics generation unit 5222 to increase the number of characters in the lyrics. Conversely, when the lyrics have more moras than the melody has notes (too many syllables), the singing generation unit 522 requests the melody generation unit 5221 to increase the number of notes in the melody. This figure describes an example in which the lyrics are modified: in step S217, the lyrics generation unit 5222 modifies the lyrics in response to the request. When the melody is modified instead, the melody generation unit 5221 modifies it, for example, by dividing notes to increase the number of notes. The lyrics generation unit 5222 or the melody generation unit 5221 may also adjust the lyrics and melody so that the phrase boundaries of the lyrics coincide with the phrase boundaries of the melody. The lyrics generation unit 5222 outputs the modified lyrics to the singing generation unit 522 (step S218).
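The reconciliation of steps S216/S217 amounts to equalizing the mora (or syllable) count of the lyrics with the note count of the melody. The sketch below shows a melody-side adjustment that duplicates notes (a stand-in for splitting them) and truncates any excess; the system's actual counting and splitting rules are not specified, so this is only an assumption.

```python
def match_melody_to_lyrics(melody: list[int], mora_count: int) -> list[int]:
    """Adjust the note count so it equals the mora count of the lyrics."""
    melody = list(melody)
    i = 0
    while len(melody) < mora_count:      # too few notes: duplicate (stand-in for note splitting)
        melody.insert(i + 1, melody[i])
        i = (i + 2) % len(melody)
    return melody[:mora_count]           # too many notes: truncate (simplification)

print(match_melody_to_lyrics([60, 64, 67], 5))           # -> 5 notes
print(match_melody_to_lyrics([60, 62, 64, 65, 67], 3))   # -> 3 notes
```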
Upon receiving the lyrics, the singing generation unit 522 selects the segment database 5162 to be used for singing synthesis (step S219). The segment database 5162 is selected, for example, according to the attributes of the user related to the event that triggered the singing synthesis. Alternatively, it may be selected according to the content of that event, or according to the user's preference information recorded in the classification table 5161. The singing generation unit 522 synthesizes speech segments extracted from the selected segment database 5162 according to the lyrics and melody obtained so far, yielding synthesized singing data (step S220). Note that the classification table 5161 may also record information indicating the user's preferences regarding singing techniques such as changes of voice color, timing delays ("tame"), pitch scoops ("shakuri"), and vibrato, and the singing generation unit 522 may refer to this information to synthesize singing that reflects techniques matching the user's preferences. The singing generation unit 522 outputs the generated synthesized singing data to the synthesis unit 524 (step S221).
The singing generation unit 522 further requests the accompaniment generation unit 523 to generate an accompaniment (step S222). This request includes information indicating the melody used in the singing synthesis. The accompaniment generation unit 523 generates an accompaniment according to the melody included in the request (step S223), using a well-known technique for automatically attaching an accompaniment to a melody. When data indicating the chord progression of the melody (hereinafter "chord progression data") is recorded in the melody database, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data. Alternatively, when chord progression data intended for accompanying the melody is recorded in the melody database, the accompaniment generation unit 523 may use that data. Further alternatively, the accompaniment generation unit 523 may store several pieces of accompaniment audio data in advance and read out one that matches the chord progression of the melody. The accompaniment generation unit 523 may also refer to the classification table 5161, for example to determine the mood of the accompaniment, and generate an accompaniment that matches the user's preferences (an example of a determination unit). The accompaniment generation unit 523 outputs the generated accompaniment data to the synthesis unit 524 (step S224).
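When chord progression data is available, one simple way to realize step S223 is to emit one block chord per chord symbol. The degree-to-pitch mapping and the event format below are illustrative assumptions.

```python
# Map chord symbols (degrees in C major) to MIDI pitch sets; an illustrative voicing choice.
CHORD_TONES = {
    "I": [48, 52, 55], "IV": [53, 57, 60], "V": [55, 59, 62],
    "VIm": [57, 60, 64], "IIIm": [52, 55, 59],
}

def generate_accompaniment(chord_progression: list[str], beats_per_chord: int = 4):
    """Return a list of (pitches, duration_in_beats) events, one block chord per symbol."""
    return [(CHORD_TONES[c], beats_per_chord) for c in chord_progression]

print(generate_accompaniment(["I", "V", "VIm", "IIIm", "IV", "I", "IV", "V"]))
```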
Upon receiving the synthesized singing data and the accompaniment data, the synthesis unit 524 combines the synthesized singing and the accompaniment (step S225). In combining them, the singing and the accompaniment are synchronized by aligning the playback start position and the tempo. Synthesized singing data with accompaniment is thus obtained, and the synthesis unit 524 outputs it.
The example described here first generates the lyrics and then generates a melody to match them. However, the voice response system 1 may instead generate the melody first and then generate lyrics to match the melody. Also, although the example outputs the singing after it has been combined with an accompaniment, no accompaniment need be generated and the singing alone may be output (that is, a cappella). Furthermore, although the example first synthesizes the singing and then generates an accompaniment to match it, the accompaniment may be generated first and the singing synthesized to match the accompaniment.
4. Response function
FIG. 12 illustrates the functional configuration of the voice response system 1 related to the response function 53. As functional elements related to the response function 53, the voice response system 1 includes a voice analysis unit 511, an emotion estimation unit 512, and a content decomposition unit 531. Descriptions of elements shared with the learning function 51 and the singing synthesis function 52 are omitted below. The content decomposition unit 531 decomposes a piece of content into a plurality of partial contents. Content here means the information output as the response voice, specifically, for example, music, news, recipes, or teaching material (sports instruction, instrument instruction, learning drills, quizzes).
FIG. 13 is a flowchart illustrating the operation of the voice response system 1 related to the response function 53. In step S31, the voice analysis unit 511 identifies the content to be played back. The content is identified, for example, according to the user's input voice: the voice analysis unit 511 analyzes the input voice and identifies the content whose playback it instructs. In one example, when the input voice "Tell me a hamburger recipe" is given, the voice analysis unit 511 instructs the processing unit 510 to provide a "hamburger recipe". The processing unit 510 accesses the content providing unit 60 and acquires text data describing the "hamburger recipe". The data acquired in this way is identified as the content to be played back, and the processing unit 510 notifies the content decomposition unit 531 of the identified content.
In step S32, the content decomposition unit 531 decomposes the content into a plurality of partial contents. In one example, a "hamburger recipe" consists of several steps (cutting the ingredients, mixing the ingredients, shaping, frying, and so on), and the content decomposition unit 531 decomposes the text of the "hamburger recipe" into four partial contents: a "cutting the ingredients" step, a "mixing the ingredients" step, a "shaping" step, and a "frying" step. The positions at which the content is split may be determined automatically, for example by AI. Alternatively, markers indicating boundaries may be embedded in the content in advance, and the content may be split at the positions of those markers.
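The marker-based variant of step S32 is straightforward to sketch: the content text is split at pre-embedded boundary markers. The marker character chosen below is arbitrary.

```python
def split_content(text: str, marker: str = "|") -> list[str]:
    """Split a content text into partial contents at pre-embedded boundary markers."""
    return [part.strip() for part in text.split(marker) if part.strip()]

recipe = "Cut the ingredients | mix the ingredients | shape the patties | fry them"
print(split_content(recipe))
# ['Cut the ingredients', 'mix the ingredients', 'shape the patties', 'fry them']
```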
In step S33, the content decomposition unit 531 identifies, from the plurality of partial contents, the one partial content to be processed (an example of a specifying unit). The target partial content is the partial content to be played back and is determined according to its position within the original content. In the "hamburger recipe" example, the content decomposition unit 531 first identifies the "cutting the ingredients" step as the target partial content; the next time step S33 is performed, it identifies the "mixing the ingredients" step. The content decomposition unit 531 notifies the content modification unit 532 of the identified partial content.
In step S34, the content modification unit 532 modifies the target partial content. The specific modification method is defined according to the content. For example, the content modification unit 532 does not modify content such as news, weather information, or recipes. For teaching material or quiz content, on the other hand, it replaces the portion to be hidden as a question with another sound (for example, humming, "la la la", or a beep). In doing so, the content modification unit 532 uses a replacement character string that has the same number of moras or syllables as the character string being replaced. The content modification unit 532 outputs the modified partial content to the singing generation unit 522.
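For quiz-style content, the replacement must preserve the mora or syllable count so the masked lyrics still fit the melody. The sketch below uses a crude vowel-group count as a stand-in syllable counter; it is not the counting method used by the system.

```python
import re

def count_syllables(text: str) -> int:
    """Very rough syllable count: one per vowel group (illustrative only)."""
    return max(1, len(re.findall(r"[aeiouAEIOU]+", text)))

def mask_answer(sentence: str, answer: str, filler_syllable: str = "la") -> str:
    """Replace the answer with 'la la ...' having the same (rough) syllable count."""
    filler = " ".join([filler_syllable] * count_syllables(answer))
    return sentence.replace(answer, filler)

print(mask_answer("The capital of France is Paris", "Paris"))
# The capital of France is la la
```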
In step S35, the singing generation unit 522 synthesizes singing from the modified partial content. The singing voice generated by the singing generation unit 522 is ultimately output from the input/output device 10 as the response voice. After outputting the response voice, the voice response system 1 waits for the user's response (step S36). In step S36, the voice response system 1 may output singing or speech prompting the user to respond (for example, "Are you done?"). The voice analysis unit 511 determines the next processing according to the user's response. When a response prompting playback of the next partial content is input (S36: next), the voice analysis unit 511 moves the processing to step S33. A response prompting playback of the next partial content is, for example, an utterance such as "next step", "done", or "finished". When any other response is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop outputting the voice.
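The wait-and-branch behaviour of steps S35/S36 reduces to a loop over partial contents that advances on a "next"-type reply and stops on anything else. The trigger phrases below come from the examples in the text; everything else is a simplified stand-in.

```python
NEXT_PHRASES = {"next step", "done", "finished"}   # replies that advance to the next part

def play_content(partial_contents: list[str], replies: list[str]) -> list[str]:
    """Sing each part, then check the user's (simulated) reply before moving on."""
    sung = []
    for part, reply in zip(partial_contents, replies):
        sung.append(f"[sung] {part}")              # stand-in for singing synthesis and output
        if reply.lower() not in NEXT_PHRASES:      # any other reply stops playback (S36: end)
            break
    return sung

steps = ["cut the ingredients", "mix the ingredients", "shape the patties", "fry them"]
print(play_content(steps, ["done", "next step", "stop the song"]))
```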
In step S37, the processing unit 510 stops, at least temporarily, the output of the synthesized voice of the partial content. In step S38, the processing unit 510 performs processing according to the user's input voice. The processing in step S38 includes, for example, stopping playback of the current content, performing a keyword search instructed by the user, and starting playback of other content. For example, when a response such as "Please stop the song", "That's enough", or "The end" is input, the processing unit 510 stops playback of the current content. When a question-type response such as "How do I cut into strips?" or "What is aglio e olio?" is input, the processing unit 510 acquires information for answering the user's question from the content providing unit 60 and outputs the answer as voice; this answer may be spoken voice rather than singing. When a response instructing playback of other content, such as "Play a song by XX", is input, the processing unit 510 acquires the instructed content from the content providing unit 60 and plays it back.
The example above decomposes content into a plurality of partial contents and determines the next processing for each partial content according to the user's reaction. However, content need not be decomposed into partial contents; it may be output as it is, either as spoken voice or as a singing voice that uses the content as its lyrics. The voice response system 1 may decide, according to the user's input voice or the content to be output, whether to decompose the content into partial contents or to output it as it is.
5. Operation examples
Several specific operation examples are described below. Although not explicitly stated in each case, each operation example is based on at least one of the learning function, the singing synthesis function, and the response function described above. The following operation examples all use Japanese, but the language is not limited to Japanese and may be any language.
5-1. Operation example 1
FIG. 14 illustrates operation example 1 of the voice response system 1. The user requests playback of a song with the input voice "Play 'Sakura Sakura' (song title) by Kazutaro Sato (performer name)". The voice response system 1 searches the music database according to this input voice and plays back the requested song. At this time, the voice response system 1 updates the classification table using the user's emotion at the time the input voice was given and the analysis result of the song. The classification table is updated every time playback of a song is requested. As the number of times the user requests the voice response system 1 to play music increases (that is, as the cumulative usage time of the voice response system 1 increases), the classification table comes to reflect that user's preferences more closely.
5-2. Operation example 2
FIG. 15 illustrates operation example 2 of the voice response system 1. The user requests singing synthesis with the input voice "Sing me something fun". The voice response system 1 performs singing synthesis according to this input voice, referring to the classification table and generating the lyrics and melody from the information recorded there. A song reflecting the user's preferences can therefore be created automatically.
5-3. Operation example 3
FIG. 16 illustrates operation example 3 of the voice response system 1. The user requests weather information with the input voice "What's the weather today?". In this case, as the answer to the request, the processing unit 510 accesses the weather information server of the content providing unit 60 and acquires text indicating today's weather (for example, "Sunny all day today"). The processing unit 510 outputs a singing synthesis request including the acquired text to the singing generation unit 522, which performs singing synthesis using the text included in the request as the lyrics. As the answer to the input voice, the voice response system 1 outputs a singing voice in which "Sunny all day today" is given a melody and an accompaniment.
5-4. Operation example 4
FIG. 17 illustrates operation example 4 of the voice response system 1. Before the illustrated exchange begins, the user has been using the voice response system 1 for two weeks and has often played love songs, so the classification table records information indicating that this user likes love songs. The voice response system 1 asks the user questions to obtain hints for generating lyrics, such as "Where would be a good place to meet?" and "Which season would be good?", and generates lyrics using the user's answers. Because the usage period is still only two weeks, the classification table does not yet sufficiently reflect the user's preferences, and its association with emotions is also insufficient. As a result, even though the user actually prefers ballads, the system may generate a rock-style song instead.
5-5. Operation example 5
FIG. 18 illustrates operation example 5 of the voice response system 1. This example continues from operation example 4, with the cumulative usage period now at one and a half months. Compared with operation example 4, the classification table reflects the user's preferences better, and the synthesized singing matches those preferences. The user can thus experience the responses of the voice response system 1, which were incomplete at first, gradually changing to suit his or her tastes.
5-6. Operation example 6
FIG. 19 illustrates operation example 6 of the voice response system 1. The user requests the "recipe" content for "hamburger" with the input voice "Can you tell me a hamburger recipe?". Given that "recipe" content should proceed to the next step only after the current step is finished, the voice response system 1 decides to decompose the content into partial contents and play it back in a manner in which the next processing is determined according to the user's reaction.
The hamburger recipe is decomposed step by step, and each time the singing for one step has been output, the voice response system 1 outputs voice prompting the user's response, such as "Are you done?" or "Finished?". When the user gives an input voice instructing the singing of the next step, such as "Done" or "What's next?", the voice response system 1 outputs the singing of the next step in response. When the user asks a question such as "How do I chop an onion?", the voice response system 1 outputs, in response, singing about "chopping an onion". When that singing is finished, the voice response system 1 resumes singing from where it left off in the hamburger recipe.
The voice response system 1 may output the singing voice of another piece of content between the singing voice of a first partial content and the singing voice of the second partial content that follows it. For example, the voice response system 1 outputs, between the singing voices of the first and second partial contents, a singing voice synthesized so as to have a time length corresponding to a matter indicated by a character string included in the first partial content. Specifically, when the first partial content indicates that a 20-minute wait will occur, such as "Now simmer the ingredients for 20 minutes", the voice response system 1 synthesizes and outputs a 20-minute song to be played while the ingredients are simmering.
The voice response system 1 may also output a singing voice synthesized using a second character string corresponding to a matter indicated by a first character string included in the first partial content, at a timing corresponding to the time length indicated by the first character string, after outputting the singing voice of the first partial content. Specifically, when the first partial content indicates that a 20-minute wait will occur, such as "Now simmer the ingredients for 20 minutes", the voice response system 1 may output the singing voice "Simmering is finished" (an example of the second character string) 20 minutes after outputting the first partial content. Alternatively, in the same example, when half of the waiting time (10 minutes) has elapsed, the system may sing, in a rap style, something like "10 minutes left until simmering is done".
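Both waiting-time behaviours described above (filling the wait with a song of matching length, and singing a reminder at the midpoint or end of the wait) start by extracting a duration from the first character string and scheduling output relative to it. The duration pattern handled below and the message texts are illustrative assumptions.

```python
import re
from typing import List, Optional, Tuple

def extract_wait_minutes(text: str) -> Optional[int]:
    """Pull a waiting time such as '20 minutes' out of a partial content."""
    m = re.search(r"(\d+)\s*minutes?", text)
    return int(m.group(1)) if m else None

def schedule_follow_ups(partial_content: str) -> List[Tuple[float, str]]:
    """Return (delay_in_minutes, text_to_sing) pairs relative to the end of the first part."""
    minutes = extract_wait_minutes(partial_content)
    if minutes is None:
        return []
    return [
        (minutes / 2, f"{minutes // 2} minutes left until simmering is done"),
        (float(minutes), "Simmering is finished"),
    ]

print(schedule_follow_ups("Now simmer the ingredients for 20 minutes"))
```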
5-7. Operation example 7
FIG. 21 illustrates operation example 7 of the voice response system 1. The user requests the content of a "procedure manual" with the input voice "Can you read out the procedure manual for the process at the factory?". Given that "procedure manual" content serves to check the user's memory, the voice response system 1 decides to decompose the content into partial contents and play it back in a manner in which the next processing is determined according to the user's reaction.
For example, the voice response system 1 splits the procedure manual at random positions into a plurality of partial contents. After outputting the singing of one partial content, it waits for the user's reaction. For example, for a procedure whose content is "After pressing switch A, press switch B when the reading of meter B drops to 10 or below", the voice response system 1 sings the part "After pressing switch A" and waits for the user's reaction. When the user says something, the voice response system 1 outputs the singing of the next partial content. At this point, the system may also change the singing speed of the next partial content depending on whether the user was able to say it correctly: if the user said the next partial content correctly, the voice response system 1 increases its singing speed; if not, it decreases the speed.
5-8. Operation example 8
FIG. 22 illustrates operation example 8 of the voice response system 1. Operation example 8 is an example of dementia prevention for elderly users; that the user is elderly is set in advance, for example through user registration. The voice response system 1 starts singing an existing song, for example in response to a user instruction, and pauses the singing at a random position or at a predetermined position (for example, just before the chorus). At that point it issues a message such as "Hmm, I can't remember" or "I've forgotten", behaving as if it had forgotten the lyrics, and waits for the user's response. When the user says something, the voice response system 1 treats (part of) what the user said as the correct lyrics and resumes singing from the continuation of those words. When the user says something, the voice response system 1 may also output a response such as "Thank you". When a predetermined time elapses while waiting for the user's response, the voice response system 1 may output speech such as "I remembered" and resume singing from where it paused.
5-9. Operation example 9
FIG. 23 illustrates operation example 9 of the voice response system 1. The user requests singing synthesis with the input voice "Sing me something fun", and the voice response system 1 performs singing synthesis according to this input voice. The segment database used for singing synthesis is selected, for example, according to the character chosen at user registration (for example, when a male character has been chosen, a segment database of a male singer is used). During the song, the user gives an input voice instructing a change of segment database, such as "Change to a female voice". The voice response system 1 switches the segment database used for singing synthesis according to the user's input voice. The switch may be made while the voice response system 1 is outputting the singing voice, or while it is waiting for the user's response as in operation examples 7 and 8.
The voice response system 1 may have a plurality of segment databases that record phonemes pronounced by a single singer (or speaker) in different singing styles or voice colors. For a given phoneme, the voice response system 1 may combine, that is, add, segments extracted from these databases at a certain ratio (usage ratio), and it may determine this usage ratio according to the user's reaction. Specifically, when two segment databases are recorded for a singer, one with a normal voice and one with a sweet voice, the system raises the usage ratio of the sweet-voice segment database when the user says "with a sweeter voice", and raises it further when the user says "with a much sweeter voice".
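Combining segments from a "normal" and a "sweet" database at a usage ratio that shifts with the user's requests can be sketched as a weighted sum of the two spectra. The step by which each request raises the ratio is an arbitrary assumption.

```python
import numpy as np

def blend_segments(normal: np.ndarray, sweet: np.ndarray, sweet_ratio: float) -> np.ndarray:
    """Weighted combination of the same phoneme taken from two segment databases."""
    sweet_ratio = min(max(sweet_ratio, 0.0), 1.0)
    return (1.0 - sweet_ratio) * normal + sweet_ratio * sweet

ratio = 0.3                                     # initial usage ratio of the "sweet voice" database
for request in ["with a sweeter voice", "with a much sweeter voice"]:
    ratio = min(1.0, ratio + 0.2)               # each request nudges the ratio up (assumed step size)
    print(request, "->", ratio)

normal_spec = np.zeros((4, 8))
sweet_spec = np.ones((4, 8))
print(blend_segments(normal_spec, sweet_spec, ratio)[0, 0])   # 0.7 after the two requests
```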
6. Modifications
The present invention is not limited to the embodiments described above, and various modifications are possible. Some modifications are described below; two or more of the following modifications may be used in combination.
In this description, a singing voice means a voice that includes singing in at least a part of it; it may include accompaniment-only portions without singing, or spoken-voice-only portions. For example, when content is decomposed into a plurality of partial contents, at least one partial content need not include singing. Singing may also include rap or the recitation of poetry.
The embodiments describe an example in which the learning function 51, the singing synthesis function 52, and the response function 53 are related to one another, but each of these functions may be provided independently. For example, the classification table obtained by the learning function 51 may be used to learn the user's preferences in a music distribution system that distributes music. Alternatively, the singing synthesis function 52 may perform singing synthesis using a classification table entered manually by the user. At least some of the functional elements of the voice response system 1 may also be omitted; for example, the voice response system 1 need not have the emotion estimation unit 512.
The assignment of functions to the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be changed; for example, the voice analysis unit 511 and the emotion estimation unit 512 may be implemented in the input/output device. The relative arrangement of the input/output device 10, the response engine 20, and the singing synthesis engine 30 may also be changed; for example, the singing synthesis engine 30 may be placed between the input/output device 10 and the response engine 20 and perform singing synthesis for those responses output from the response engine 20 that are judged to require it. The content used in the voice response system 1 may also be stored in a local device, such as the input/output device 10 or a device capable of communicating with the input/output device 10.
The hardware realizing the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be, for example, a smartphone or a tablet terminal. User input to the voice response system 1 is not limited to voice and may be given via a touch screen, a keyboard, or a pointing device. The input/output device 10 may also have a human presence sensor, and the voice response system 1 may use this sensor to control its operation depending on whether the user is nearby. For example, when it is determined that the user is not near the input/output device 10, the voice response system 1 may refrain from outputting voice (not return a reply). However, depending on the content of the voice, the voice response system 1 may output it regardless of whether the user is near the input/output device 10; for example, the voice announcing the remaining waiting time described in the second half of operation example 6 may be output regardless of whether the user is nearby. To detect whether the user is near the input/output device 10, sensors other than a human presence sensor, such as a camera or a temperature sensor, may be used, and a plurality of sensors may be used in combination.
The flowcharts and sequence charts illustrated in the embodiments are examples. In them, the order of processing may be changed, some processing may be omitted, and new processing may be added.
The programs executed in the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be provided stored on a recording medium such as a CD-ROM or semiconductor memory, or may be provided by download via a network such as the Internet.
This application is based on Japanese Patent Application No. 2017-116831 filed on June 14, 2017, which is incorporated herein by reference.
The present invention is useful because it can output a singing voice in accordance with interaction with the user.
DESCRIPTION OF SYMBOLS: 1…voice response system, 10…input/output device, 20…response engine, 30…singing synthesis engine, 51…learning function, 52…singing synthesis function, 53…response function, 60…content providing unit, 101…microphone, 102…input signal processing unit, 103…output signal processing unit, 104…speaker, 105…CPU, 106…sensor, 107…motor, 108…network IF, 201…CPU, 202…memory, 203…storage, 204…communication IF, 301…CPU, 302…memory, 303…storage, 304…communication IF, 510…processing unit, 511…voice analysis unit, 512…emotion estimation unit, 513…music analysis unit, 514…lyrics extraction unit, 515…preference analysis unit, 516…storage unit, 521…detection unit, 522…singing generation unit, 523…accompaniment generation unit, 524…synthesis unit, 5221…melody generation unit, 5222…lyrics generation unit, 531…content decomposition unit, 532…content modification unit

Claims (22)

1. A method for outputting a singing voice, comprising:
decomposing content into a plurality of partial contents;
identifying a first partial content from the plurality of partial contents;
synthesizing a first singing voice using a character string included in the first partial content;
outputting the first singing voice;
accepting a user's reaction to the first singing voice;
identifying, in response to the user's reaction, a second partial content related to the first partial content;
synthesizing a second singing voice using a character string included in the second partial content; and
outputting the second singing voice.
2. The method for outputting a singing voice according to claim 1, further comprising:
determining, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
3. The method for outputting a singing voice according to claim 2, wherein the element includes a parameter, melody, or tempo of the singing synthesis, or an arrangement of an accompaniment in the singing voice.
4. The method for outputting a singing voice according to any one of claims 1 to 3, wherein the first singing voice and the second singing voice are synthesized using segments recorded in at least one database selected from a plurality of databases, the method further comprising:
selecting, in response to the user's reaction, a database to be used for singing synthesis using the character string included in the second partial content.
5. The method for outputting a singing voice according to claim 4, wherein the first singing voice and the second singing voice are synthesized using segments recorded in a plurality of databases selected from the plurality of databases, and a plurality of databases are selected in the step of selecting the database, the method further comprising:
determining a usage ratio of the plurality of selected databases in accordance with the user's reaction.
6. The method for outputting a singing voice according to any one of claims 1 to 5, further comprising:
replacing a part of the character string included in the first partial content with another character string,
wherein in the step of synthesizing the first singing voice, the first singing voice is synthesized using the character string included in the first partial content, a part of which has been replaced with the other character string.
  7.  The method for outputting a singing voice according to claim 6, wherein the other character string and the character string to be replaced have the same number of syllables or the same number of morae.
  8.  The method for outputting a singing voice according to any one of claims 1 to 7, further comprising:
     replacing, in response to the user's reaction, a part of the second partial content with another character string,
     wherein, in the step of synthesizing the second singing voice, the second singing voice is synthesized using the character string included in the second partial content, a part of which has been replaced with the other character string.
  9.  The method for outputting a singing voice according to any one of claims 1 to 8, further comprising:
     synthesizing a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content; and
     outputting the third singing voice between the first singing voice and the second singing voice.
  10.  The method for outputting a singing voice according to any one of claims 1 to 9, further comprising:
     synthesizing a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content; and
     outputting, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length according to the matter indicated by the first character string.
  11.  The method for outputting a singing voice according to any one of claims 1 to 10, wherein the content includes a character string.
  12.  A voice response system comprising:
     a decomposition unit that decomposes content into a plurality of partial contents;
     a specifying unit that specifies a first partial content from the plurality of partial contents;
     a synthesis unit that synthesizes a first singing voice using a character string included in the first partial content;
     an output unit that outputs the first singing voice; and
     an accepting unit that accepts a user's reaction to the first singing voice,
     wherein the specifying unit specifies, in response to the user's reaction, a second partial content related to the first partial content,
     the synthesis unit synthesizes a second singing voice using a character string included in the second partial content, and
     the output unit outputs the second singing voice.
  13.  The voice response system according to claim 12, further comprising a determination unit that determines, in response to the user's reaction, an element used for singing synthesis using the character string included in the second partial content.
  14.  The voice response system according to claim 13, wherein the element includes a parameter, a melody, or a tempo of the singing synthesis, or an arrangement of an accompaniment in the singing voice.
  15.  The voice response system according to any one of claims 12 to 14, wherein the first singing voice and the second singing voice are synthesized using segments recorded in at least one database selected from a plurality of databases, the system further comprising a selection unit that selects, in response to the user's reaction, a database to be used for singing synthesis using the character string included in the second partial content.
  16.  The voice response system according to claim 15, wherein the first singing voice and the second singing voice are synthesized using segments recorded in a plurality of databases selected from the plurality of databases, the selection unit selects a plurality of databases, and the determination unit determines a usage ratio of the plurality of databases according to the user's reaction.
  17.  The voice response system according to any one of claims 12 to 16, further comprising a replacement unit that replaces a part of the character string included in the first partial content with another character string, wherein the synthesis unit synthesizes the first singing voice using the character string included in the first partial content, a part of which has been replaced with the other character string.
  18.  The voice response system according to claim 17, wherein the other character string and the character string to be replaced have the same number of syllables or the same number of morae.
  19.  The voice response system according to any one of claims 12 to 18, wherein, in response to the user's reaction, a part of the second partial content is replaced with another character string, and the synthesis unit synthesizes the second singing voice using the character string included in the second partial content, a part of which has been replaced with the other character string.
  20.  The voice response system according to any one of claims 12 to 19, wherein the synthesis unit synthesizes a third singing voice so as to have a time length corresponding to a matter indicated by the character string included in the first partial content, and the third singing voice is output between the first singing voice and the second singing voice.
  21.  The voice response system according to any one of claims 12 to 20, wherein the synthesis unit synthesizes a fourth singing voice using a second character string corresponding to a matter indicated by a first character string included in the first partial content, and the output unit outputs, after the output of the first singing voice, the fourth singing voice at a timing corresponding to a time length according to the matter indicated by the first character string.
  22.  The voice response system according to any one of claims 12 to 21, wherein the content includes a character string.
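The following non-limiting sketches illustrate, in Python, how some of the steps recited in the claims above might be realized. They are assumptions made for explanation only; every function name, reaction category, and value in them is hypothetical and is not defined by this application. The first sketch walks through the overall flow of claim 1: decompose content, sing a first part, accept the user's reaction, then select and sing a related second part.

```python
from typing import List

def decompose(content: str) -> List[str]:
    # Decomposing step: split the content into partial contents
    # (here, naively, into sentence-like chunks).
    return [p.strip() for p in content.replace("\n", " ").split(". ") if p.strip()]

def synthesize_singing(text: str) -> bytes:
    # Placeholder for a singing-synthesis engine call; returns audio data.
    return f"<sung:{text}>".encode("utf-8")

def output_audio(audio: bytes) -> None:
    # Placeholder for playback through a speaker.
    print("playing", audio.decode("utf-8"))

def accept_user_reaction() -> str:
    # Placeholder: the reaction could come from speech recognition,
    # a button press, or a sensor; here it is simply typed.
    return input("reaction (more/stop)? ").strip().lower()

def respond_with_singing(content: str) -> None:
    parts = decompose(content)                    # decompose content into partial contents
    first = parts[0]                              # identify the first partial content
    output_audio(synthesize_singing(first))       # synthesize and output the first singing voice
    reaction = accept_user_reaction()             # accept the user's reaction
    if reaction == "more" and len(parts) > 1:
        second = parts[1]                         # identify a related second partial content
        output_audio(synthesize_singing(second))  # synthesize and output the second singing voice
```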
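For claims 2 and 3, an element such as a synthesis parameter, melody, tempo, or accompaniment arrangement is determined from the user's reaction. The mapping below is purely illustrative; the reaction categories and preset values are assumptions, not something the application prescribes.

```python
# Hypothetical mapping from a classified user reaction to singing-synthesis
# elements (tempo, melody key, accompaniment arrangement).
ELEMENT_PRESETS = {
    "positive": {"tempo_bpm": 132, "melody_key": "C major", "accompaniment": "upbeat pop"},
    "neutral":  {"tempo_bpm": 108, "melody_key": "G major", "accompaniment": "light acoustic"},
    "negative": {"tempo_bpm": 84,  "melody_key": "A minor", "accompaniment": "sparse piano"},
}

def determine_elements(reaction: str) -> dict:
    # Fall back to the neutral preset for unrecognized reactions.
    return ELEMENT_PRESETS.get(reaction, ELEMENT_PRESETS["neutral"])
```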
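For claims 4 and 5, segments are drawn from several singing-voice databases, and the usage ratio between the selected databases follows the user's reaction. The sketch below picks a database per syllable according to such a ratio; the database names and ratio values are assumptions.

```python
import random

# Hypothetical per-syllable database selection following a usage ratio
# derived from the user's reaction (e.g. 70% voice_a, 30% voice_b).
def choose_usage_ratio(reaction: str) -> dict:
    if reaction == "positive":
        return {"voice_a": 0.7, "voice_b": 0.3}
    return {"voice_a": 0.3, "voice_b": 0.7}

def pick_segment_database(ratios: dict) -> str:
    names = list(ratios)
    return random.choices(names, weights=[ratios[n] for n in names], k=1)[0]

# Roughly 70% of ten syllables would use segments from voice_a for a positive reaction.
ratios = choose_usage_ratio("positive")
plan = [pick_segment_database(ratios) for _ in range(10)]
```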
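Claims 6 and 7 allow part of a character string to be replaced with another string only when the two have the same number of syllables or morae, which keeps the substituted lyric singable to the same melody. The mora-counting rule below (contracted small kana do not add a mora) is a common approximation assumed for this sketch, not a definition given by the application.

```python
# Hypothetical mora-count check for lyric substitution (kana input assumed).
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")  # characters that do not add a mora

def count_morae(kana: str) -> int:
    return sum(1 for ch in kana if ch not in SMALL_KANA)

def can_substitute(original: str, candidate: str) -> bool:
    # Permit the substitution only when the mora counts match.
    return count_morae(original) == count_morae(candidate)

# Example: can_substitute("きょう", "あした") is False (2 morae vs 3 morae).
```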
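Claims 9 and 10 cover singing whose duration or timing follows a matter indicated in the first partial content, for example singing a filler for the length of a recipe step and then a follow-up phrase once that time has elapsed. The duration parsing and scheduling below are illustrative assumptions.

```python
import re
import threading
from typing import Callable, Optional

# Hypothetical scheduling of a follow-up singing voice after a duration
# mentioned in the first character string (e.g. "boil for 3 minutes").
def extract_minutes(text: str) -> Optional[float]:
    match = re.search(r"(\d+)\s*minute", text)
    return float(match.group(1)) if match else None

def schedule_followup_singing(first_text: str, sing: Callable[[str], None]) -> None:
    minutes = extract_minutes(first_text)
    if minutes is not None:
        # Output the fourth singing voice once the indicated time has passed.
        threading.Timer(minutes * 60, sing, args=("Time is up!",)).start()

# Example: schedule_followup_singing("Boil the pasta for 3 minutes", print)
```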
PCT/JP2018/022816 2017-06-14 2018-06-14 Method for outputting singing voice, and voice response system WO2018230670A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017116831A JP6977323B2 (en) 2017-06-14 2017-06-14 Singing voice output method, voice response system, and program
JP2017-116831 2017-06-14

Publications (1)

Publication Number Publication Date
WO2018230670A1 true WO2018230670A1 (en) 2018-12-20

Family

ID=64660282

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/022816 WO2018230670A1 (en) 2017-06-14 2018-06-14 Method for outputting singing voice, and voice response system

Country Status (2)

Country Link
JP (2) JP6977323B2 (en)
WO (1) WO2018230670A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6594577B1 (en) * 2019-03-27 2019-10-23 株式会社博報堂Dyホールディングス Evaluation system, evaluation method, and computer program.
JP2020177534A (en) * 2019-04-19 2020-10-29 京セラドキュメントソリューションズ株式会社 Transmission type wearable terminal


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3307283B2 (en) * 1997-06-24 2002-07-24 ヤマハ株式会社 Singing sound synthesizer
JPH11175082A (en) * 1997-12-10 1999-07-02 Toshiba Corp Voice interaction device and voice synthesizing method for voice interaction
JP2001043126A (en) 1999-07-27 2001-02-16 Tadamitsu Ryu Robot system
JP2002221978A (en) 2001-01-26 2002-08-09 Yamaha Corp Vocal data forming device, vocal data forming method and singing tone synthesizer
JP2002258872A (en) 2001-02-27 2002-09-11 Casio Comput Co Ltd Voice information service system and voice information service method
KR20090046003A (en) * 2007-11-05 2009-05-11 주식회사 마이크로로봇 Robot toy apparatus
WO2013190963A1 (en) 2012-06-18 2013-12-27 エイディシーテクノロジー株式会社 Voice response device
JP6166889B2 (en) 2012-11-15 2017-07-19 株式会社Nttドコモ Dialog support apparatus, dialog system, dialog support method and program
JP6596843B2 (en) 2015-03-02 2019-10-30 ヤマハ株式会社 Music generation apparatus and music generation method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1097529A (en) * 1996-05-29 1998-04-14 Yamaha Corp Versification supporting device, method therefor and storage medium
JPH11219195A (en) * 1998-02-04 1999-08-10 Atr Chino Eizo Tsushin Kenkyusho:Kk Interactive mode poem reading aloud system
JP2003131548A (en) * 2001-10-29 2003-05-09 Mk Denshi Kk Language learning device
JP2006227589A (en) * 2005-01-20 2006-08-31 Matsushita Electric Ind Co Ltd Device and method for speech synthesis
JP2015022293A (en) * 2013-07-24 2015-02-02 カシオ計算機株式会社 Voice output controller, electronic device, and voice output control program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI749447B (en) * 2020-01-16 2021-12-11 國立中正大學 Synchronous speech generating device and its generating method
WO2022113914A1 (en) * 2020-11-25 2022-06-02 ヤマハ株式会社 Acoustic processing method, acoustic processing system, electronic musical instrument, and program
CN113488007A (en) * 2021-07-07 2021-10-08 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113488007B (en) * 2021-07-07 2024-06-11 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP2022017561A (en) 2022-01-25
JP2019003000A (en) 2019-01-10
JP7424359B2 (en) 2024-01-30
JP6977323B2 (en) 2021-12-08

Similar Documents

Publication Publication Date Title
JP7424359B2 (en) Information processing device, singing voice output method, and program
JP7363954B2 (en) Singing synthesis system and singing synthesis method
US10629179B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US11468870B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
US11854518B2 (en) Electronic musical instrument, electronic musical instrument control method, and storage medium
TWI497484B (en) Performance evaluation device, karaoke device, server device, performance evaluation system, performance evaluation method and program
EP3675122B1 (en) Text-to-speech from media content item snippets
KR101274961B1 (en) music contents production system using client device.
EP3759706B1 (en) Method, computer program and system for combining audio signals
US6737572B1 (en) Voice controlled electronic musical instrument
JP2005342862A (en) Robot
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
JP2007264569A (en) Retrieval device, control method, and program
Lesaffre et al. The MAMI Query-By-Voice Experiment: Collecting and annotating vocal queries for music information retrieval
JP4808641B2 (en) Caricature output device and karaoke device
JP2016071187A (en) Voice synthesis device and voice synthesis system
JP2022065554A (en) Method for synthesizing voice and program
Bresin et al. Rule-based emotional coloring of music performance
JP2017062313A (en) Karaoke device, karaoke system and program
JPH1165594A (en) Musical sound generating device and computer-readable record medium recorded with musical sound generating and processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18817438

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18817438

Country of ref document: EP

Kind code of ref document: A1