WO2019044534A1

WO2019044534A1 - Information processing device and information processing method

Info

Publication number: WO2019044534A1
Application number: PCT/JP2018/030487
Authority: WO
Inventors: 角川　元輝; 政明星野; 亜由美中川
Original assignee: ソニー株式会社
Priority date: 2017-08-31
Filing date: 2018-08-17
Publication date: 2019-03-07
Also published as: JPWO2019044534A1

Abstract

The present invention pertains to an information processing device and an information processing method capable of producing more human-like speech by constraining output to the seven-and-five syllable meter. By providing an information processing device equipped with a processing unit that converts inputted text information into text in seven-and-five syllable meter and outputs the resulting text, more human-like speech can be produced under the seven-and-five syllable meter constraint. The present technology can be applied, for example, to a system that generates a response to a user's utterance, such as a dialog system, or to a system that reads text information aloud by speech synthesis, such as a news program production system or a digital signage system.

Description

INFORMATION PROCESSING APPARATUS AND INFORMATION PROCESSING METHOD

The present technology relates to an information processing apparatus and an information processing method, and more particularly, to an information processing apparatus and an information processing method capable of producing more human-like speech under a seven-five meter (seven-and-five syllable meter) constraint.

In recent years, products and services for conducting dialogue by voice have become widespread. For example, Patent Document 1 discloses an electronic toy (home robot) that performs an arbitrary operation according to an external voice.

Patent Document 1: JP 2002-307354 A

However, in a dialogue by voice, the response is often monotonous, such as "Tomorrow's weather in Yokohama is fine.", and clearly differs from a dialogue between humans.

As a result, some users may not be entertained, may not enjoy the dialogue, may not retain the content in memory, or may not want to continue using the system. A technology that enables more human-like speech has therefore been required.

The present technology has been made in view of such a situation, and makes it possible to produce more human-like speech under the seven-five meter constraint.

An information processing apparatus according to one aspect of the present technology is an information processing apparatus including a processing unit that converts input text information into seven-five meter and outputs the converted text information.

An information processing method according to one aspect of the present technology is an information processing method in which an information processing apparatus converts input text information into seven-five meter and outputs the converted text information.

In the information processing apparatus and the information processing method according to one aspect of the present technology, input text information is converted into seven-five meter and output.

The information processing apparatus according to one aspect of the present technology may be an independent apparatus or an internal block constituting one apparatus.

According to one aspect of the present technology, it is possible to produce more human-like speech under the seven-five meter constraint.

Note that the effects described here are not necessarily limiting, and any of the effects described in the present disclosure may be obtained.

A block diagram showing an example of the hardware configuration of an information processing apparatus to which the present technology is applied.
A block diagram showing an example of the software configuration of an information processing apparatus to which the present technology is applied.
A flowchart showing the flow of dialogue processing.
A flowchart showing the flow of dialogue processing.
A diagram showing an example of the context information DB.
A diagram showing an example of the user feedback information DB.
A diagram showing an example of the ending list.
A diagram showing an example of the meaning-invariant word list.
A diagram showing an example of the onomatopoeia list.
A diagram showing an example of the synonym dictionary.
A diagram showing an example of a news program production system to which the present technology is applied.
A diagram showing an example of a digital signage system to which the present technology is applied.
A block diagram showing an example of the configuration of a dialogue system.

Hereinafter, embodiments of the present technology will be described with reference to the drawings. The description will be made in the following order.

1. First embodiment: dialogue system
2. Second embodiment: news program production system
3. Third embodiment: digital signage system
4. Modified example

<1. First embodiment>

(Hardware configuration example)
FIG. 1 is a block diagram illustrating an example of a hardware configuration of an information processing apparatus to which the present technology is applied.

The information processing apparatus 10 in FIG. 1 is, for example, a speaker connectable to a network, and is also called a smart speaker or a home agent. In addition to playing music, this type of speaker can, for example, carry out voice dialogue with the user and perform voice operation of devices such as lighting fixtures and air conditioners.

The information processing apparatus 10 is not limited to a speaker, and may be configured as, for example, a mobile device such as a smartphone or a mobile phone, or an electronic device such as a tablet computer, a personal computer, a television receiver, or a game machine.

In FIG. 1, the information processing apparatus 10 includes a CPU 101, a ROM 102, a RAM 103, an information access unit 104, a hard disk 105, an operation I/F 106, an operation unit 107, an audio input I/F 108, a microphone 109, a video input I/F 110, a camera 111, an audio output I/F 112, a speaker 113, a video output I/F 114, a display 115, a communication I/F 116, and a bus 117.

The CPU (Central Processing Unit) 101, the ROM (Read Only Memory) 102, and the RAM (Random Access Memory) 103 constitute the control unit 100. The control unit 100 operates as the central processing device of the information processing apparatus 10, performing various arithmetic processing and controlling the operation of each unit.

The information access unit 104 includes, for example, a data write / read circuit. The information access unit 104 writes various data to the hard disk 105 or reads various data recorded on the hard disk 105 under the control of the control unit 100 via the bus 117. The hard disk 105 is an HDD (Hard Disk Drive), and is configured as a large-capacity recording device.

The operation I/F 106 includes, for example, an operation interface circuit. The operation unit 107 includes, for example, a button, a keyboard, a mouse, and the like. The operation I/F 106 supplies an operation signal corresponding to the user's operation on the operation unit 107 to the control unit 100 via the bus 117.

The audio input I/F 108 includes, for example, an audio input interface circuit. The microphone 109 is a device (sound collector) that converts external sound into an electrical signal. The audio input I/F 108 supplies an audio signal corresponding to the sound collected by the microphone 109 to the control unit 100, the audio output I/F 112, and the like via the bus 117.

The video input I/F 110 includes, for example, a video input interface circuit. The camera 111 includes an image sensor, a signal processing circuit, and the like, and captures a subject to generate a captured image. The video input I/F 110 supplies image data of the captured image generated by the camera 111 to the control unit 100, the information access unit 104, the video output I/F 114, and the like via the bus 117.

The audio output I/F 112 includes, for example, an audio output interface circuit. The speaker 113 is a device that converts an electrical signal into physical vibration to produce sound. Under the control of the control unit 100 via the bus 117, the audio output I/F 112 outputs a sound corresponding to the audio signal from the speaker 113.

The video output I/F 114 includes, for example, a video output interface circuit. The display 115 includes, for example, a liquid crystal display or an organic EL display. Under the control of the control unit 100 via the bus 117, the video output I/F 114 causes the display 115 to display various information (for example, characters and images) according to the video signal.

The communication I/F 116 includes, for example, a communication interface circuit. Under the control of the control unit 100 via the bus 117, the communication I/F 116 accesses a server (not shown) connected to the Internet 30 to exchange various data.

Note that, although the image sensor of the camera 111 is illustrated as the sensor in the configuration shown in FIG. 1, other sensors may also be provided. Sensor information obtained as a result of sensing by the various sensors is supplied to the control unit 100 via the bus 117 and processed.

Here, the sensors can include, for example, a magnetic sensor that detects the magnitude and direction of a magnetic field, an acceleration sensor that detects acceleration, a gyro sensor that detects angle (posture), angular velocity, and angular acceleration, a proximity sensor that detects a nearby object, and a biometric sensor that detects biometric information such as fingerprints, irises, and pulses.

Furthermore, the sensors may include sensors for measuring the surrounding environment, such as a temperature sensor for detecting temperature, a humidity sensor for detecting humidity, and an ambient light sensor for detecting ambient brightness. In addition to the sensor information obtained from the above-mentioned sensors, the sensor information may include various other information, such as position information calculated from a GPS (Global Positioning System) signal or the like and time information measured by a clock unit.

(Example of software configuration)
FIG. 2 is a block diagram illustrating an example of a software configuration of an information processing apparatus to which the present technology is applied.

That is, in the information processing apparatus 10 shown in FIG. 1, the CPU 101 loads a program stored in a recording device such as the ROM 102 or the hard disk 105 into the RAM 103 and executes it, thereby realizing the functions of the information processing unit 150 shown in FIG. 2 and performing various types of processing.

In FIG. 2, the information processing unit 150 includes a speech recognition unit 151, an utterance intention understanding unit 152, an application/service execution unit 153, a response generation unit 154, a context acquisition unit 155, a seven-five meter conversion unit 156, a speech synthesis unit 157, and a user feedback collection unit 158.

In the following description, among the response sentences for the user, the response sentence before conversion to seven-five meter by the seven-five meter conversion unit 156 is written as "response sentence (before conversion)", and the response sentence after conversion to seven-five meter by the seven-five meter conversion unit 156 is written as "response sentence (after conversion)", to distinguish the two. Further, a response sentence generated as a candidate for the response sentence (after conversion) is written as "response sentence (post-conversion candidate)".

Further, the term "seven-five meter" refers to a form in which groups of five sounds and seven sounds are repeated, such as "5, 7, 5" or "5, 7, 5, 7, 7". For example, a mora (beat) can be used as the unit of sound here.

Note that the number of syllables is the number of vowels when the word is written in Roman letters, and differs from the number of moras. That is, in addition to the syllables, the mora count also includes the geminate consonant (small "tsu", "っ"), the moraic nasal "n" ("ん"), and the long-vowel mark ("ー"). For example, the word "Zurich" ("チューリッヒ") has three syllables but five moras. However, these definitions of syllables and moras are merely examples, and other definitions may be adopted.
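As an illustration of the mora-based unit described above, the following is a minimal sketch that counts moras in a kana reading and checks a phrase sequence against a meter pattern. The function names and the simplified rule (small ya/yu/yo-type kana merge with the preceding kana; every other kana, including "っ", "ん", and "ー", counts as one mora) are assumptions made for this example and are not taken from the disclosure.

```python
# Small kana that combine with the preceding kana and therefore add no mora.
# Assumption: a simplified rule set chosen for illustration only.
SMALL_COMBINING = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_moras(kana: str) -> int:
    """Count moras in a hiragana/katakana reading.

    The input is assumed to contain only kana (no kanji or punctuation).
    'っ', 'ん' and 'ー' each count as one mora, as in the text above.
    """
    return sum(1 for ch in kana if ch not in SMALL_COMBINING and not ch.isspace())


def matches_meter(phrases, pattern=(5, 7, 5)) -> bool:
    """Return True if the mora counts of the phrases equal the pattern."""
    return tuple(count_moras(p) for p in phrases) == tuple(pattern)


if __name__ == "__main__":
    print(count_moras("チューリッヒ"))  # Zurich: チ(ュ)・ー・リ・ッ・ヒ
    print(matches_meter(["ふるいけや", "かわずとびこむ", "みずのおと"]))
```

Counting on the kana reading rather than on the written surface form is why a reading would need to be attached to each word before any meter check of this kind.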

The speech recognition unit 151 executes speech recognition processing that converts the audio signal corresponding to the user's utterance, input via the microphone 109, into text information, and supplies the resulting speech recognition result to the utterance intention understanding unit 152. The speech recognition processing uses a database for speech-to-text conversion.

Based on the speech recognition result supplied from the speech recognition unit 151, the utterance intention understanding unit 152 executes analysis processing (processing to understand the user's intention) on the text information corresponding to the user's utterance. The utterance intention understanding unit 152 supplies the intention understanding result obtained from the analysis processing to the application/service execution unit 153.

Here, for example, intentions such as “weather confirmation” and “schedule confirmation” are estimated as the user's intention, and “date and time” and “place” are analyzed as the parameters (slots).

The application/service execution unit 153 executes an application or service that matches the user's intention based on the intention understanding result supplied from the utterance intention understanding unit 152, and supplies the execution result to the response generation unit 154.

Here, for example, when "weather confirmation" is estimated as the user's intention and "date and time" and "place" are analyzed as its parameters, information on the weather at the target date, time, and place can be obtained by passing the "date and time" and "place" parameters as arguments to a weather confirmation API (Application Programming Interface) provided by an external service.
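The hand-off from an intention understanding result to an application or service call can be sketched as follows. This is only an illustrative sketch: the `IntentResult` type, the handler table, and `fake_weather_api` are hypothetical stand-ins introduced for this example, and the real weather confirmation API of the external service is not specified in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IntentResult:
    """Hypothetical container for an intention understanding result."""
    intent: str                                            # e.g. "weather confirmation"
    slots: Dict[str, str] = field(default_factory=dict)    # e.g. date and time, place


def fake_weather_api(date_time: str, place: str) -> str:
    """Stand-in for an external weather API (assumption: always reports 'fine')."""
    return f"The weather in {place} for {date_time} is fine."


# Map each intent to a handler that receives the analyzed slots as arguments.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "weather confirmation": lambda s: fake_weather_api(s["date and time"], s["place"]),
}


def execute(result: IntentResult) -> str:
    """Run the application or service that matches the estimated intent."""
    handler = HANDLERS.get(result.intent)
    if handler is None:
        raise ValueError(f"no application or service for intent: {result.intent}")
    return handler(result.slots)


if __name__ == "__main__":
    req = IntentResult("weather confirmation",
                       {"date and time": "tomorrow", "place": "Yokohama"})
    print(execute(req))
```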

The response generation unit 154 generates a response sentence (before conversion) for the user based on the execution result of the application or service supplied from the application/service execution unit 153, and supplies the response sentence to the seven-five meter conversion unit 156.

The context acquisition unit 155 acquires context information on the basis of time information, position information, an analysis result of a captured image captured by the camera 111, and the like, and supplies the context information to the seven-five meter conversion unit 156.

The seven-five meter conversion unit 156 performs seven-five meter conversion processing on the response sentence (before conversion) supplied from the response generation unit 154, and supplies the resulting response sentence (after conversion) to the speech synthesis unit 157. At this time, by using the context information supplied from the context acquisition unit 155 and the user feedback information supplied from the user feedback collection unit 158, the seven-five meter conversion unit 156 can convert the response sentence (before conversion) into seven-five meter more appropriately and obtain the response sentence (after conversion).

The speech synthesis unit 157 synthesizes speech based on the response sentence (after conversion) supplied from the seven-five meter conversion unit 156 and outputs the synthesized speech from the speaker 113. That is, the speech synthesis unit 157 has a function (TTS: Text To Speech) for reading aloud the text information of the response sentence (after conversion).

The user feedback collection unit 158 collects information on the user's reaction when the response sentence (after conversion) is output to the user, and supplies the information as user feedback information to the seven-five meter conversion unit 156.

The configuration of the information processing apparatus 10 has been described above.

(Flow of dialogue processing)
Next, with reference to the flowcharts of FIG. 3 to FIG. 4, the overall flow of the interactive processing performed by the information processing unit 150 shown in FIG. 2 will be described.

Here, the input to the information processing unit 150 is either an audio signal supplied from the microphone 109 in response to the user's utterance, or text information corresponding to an operation signal supplied from the operation unit 107, such as a keyboard.

The information processing unit 150 executes the process of step S11 when an audio signal according to the user's utterance is input. In step S11, the speech recognition unit 151 executes speech recognition processing to convert the audio signal corresponding to the user's utterance into text information.

Thus, the text information obtained as a result of the process (speech recognition processing) in step S11, or the text information corresponding to the user's input operation, is input to the utterance intention understanding unit 152.

In step S12, the utterance intention understanding unit 152 understands the user's intention by executing analysis processing on the text information input thereto.

For example, when the user utters "Tell me the weather of Yokohama tomorrow?", the intention "weather confirmation" is estimated as the user's intention, and the "date and time" of "Tomorrow" and the "place" of "Yokohama" are obtained as its parameters.

In step S13, the application / service execution unit 153 executes an application or service matching the user's intention based on the intention understanding result obtained in the process of step S12.

Here, for example, when "weather confirmation" is estimated as the user's intention and the "date and time" of "Tomorrow" and the "place" of "Yokohama" are analyzed as its parameters, information on "Tomorrow's weather in Yokohama" can be obtained by passing the "date and time" ("Tomorrow") and the "place" ("Yokohama") as arguments to the weather confirmation API provided by an external service.

In step S14, the response generation unit 154 generates a response sentence (before conversion) for the user based on the execution result of the application or service obtained in the process of step S13.

Here, for example, a response sentence (before conversion) that is “Tomorrow's Yokohama weather is fine” is generated based on the information on “Tomorrow's Yokohama weather” obtained from the weather confirmation API.

In step S15, the context acquisition unit 155 acquires context information based on time information, position information, an analysis result of a captured image captured by the camera 111, and the like.

Here, the context information includes, for example, current environment information related to the utterance, such as the time zone and place of the user's utterance, the speaker, any companions present, and the atmosphere of the place, and can be temporarily recorded in the context information DB 201. However, the current environment information may also be processed directly without being recorded in the context information DB 201.

In step S16, the seven-five meter conversion unit 156 executes the seven-five meter conversion process on the response sentence (before conversion) generated in the process of step S14, thereby converting the response sentence (before conversion) into seven-five meter and generating the response sentence (after conversion).

In this seven-five meter conversion process, a seven-five meter conversion process suited to the context information acquired in the process of step S15 is selected with reference to the user feedback information collected in the process of step S18 described later, and the selected seven-five meter conversion process is executed on the response sentence (before conversion).

Specifically, for example, the user feedback information includes, for each candidate generation pattern used to generate seven-five meter candidates, a score derived from past environment information and the user's reaction, and the context information includes the current environment information. The seven-five meter conversion unit 156 can then select a candidate generation pattern for which the score associated with past environment information that is the same as or similar to the current environment information is equal to or higher than a threshold.

At this time, when a plurality of candidate generation patterns can be selected, one candidate generation pattern can be selected at random, for example. That is, in the seven-five meter conversion process, the selected candidate generation pattern (a combination of seven-five meter candidate generation processes) is executed in sequence so that the response sentence (before conversion) becomes seven-five meter.

The details of the seven-five meter conversion process will be described after the overall flow of this dialogue processing, but the candidate generation patterns may include, for example, the following seven-five meter candidate generation processes.

(A) Particle removal: remove a particle to generate a seven-five meter candidate
(B) Unnecessary-part removal: remove a semantically unnecessary part to generate a seven-five meter candidate
(C) Ending addition: add a sentence ending to generate a seven-five meter candidate
(D) Meaning-invariant word addition: add a word that does not change the meaning to generate a seven-five meter candidate
(E) Repetition addition: repeat a word to generate a seven-five meter candidate
(F) Onomatopoeia addition: add an onomatopoeia to generate a seven-five meter candidate
(G) Synonym substitution: replace a word with a synonym to generate a seven-five meter candidate
(H) A combination of (A) to (G) above

Note that the ending list 211 is used when generating seven-five meter candidates by the ending addition of (C). The meaning-invariant word list 212 is used when generating seven-five meter candidates by the meaning-invariant word addition of (D). Furthermore, the onomatopoeia list 213 is used when generating seven-five meter candidates by the onomatopoeia addition of (F), and the synonym dictionary 214 is used when generating seven-five meter candidates by the synonym substitution of (G).

In this way, the response sentence (before conversion) is converted into seven-five meter, and the response sentence (after conversion) is generated. Then, when the seven-five meter conversion result obtained in the process of step S16 is to be output as speech, the response sentence (after conversion) is output to the speech synthesis unit 157.

In step S17, the speech synthesis unit 157 synthesizes speech based on the response sentence (after conversion) obtained in the process of step S16, and outputs the speech from the speaker 113 via the audio output I/F 112. As a result, the speaker 113 outputs a sound corresponding to the seven-five meter response.

On the other hand, when the seven-five meter conversion result obtained in the process of step S16 is to be output as an image, the response sentence (after conversion) is output to the display 115 via the video output I/F 114. As a result, text corresponding to the seven-five meter response is displayed on the screen of the display 115.

More specifically, for example, the above-mentioned response sentence (before conversion) "Tomorrow's weather in Yokohama is fine." is converted into seven-five meter and output as a response sentence (after conversion) such as "Yokohama's tomorrow's weather is fine."

Note that the user feedback collection unit 158 collects information on the user's reaction when the response sentence (after conversion) is output to the user (S18), and the user feedback information obtained at that time is recorded in the user feedback information DB 202. This user feedback information is used in the above-described seven-five meter conversion process.

The above has described the overall flow of interactive processing.

(Flow of the conversion process of seventy-five)
Next, among the above-described dialogue processing shown in FIG. 3 to FIG. 4, the details of the seven-five meter conversion process in step S16 will be described.

In step S111, the seven-five meter conversion unit 156 executes language analysis processing.

In this language analysis processing, morphological analysis processing is performed on the input response sentence (before conversion), and the morphological analysis result of the response sentence (before conversion) is obtained. In this morphological analysis processing, the response sentence (before conversion) is divided into a sequence of morphemes (words) and the part of speech of each morpheme is determined; in addition, a kana reading is assigned to each morpheme.

Contents of language analysis processing (S111 in FIG. 4)
Input (IN): Response sentence (before conversion)
Output (OUT): Morphological analysis result
Processing: Morphological analysis processing

In step S112, the seven-five meter conversion unit 156 performs candidate generation selection processing.

In this candidate generation selection processing, for example, a case close to the current context information is identified with reference to the user feedback information stored in the user feedback information DB 202, and one combination can be selected at random from among the combinations of candidate generation processes whose feedback score is equal to or higher than a threshold.

Moreover, in this candidate generation selection processing, a combination of candidate generation processes that is not present in the user feedback information can also be selected at random, and whether or not to select such a combination can itself be decided at random. If there is no case where the feedback score is equal to or higher than the threshold, the response sentence (before conversion) may be output as it is, without executing the seven-five meter conversion process.

FIG. 5 is a diagram showing an example of the context information DB 201.

In FIG. 5, the context information DB 201 stores a value for each item of context information. In the example of FIG. 5, the stored context information is a time zone of "Sunday 21 o'clock", a place of "home", a speaker of "Yamada Hiroshi", a companion of "family", and an atmosphere of "fun".

Here, for the "time zone", for example, time information measured by a clock unit incorporated in the information processing apparatus 10 may be used. Further, for the "place", position information calculated from a GPS (Global Positioning System) signal or the like may be used.

Further, the "speaker", "companion", and "atmosphere" may be determined by analyzing the image data of the subject captured by the camera 111 and using the analysis result. For example, for the "atmosphere", "fun" or "sad" can be determined from the facial expression of the speaker or each companion obtained from the image analysis result, and the atmosphere of the place can be determined from the total of these determination results.

In addition, various techniques for reading human facial expressions, emotions, and the like have already been studied, and an API (Application Programming Interface) that recognizes faces, images, and voices using such known techniques can be used here. For example, by sending a captured image that includes the face of the speaker or a companion to a service that provides this kind of API, information on the expressions and emotions of the speaker or companion, such as pleasure, surprise, anger, sadness, contempt, or disgust, can be obtained.

Furthermore, although the context information is managed here in a database as the context information DB 201, it does not necessarily have to be managed in a database. However, among the current environment information, information that rarely changes over time, such as the place, is better managed in a database.
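One way to picture a context record and the "same as or similar to" comparison between current and past environment information is sketched below. The `ContextInfo` type mirrors the items of the FIG. 5 example, but the similarity measure (the fraction of items that match exactly) is purely an assumption made for illustration; the text does not define how similarity is computed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ContextInfo:
    """One context record with the items from the FIG. 5 example."""
    time_zone: str
    place: str
    speaker: str
    companion: str
    atmosphere: str


def similarity(a: ContextInfo, b: ContextInfo) -> float:
    """Fraction of context items with identical values (assumed measure)."""
    names = [f.name for f in fields(ContextInfo)]
    return sum(getattr(a, n) == getattr(b, n) for n in names) / len(names)


if __name__ == "__main__":
    current = ContextInfo("Sunday 21 o'clock", "home", "Yamada Hiroshi", "family", "fun")
    past = ContextInfo("Sunday 21 o'clock", "home", "Yamada Hiroshi", "family", "sad")
    print(similarity(current, past))  # four of five items match
```

Making the record immutable (`frozen=True`) also makes it hashable, so it could serve as a lookup key if past records were stored per context.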

FIG. 6 is a diagram showing an example of the user feedback information DB 202.

In FIG. 6, the user feedback information DB 202 stores, for each candidate generation pattern, context information together with a score for the user's reaction. As for how the score value is calculated, for example, the value is "+1" when the user's reaction to the output of a response sentence (after conversion) generated using the target candidate generation pattern is good, and "-1" when the user's reaction is poor.

For example, "when the user's reaction is good" is assumed to be a case where the speech recognition result indicates that the speaker or a companion said "that's interesting", or where the image analysis result indicates that the speaker or a companion laughed. On the other hand, "when the user's reaction is poor" is assumed to be a case where the speech recognition result indicates that the speaker or a companion said something like "what is that", or where the image analysis result indicates that the speaker or a companion is angry.

When it is determined from the speech recognition result or the image analysis result that there is no reaction from the user, the score value may be, for example, "0".
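The scoring rule above can be sketched as a small accumulator keyed by context and candidate generation pattern. The assumption that the per-reaction values are summed over time is made here for illustration (the text gives only the +1, -1, and 0 values), and the names `REACTION_DELTA` and `record_feedback` are hypothetical.

```python
from collections import defaultdict

# Score value per observed reaction, following the +1 / -1 / 0 values in the text.
REACTION_DELTA = {"good": 1, "poor": -1, "none": 0}


def record_feedback(scores, context, pattern, reaction):
    """Add the value for one reaction to the running score of (context, pattern).

    Assumption: scores accumulate across outputs; `context` must be hashable.
    """
    key = (context, pattern)
    scores[key] += REACTION_DELTA[reaction]
    return scores[key]


if __name__ == "__main__":
    scores = defaultdict(int)
    ctx = ("Sunday 21 o'clock", "home", "Yamada Hiroshi")
    pat = "ending addition & onomatopoeia addition"
    for reaction in ["good", "good", "none", "poor"]:
        record_feedback(scores, ctx, pat, reaction)
    print(scores[(ctx, pat)])  # 1 + 1 + 0 - 1
```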

In the example shown in FIG. 6, for the candidate generation pattern "ending addition & onomatopoeia addition", a score of "+129" is stored together with context information such as "time zone = Sunday 21 o'clock", "place = home", and "speaker = Yamada Hiroshi".

Also, in the example of FIG. 6, various types of context information and scores such as "+86", "+29", and "-42" are stored for candidate generation patterns such as "meaning-invariant word addition", "particle removal & repetition addition & synonym substitution", and "particle removal & unnecessary-part removal".

Here, for example, when "+80" is set as the threshold to be compared with the score, "ending addition & onomatopoeia addition" and "meaning-invariant word addition" can be selected as candidate generation patterns. In this case, for example, one candidate generation pattern can be selected at random from "ending addition & onomatopoeia addition" and "meaning-invariant word addition".

In this manner, when a plurality of candidate generation patterns can be selected, one of the candidate generation patterns satisfying the condition is selected at random and the conversion to seven-five meter is performed according to it, so that seven-five meter response sentences of various patterns can be presented to the user.

Note that, for example, when a candidate generation pattern with the highest score is selected, there is a high possibility that only a specific candidate generation pattern is repeatedly selected, so candidate generation patterns are randomly selected as described above. However, depending on the operation of the dialogue system, it is not limited to random, and for example, the one with the highest score may be selected. Furthermore, in this example, an example of selecting from among candidate generation patterns having a threshold value or more is shown, but for example, candidate generation patterns smaller than the threshold value may be randomly selected.
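The threshold-then-random selection described above can be sketched as follows, using the pattern names and scores from the FIG. 6 example. The pairing of "+29" and "-42" with the last two patterns follows the order in which they are listed and is an assumption; the function names are hypothetical. Returning `None` when no pattern reaches the threshold corresponds to outputting the response sentence (before conversion) unchanged.

```python
import random

# Scores from the FIG. 6 example; the last two pairings are assumed from list order.
FIG6_SCORES = {
    "ending addition & onomatopoeia addition": 129,
    "meaning-invariant word addition": 86,
    "particle removal & repetition addition & synonym substitution": 29,
    "particle removal & unnecessary-part removal": -42,
}


def eligible_patterns(scores, threshold):
    """Candidate generation patterns whose score is at or above the threshold."""
    return [p for p, s in scores.items() if s >= threshold]


def select_pattern(scores, threshold, rng=random):
    """Pick one eligible pattern at random, or None if none qualifies."""
    candidates = eligible_patterns(scores, threshold)
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    print(eligible_patterns(FIG6_SCORES, 80))
    print(select_pattern(FIG6_SCORES, 80))
```

Swapping `rng.choice(candidates)` for `max(candidates, key=scores.get)` would give the highest-score variant mentioned above, at the cost of repeatedly choosing the same pattern.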

Contents of candidate generation selection processing (S112 in FIG. 4)
Input (IN): Response sentence (before conversion), morphological analysis result (S111 in FIG. 4), context information (S15 in FIG. 4), user feedback information (S18 in FIG. 4)
Output (OUT): Selection result of candidate generation processing (which candidate generation processing is to be performed)
Processing: Candidate generation selection processing

In step S113, the seven-five meter conversion unit 156 executes seven-five meter candidate generation processing.

In this seven-five meter candidate generation process, one or more of the candidate generation processes that generate seven-five meter candidates by, for example, particle removal, unnecessary part removal, ending addition, semantically invariant word addition, repetition addition, onomatopoeia addition, or synonym substitution are executed according to the selection result obtained in the process of step S112.

The details of the candidate generation processes listed here (particle removal, unnecessary part removal, ending addition, semantically invariant word addition, repetition addition, onomatopoeia addition, and synonym substitution) will be described later as the first to seventh candidate generation processes.

Contents of seven-five meter candidate generation processing (S113 in FIG. 4)
Input (IN): response sentence (before conversion), morpheme analysis result (S111 in FIG. 4), selection result of candidate generation processing (S112 in FIG. 4)
Output (OUT): response sentence (post-conversion candidate)
Processing: seven-five meter candidate generation processing according to the selection result of candidate generation processing

(A) First candidate generation processing

In the first candidate generation process, a response sentence (post-conversion candidate) in seven-five meter is generated by removing a particle included in the response sentence (before conversion) (S113A in FIG. 4).

The following shows an example of the response sentence before conversion and the post-conversion candidate for seven-five meter candidate generation by particle removal. The response sentence (before conversion) is given in three notations (Japanese, romanized, and English), and the response sentence (post-conversion candidate) in two notations (Japanese and romanized).

In the examples, the Japanese notation is labeled "Japanese", the romanized notation "Romaji", and the English notation "English". The same labeling is used for the other response sentence examples described below.

An example of seven-five meter candidate generation by particle removal

Response sentence (before conversion)
(Japanese): The popular izakaya in Matsuyama is Sabasabatei.
(Romaji): matsuyama de ninki no izakaya ha sabasabatei
(English): Popular bar in Matsuyama is Sabasabatei.

Response sentence (post-conversion candidate)
(Japanese): The popular izakaya in Matsuyama, Sabasabatei.
(Romaji): matsuyama de ninki no izakaya sabasabatei

In this example of seven-five meter candidate generation by particle removal, the particle "ha" connecting "izakaya" and "sabasabatei" in the response sentence (before conversion) is removed so that the response sentence (post-conversion candidate) is in seven-five meter.
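The particle removal above can be sketched on a list of morphemes. This is a minimal illustration under stated assumptions: the Morpheme type, the remove_particle function, and the part-of-speech labels are hypothetical stand-ins for the morpheme analysis result of S111, and the romanized surface forms are taken from the example.

```python
from typing import List, NamedTuple

class Morpheme(NamedTuple):
    surface: str  # romanized surface form, as in the example above
    pos: str      # part-of-speech label (labels here are assumptions)

def remove_particle(morphemes: List[Morpheme], surface: str) -> List[Morpheme]:
    """Return a copy of the sentence without particles whose surface matches."""
    return [m for m in morphemes
            if not (m.pos == "particle" and m.surface == surface)]

before = [
    Morpheme("matsuyama", "noun"), Morpheme("de", "particle"),
    Morpheme("ninki", "noun"), Morpheme("no", "particle"),
    Morpheme("izakaya", "noun"), Morpheme("ha", "particle"),
    Morpheme("sabasabatei", "noun"),
]

after = remove_particle(before, "ha")
result = " ".join(m.surface for m in after)
print(result)
```

Only the particle named in the call is removed; the particles "de" and "no" are kept, as in the example.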

(B) Second candidate generation processing

In the second candidate generation process, a semantically unnecessary part included in the response sentence (before conversion) is removed to generate a response sentence (post-conversion candidate) in seven-five meter (S113B in FIG. 4).

The following shows an example of the response sentence before conversion and the post-conversion candidate for seven-five meter candidate generation by unnecessary part removal.

An example of seven-five meter candidate generation by unnecessary part removal

Response sentence (before conversion)
(Japanese): The current time in London has become 8 o'clock in the evening.
(Romaji): rondon no genzai zikoku ha yoru hatizi ni narimasita
(English): The current time in London is 8 o'clock in the evening.

Response sentence (post-conversion candidate)
(Japanese): The current time in London, 8 o'clock in the evening.
(Romaji): rondon no genzai zikoku ha yoru hatizi

In this example of seven-five meter candidate generation by unnecessary part removal, the semantically unnecessary part "ni narimasita" ("has become") in the response sentence (before conversion) is removed so that the response sentence (post-conversion candidate) is in seven-five meter.

(C) Third candidate generation processing

In the third candidate generation process, a response sentence (post-conversion candidate) in seven-five meter is generated by adding a sentence ending to the response sentence (before conversion) using the ending list 211 (S113C in FIG. 4).

FIG. 7 is a diagram showing an example of the ending list 211. In FIG. 7, "*" (asterisk) represents an arbitrary character (character string) or part of speech.

In FIG. 7, the "main notation" is the word added as the ending, and its part of speech is given as the "main part of speech". In a given response sentence, the notation immediately before the main notation is the "pre-notation", and the notation immediately after it is the "post-notation". The part of speech of the pre-notation is given as the "pre-part-of-speech", and that of the post-notation as the "post-part-of-speech".

In the example of FIG. 7, the main notation "yo" is a particle (final particle); an arbitrary character string that is an auxiliary verb (terminal form) is specified as the pre-notation, and an arbitrary character string of an arbitrary part of speech is specified as the post-notation. Likewise, the main notation "ne" is a particle (final particle); an arbitrary character string that is a particle (case particle) is specified as the pre-notation, and an arbitrary character string of an arbitrary part of speech is specified as the post-notation.

At this time, if the preceding morpheme, the main morpheme, and the following morpheme match these conditions, the meaning does not change even if the main morpheme is added, so the main morpheme can be added to the response sentence (before conversion).

Note that the meaning of "*" shown in FIG. 7 and the relationship among the "main notation", "pre-notation", and "post-notation" are the same in the other figures described later (FIG. 8 and FIG. 9).

The following shows examples of response sentences before and after conversion for seven-five meter candidate generation by ending addition.

A first example of seven-five meter candidate generation by ending addition

Response sentence (before conversion)
(Japanese): I sent an email to Mr. Kishi.
(Romaji): kisi san ni me-ru wo sousin simasita
(English): I sent an email to Kishi.

Response sentence (post-conversion candidate)
(Japanese): I sent an email to Mr. Kishi, you know.
(Romaji): kisi san ni me-ru wo sousin simasita yo

In this first example of seven-five meter candidate generation by ending addition, the ending "yo" is added after "simasita" ("sent") in the response sentence (before conversion) so that the response sentence (post-conversion candidate) is in seven-five meter.
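The rule matching described for the ending list 211 can be sketched as follows. This is a minimal sketch rather than the actual data structure of FIG. 7: the EndingRule and Morpheme types and the function names are hypothetical, "simasita" is treated as a single auxiliary verb (terminal form) for simplicity, and "*" is assumed to match the end of the sentence as well as any part of speech.

```python
from typing import List, NamedTuple, Optional

class Morpheme(NamedTuple):
    surface: str
    pos: str

class EndingRule(NamedTuple):
    surface: str   # main notation (the ending to add)
    pos: str       # main part of speech
    pre_pos: str   # required pre-part-of-speech, "*" = anything
    post_pos: str  # required post-part-of-speech, "*" = anything

def pos_matches(pattern: str, pos: Optional[str]) -> bool:
    # "*" matches anything, including the sentence boundary (pos is None).
    return pattern == "*" or pattern == pos

def insertion_points(morphemes: List[Morpheme], rule: EndingRule) -> List[int]:
    """Indices at which the ending may be inserted under the rule."""
    points = []
    for i in range(1, len(morphemes) + 1):
        pre = morphemes[i - 1].pos
        post = morphemes[i].pos if i < len(morphemes) else None
        if pos_matches(rule.pre_pos, pre) and pos_matches(rule.post_pos, post):
            points.append(i)
    return points

def add_ending(morphemes: List[Morpheme], rule: EndingRule, index: int) -> List[Morpheme]:
    """Insert the ending of the rule at the given index."""
    return morphemes[:index] + [Morpheme(rule.surface, rule.pos)] + morphemes[index:]

YO_RULE = EndingRule("yo", "particle (final particle)",
                     "auxiliary verb (terminal form)", "*")
NE_RULE = EndingRule("ne", "particle (final particle)",
                     "particle (case particle)", "*")

sentence = [
    Morpheme("kisi", "noun"), Morpheme("san", "suffix"),
    Morpheme("ni", "particle (case particle)"), Morpheme("me-ru", "noun"),
    Morpheme("wo", "particle (case particle)"), Morpheme("sousin", "noun"),
    Morpheme("simasita", "auxiliary verb (terminal form)"),  # simplified tag
]

yo_points = insertion_points(sentence, YO_RULE)
converted = " ".join(m.surface for m in add_ending(sentence, YO_RULE, yo_points[0]))
print(converted)
```

Under these assumed tags, the "ne" rule would match after the case particles "ni" and "wo" rather than at the end of the sentence; which matching position yields seven-five meter would be decided by a separate mora check.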

A second example of seven-five meter candidate generation by ending addition

Response sentence (before conversion)
(Japanese): Tomorrow's weather in Yokohama is sunny.
(Romaji): yokohama no asita no tenki ha hare desu
(English): The weather in Yokohama is sunny tomorrow.

Response sentence (post-conversion candidate)
(Japanese): Tomorrow's weather in Yokohama is sunny, you see.
(Romaji): yokohama no asita no tenki ha hare nanosa

In this second example of seven-five meter candidate generation by ending addition, the ending "nanosa" is added after "hare" ("sunny") in the response sentence (before conversion) so that the response sentence (post-conversion candidate) is in seven-five meter. In this second example, however, the ending "desu" in the response sentence (before conversion) is removed.

As another example of seven-five meter candidate generation by ending addition, the ending "ne" may, for example, be added to a response sentence (before conversion) to generate a response sentence (post-conversion candidate) in seven-five meter.

(D) Fourth candidate generation processing

In the fourth candidate generation process, a response sentence (post-conversion candidate) in seven-five meter is generated by using the semantic invariant word list 212 to add, to the response sentence (before conversion), a word whose addition does not change the meaning (a semantically invariant word) (S113D in FIG. 4).

FIG. 8 is a diagram showing an example of the semantic invariant word list 212.

In the example of FIG. 8, the main notation "yappari" ("after all") is an adverb, and an arbitrary character string of an arbitrary part of speech is specified as both the pre-notation and the post-notation. Likewise, the main notation "by the way" is a conjunction, and an arbitrary character string of an arbitrary part of speech is specified as both the pre-notation and the post-notation.

The following shows an example of response sentences before and after conversion for seven-five meter candidate generation by semantically invariant word addition.

An example of seven-five meter candidate generation by semantically invariant word addition

Response sentence (before conversion)
(Japanese): I will display a map of Kawasaki.
(Romaji): kawasaki no tizu wo utusi masu
(English): I will display a map of Kawasaki.

Response sentence (post-conversion candidate)
(Japanese): I will display a map of Kawasaki after all.
(Romaji): kawasaki no tizu wo yappari utusi masu

In this example of seven-five meter candidate generation by semantically invariant word addition, the semantically invariant word "yappari" ("after all") is added between "tizu wo" ("the map") and "utusi masu" ("display") in the response sentence (before conversion) so that the response sentence (post-conversion candidate) is in seven-five meter.

(E) Fifth candidate generation processing

In the fifth candidate generation process, a word included in the response sentence (before conversion) is repeated to generate a response sentence (post-conversion candidate) in seven-five meter (S113E in FIG. 4).

The following shows an example of the response sentence before conversion and the post-conversion candidate for seven-five meter candidate generation by repetition addition.

An example of seven-five meter candidate generation by repetition addition

Response sentence (before conversion)
(Japanese): The weather forecast for the day after tomorrow is sunny.
(Romaji): asatte no tenkiyohou ha hare desu
(English): The weather forecast for the day after tomorrow is sunny.

Response sentence (post-conversion candidate)
(Japanese): The weather forecast for the day after tomorrow: sunny, sunny, sunny.
(Romaji): asatte no tenkiyohou ha hare hare hare

In this example of seven-five meter candidate generation by repetition addition, the word "hare" ("sunny") in the response sentence (before conversion) is repeated three times so that the response sentence (post-conversion candidate) is in seven-five meter. In this example, however, the ending "desu" in the response sentence (before conversion) is removed.
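The repetition step, together with the removal of "desu", can be sketched on a whitespace-tokenized romanized sentence. This is a minimal sketch: the function name repeat_word and its parameters are hypothetical, and real input would be the morpheme analysis result rather than a split string.

```python
def repeat_word(tokens, word, times, drop=()):
    """Repeat every occurrence of `word` `times` times and drop the tokens
    listed in `drop` (for example, an ending that is no longer needed)."""
    out = []
    for token in tokens:
        if token in drop:
            continue
        out.extend([token] * times if token == word else [token])
    return out

before = "asatte no tenkiyohou ha hare desu".split()
after = repeat_word(before, word="hare", times=3, drop={"desu"})
result = " ".join(after)
print(result)
```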

(F) Sixth candidate generation processing

In the sixth candidate generation process, an onomatopoeic word is added to the response sentence (before conversion) using the onomatopoeia list 213, thereby generating a response sentence (post-conversion candidate) in seven-five meter (S113F in FIG. 4).

FIG. 9 is a diagram showing an example of the onomatopoeia list 213.

In the example of FIG. 9, the main notation "zyanzyan" is an adverb; an arbitrary character string of an arbitrary part of speech is specified as the pre-notation, and an arbitrary character string that is a verb is specified as the post-notation. Likewise, the main notation "gingin" is an adjectival noun; an arbitrary character string of an arbitrary part of speech is specified as the pre-notation, and an arbitrary character string that is a verb is specified as the post-notation.

The following shows examples of response sentences before and after conversion for seven-five meter candidate generation by onomatopoeia addition.

A first example of seven-five meter candidate generation by onomatopoeia addition

Response sentence (before conversion)
(Japanese): An email has arrived. What would you like to do?
(Romaji): me-ru ga kite imasu dou simasuka
(English): E-mail is coming. What will you do?

Response sentence (post-conversion candidate)
(Japanese): Emails are pouring in, you know. What will you do?
(Romaji): me-ru gane zyanzyan kite imasu dou simasu

In this first example of seven-five meter candidate generation by onomatopoeia addition, the onomatopoeic word "zyanzyan" is added before "kite imasu" ("has arrived") in the response sentence (before conversion) so that the response sentence (post-conversion candidate) is in seven-five meter. In this first example, the ending "ne (final particle)" is also added after "me-ru ga", and the particle "ka (final particle)" is removed.

Note that, because "zyanzyan" has 4 morae, the result here is 5 sounds, 8 sounds, and 5 sounds, but this is regarded as being within the allowable range. The concepts expressed by the preceding and following morphemes may also be used as a condition. For example, in the case of this first example, the condition is that morphemes representing quantity are present before and after "zyanzyan".
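The notion of counting morae and accepting a 5-8-5 result as within the allowable range can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the kana spellings are my own transcriptions of the romanized words in the examples, and the tolerance of one mora per phrase is an assumption, since the text does not state how wide the allowable range is. The counter relies on the general rule that each kana is one mora except small kana such as "ゃ", which merge with the preceding kana, and it expects kana-only input.

```python
# Small kana that combine with the preceding kana and add no mora.
# The small "っ", the long-vowel mark "ー", and "ん" each count as one mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")

def count_morae(kana: str) -> int:
    """Count morae in a kana-only string (spaces are ignored)."""
    return sum(1 for ch in kana if not ch.isspace() and ch not in SMALL_KANA)

def fits_meter(counts, target=(5, 7, 5), tolerance=1):
    """True if every phrase is within `tolerance` morae of the target.
    The default tolerance of 1 is an assumption, not a value from the text."""
    return len(counts) == len(target) and all(
        abs(c - t) <= tolerance for c, t in zip(counts, target))

zyanzyan = count_morae("じゃんじゃん")   # "zyanzyan"
hare_nanosa = count_morae("はれなのさ")  # "hare nanosa"
print(zyanzyan, hare_nanosa, fits_meter((5, 8, 5)))
```

With tolerance=0 the same check would reject the 5-8-5 result, which corresponds to enforcing the meter strictly.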

A second example of seven-five meter candidate generation by onomatopoeia addition

Response sentence (before conversion)
(Japanese): The timer has been set for 2 minutes from now.
(Romaji): ni hungo ni taima setto simasita
(English): Timer set after 2 minutes.

Response sentence (post-conversion candidate)
(Japanese): Timer set for 2 minutes from now, tick-tock.
(Romaji): ni hungo ni taima setto tikutakku

In this second example of seven-five meter candidate generation by onomatopoeia addition, the onomatopoeic word "tikutakku" ("tick-tock") is added after "taima setto" ("timer set") in the response sentence (before conversion) so that the response sentence (post-conversion candidate) is in seven-five meter. In this second example, however, unnecessary part removal is applied to "simasita" in the response sentence (before conversion).

A third example of seven-five meter candidate generation by onomatopoeia addition

Response sentence (before conversion)
(Japanese): The weather forecast for the day after tomorrow is sunny.
(Romaji): asatte no tenkiyohou ha hare desu
(English): The weather forecast for the day after tomorrow is sunny.

Response sentence (post-conversion candidate)
(Japanese): The weather forecast for the day after tomorrow is sunny.
(Romaji): asatte no tenkiyohou ha hare re re re

In this third example of seven-five meter candidate generation by onomatopoeia addition, the response sentence (before conversion) does not fall into seven-five meter when converted as it is, so the "re" contained in the word "hare" ("sunny") is repeated three times so that the response sentence (post-conversion candidate) is in seven-five meter. In this third example, however, unnecessary part removal is applied to "desu" in the response sentence (before conversion). If the meter still cannot be completed by any means, a filler word such as "dadada" may, for example, be added.

(G) Seventh candidate generation processing

In the seventh candidate generation process, the synonym dictionary 214 is used to replace a word included in the response sentence (before conversion) with a synonym, thereby generating a response sentence (post-conversion candidate) in seven-five meter (S113G in FIG. 4).

FIG. 10 is a diagram showing an example of the synonym dictionary 214.

In FIG. 10, "notation 1" and "notation 2" have the same meaning; the part of speech of "notation 1" is given as "part of speech 1", and the part of speech of "notation 2" as "part of speech 2".

In the example of FIG. 10, the noun "ramen" and the noun "Chinese soba" are specified as synonyms. Further, in the example of FIG. 10, the adjective "happy" and the noun "happy" are specified as synonyms.

At this time, the morpheme of notation 1 can be replaced with the morpheme of notation 2. Conversely, the morpheme of notation 2 may be replaced with the morpheme of notation 1.
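Because replacement is allowed in both directions, the synonym dictionary can be sketched as a symmetric lookup. This is a minimal sketch, not the format of FIG. 10: the function names are hypothetical, parts of speech are omitted, the romanized forms are taken from the examples in this section, and the pair "kaimono"/"syoppingu" is assumed to be a dictionary entry only because it is used in the second synonym substitution example.

```python
def build_synonym_index(pairs):
    """Register each (notation 1, notation 2) pair in both directions."""
    index = {}
    for first, second in pairs:
        index[first] = second
        index[second] = first
    return index

def substitute(tokens, index, word):
    """Replace occurrences of `word` with its synonym, if one is registered."""
    return [index[t] if t == word and t in index else t for t in tokens]

synonyms = build_synonym_index([
    ("ra-men", "tyuuka soba"),   # "ramen" / "Chinese soba"
    ("kaimono", "syoppingu"),    # assumed entry, taken from the example
])

before = ["konban", "ha", "tyuuka soba", "ga", "osusume", "desu"]
after = substitute(before, synonyms, "tyuuka soba")
print(" ".join(after))
```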

The following shows examples of response sentences before conversion and post-conversion candidates for seven-five meter candidate generation by synonym substitution.

A first example of seven-five meter candidate generation by synonym substitution

Response sentence (before conversion)
(Japanese): Chinese soba is recommended tonight.
(Romaji): konban ha tyuuka soba ga osusume desu
(English): Chinese noodle is recommended tonight

Response sentence (post-conversion candidate)
(Japanese): Ramen is recommended tonight, you know.
(Romaji): konban ha ra-men ga osusume yo

In this first example of seven-five meter candidate generation by synonym substitution, the word "tyuuka soba" ("Chinese soba") in the response sentence (before conversion) is replaced with its synonym "ra-men" ("ramen") so that the response sentence (post-conversion candidate) is in seven-five meter. In this first example, however, unnecessary part removal is applied to "desu" in the response sentence (before conversion), and the ending "yo (final particle)" is added.

A second example of seven-five meter candidate generation by synonym substitution

Response sentence (before conversion)
(Japanese): Shopping is planned for the day after tomorrow.
(Romaji): asatte ha kaimono no yotei desu
(English): I will go shopping the day after tomorrow.

Response sentence (post-conversion candidate)
(Japanese): Shopping is planned for the day after tomorrow.
(Romaji): asatte ha syoppingu no yotei desu

In this second example of seven-five meter candidate generation by synonym substitution, the word "kaimono" in the response sentence (before conversion) is replaced with its synonym "syoppingu" (the loanword for "shopping") so that the response sentence (post-conversion candidate) is in seven-five meter.

(H) Combination of first candidate generation processing to seventh candidate generation processing

Each of the first to seventh candidate generation processes described above can of course be used on its own as a candidate generation pattern, but a candidate generation pattern that combines them may also be used to generate a response sentence (post-conversion candidate) in seven-five meter.

Note that seven-five meter candidate generation by (A) particle removal and (B) unnecessary part removal can be classified as a first case, in which words included in the response sentence (before conversion) are removed. In addition, (C) ending addition, (D) semantically invariant word addition, (E) repetition addition, and (F) onomatopoeia addition can be classified as a second case, in which words are added to the response sentence (before conversion). Furthermore, (G) synonym substitution can be classified as a third case, in which a word included in the response sentence (before conversion) is replaced.

In the following, as an example of combining any of the first to seventh candidate generation processes, the response sentence before conversion and the post-conversion candidate are shown for seven-five meter candidate generation that combines ending addition and synonym substitution.

An example of seven-five meter candidate generation by ending addition & synonym substitution

Response sentence (before conversion)
(Japanese): Today's fortune-telling result is very good.
(Romaji): kyou no uranai kekka ha totemo ii desu
(English): Today's fortunetelling results are very good.

Response sentence (post-conversion candidate)
(Japanese): Today's fortune-telling result is good.
(Romaji): kyou no ne uranai kekka ha guddo desu

In this example of seven-five meter candidate generation by ending addition & synonym substitution, the ending "ne" is added after "kyou no" ("today's") in the response sentence (before conversion), and the words "totemo ii" ("very good") are replaced with the synonym "guddo" ("good"), so that the response sentence (post-conversion candidate) is in seven-five meter.
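A combined candidate generation pattern can be sketched as a sequence of token-level steps applied in order. This is a minimal sketch under assumptions: the helper names insert_after, replace_phrase, and apply_pattern are hypothetical, and the steps operate on a whitespace-tokenized romanized sentence instead of the morpheme analysis result.

```python
from functools import reduce

def insert_after(anchor, word):
    """Step that inserts `word` right after the first occurrence of `anchor`."""
    def step(tokens):
        if anchor not in tokens:
            return tokens
        i = tokens.index(anchor) + 1
        return tokens[:i] + [word] + tokens[i:]
    return step

def replace_phrase(old, new):
    """Step that replaces the first occurrence of the token sequence `old`."""
    def step(tokens):
        n = len(old)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == old:
                return tokens[:i] + new + tokens[i + n:]
        return tokens
    return step

def apply_pattern(tokens, steps):
    """Apply the steps of a combined candidate generation pattern in order."""
    return reduce(lambda acc, step: step(acc), steps, tokens)

before = "kyou no uranai kekka ha totemo ii desu".split()
pattern = [
    insert_after("no", "ne"),                      # ending addition
    replace_phrase(["totemo", "ii"], ["guddo"]),   # synonym substitution
]
result = " ".join(apply_pattern(before, pattern))
print(result)
```

Expressing each process as a function from tokens to tokens keeps the removal, addition, and replacement cases interchangeable, so any combination can be written as a list of steps.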

(I) Examples of languages other than Japanese

Although the above description covers examples in which a Japanese response sentence is converted into seven-five meter, the conversion into seven-five meter can be performed in the same manner for languages other than Japanese. Here, taking English as an example of another language, cases in which an English response sentence is converted into seven-five meter are shown.

The following shows examples of response sentences before and after conversion for seven-five meter candidate generation for English response sentences.

A first example of seven-five meter candidate generation for an English response sentence

Response sentence (before conversion)
(English): Today's weather in Tokyo is rainy.

Response sentence (post-conversion candidate)
(English): In Tokyo today's weather is rainy.

In this first example of seven-five meter candidate generation for an English response sentence, the order of "Today's weather" and "in Tokyo" is swapped between the response sentence (before conversion) and the response sentence (post-conversion candidate) so that the response sentence is in seven-five meter.

A second example of seven-five meter candidate generation for an English response sentence

Response sentence (before conversion)
(English): These are maps you want.

Response sentence (post-conversion candidate)
(English): These are maps which you want.

In this second example of seven-five meter candidate generation for an English response sentence, "which" is inserted after "maps" in the response sentence (before conversion) so that the response sentence is in seven-five meter.

A third example of seven-five meter candidate generation for an English response sentence

Response sentence (before conversion)
(English): This is recommendation for you.

Response sentence (post-conversion candidate)
(English): This is a recommended restaurant.

In this third example of seven-five meter candidate generation for an English response sentence, "for you" in the response sentence (before conversion) is deleted and "a" and "restaurant" are inserted so that the response sentence is in seven-five meter.

A fourth example of seven-five meter candidate generation for an English response sentence

Response sentence (before conversion)
(English): You got a mail.

Response sentence (post-conversion candidate)
(English): Just you got a mail you're happy?

In this fourth example of seven-five meter candidate generation for an English response sentence, "Just" and "you're happy?" are inserted into the response sentence (before conversion) so that the response sentence is in seven-five meter.

Note that, although seven-five meter also exists in languages other than Japanese, such as the English mentioned above, in some languages it is not as familiar as it is in Japanese. When such a language is processed, the threshold value in the threshold processing that uses user feedback information may therefore be set higher than the threshold value used for Japanese.

The first embodiment has been described above.

In the first embodiment, the response (text information) to the user's utterance is converted into text in seven-five meter and output, so that more human-like speech (for example, pleasant-sounding speech) can be produced under the seven-five meter constraint.

That is, in recent years, products and services that conduct dialogue by voice have become widespread, but when a dialogue is conducted by voice, the response is often monotonous, as in "Tomorrow's weather in Yokohama is fine.", and clearly differs from dialogue between humans. Depending on the user, it can then be expected that, for example, the user is not entertained, cannot enjoy the dialogue, does not retain it in memory, or does not want to continue using the system, so there has been a demand for more human-like speech.

Therefore, in the first embodiment, the response to the user's utterance is returned in seven-five meter, which gives it a human touch and makes the dialogue with the system enjoyable. For example, by rendering the above-mentioned "Tomorrow's weather in Yokohama is fine." in "5 sounds, 7 sounds, 5 sounds", as in the post-conversion candidate ending in "hare nanosa", a more human-like dialogue can be achieved.

Further, in the first embodiment, using the user feedback information and the context information makes the conversion into seven-five meter more appropriate when the seven-five meter constraint is applied.

That is, some users may find a response in seven-five meter unpleasant or annoying, and such a response may be inappropriate in some situations. Therefore, the user's reactions to the system's utterances (for example, whether the reaction was good or bad) in past situations similar to the current one are collected, and a seven-five meter conversion process that drew a good reaction from the user is selected preferentially. This makes it possible, for example, to give a seven-five meter response suited to the time, place, and occasion (TPO).

In addition, by using the present technology, a character (personality) can be given to the agent of the dialogue system through more human-like speech under the seven-five meter constraint.

In addition, by using the present technology, even if the accuracy of the speech synthesis (TTS) by the speech synthesis unit 157 is low (for example, the intonation is unstable), applying the seven-five meter constraint can mask that low accuracy. Furthermore, even if the speech intention understanding unit 152 cannot analyze the user's intention, returning the response in seven-five meter can increase the likelihood that the user will accept that the system could not understand the intention.

Note that the seven-five meter candidate generation processes described above are examples, and any process may be adopted as long as it can convert the response sentence (before conversion) into seven-five meter. For example, when the user says "Put on a piece of classical music", the system would usually respond with something like "There is no classical music", but here a case frame dictionary may be used to give a response that refers to, for example, "mayonnaise".

That is, in addition to the meaning of "putting on music", the expression "put on" also has the meaning of "putting on a seasoning" such as "mayonnaise", and a human-like dialogue is achieved by deliberately using it in that sense. A case frame is a list, organized by word usage, of words and the nouns related to them. In addition, when the response sentence (before conversion) is converted into seven-five meter, the speaking style as a whole may be given priority.

<2. Second embodiment>

In the first embodiment described above, when the user's dialogue, that is, the content of the user's utterance, is analyzed and a response to that content is output, a more human-like response is produced under the seven-five meter constraint.

The present technology is not limited to such a dialogue system and can be applied to various scenes involving speech, for example, a scene in which a character such as an avatar reads aloud a document (text) such as news or a weather forecast. Therefore, a case where the present technology is applied to a news program production system will be described below as a second embodiment.

(Example of a news program production system)
FIG. 11 shows an example of a news program production system to which the present technology is applied.

In FIG. 11, the information processing apparatus 10 is configured as a part of a news program production system in which, for example, a female character reads aloud text information such as news or weather forecasts by speech synthesis on a large-screen display installed in a city or in front of a station.

At this time, the female character reads aloud the text information converted into seven-five meter by applying the present technology, for example, between news items or at the end of the program.

By producing more human-like speech under the seven-five meter constraint in this way, the character can be presented as, for example, slightly goofy, giving users the impression that she has an unexpected side. This can increase the number of users interested in the female character and, as a result, the number of viewers of the news program.

Here, when the present technology is applied to a news program production system, the user feedback information can be obtained by, for example, analyzing an image captured by the camera 111 provided in the information processing apparatus 10 and calculating the score for each candidate generation pattern from the expressions of passers-by watching the large-screen display installed in the city or elsewhere.

Here too, by using a service that provides the above-described API for face and image recognition and sending it a captured image containing the faces of many passers-by, information about the passers-by's expressions and emotions, such as pleasure or surprise, can be obtained.

For example, when it is recognized that many passers-by have a pleasant expression, the score is increased, whereas when it is recognized that many passers-by have an unpleasant expression, the score is decreased.
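The score adjustment described above can be sketched as follows. This is a hedged sketch only: the function name update_score, the expression labels "pleasant" and "unpleasant", the step size of 1, and the reading of "many passers-by" as a simple majority are all assumptions, and the expression labels are taken as already obtained from whatever recognition service is used.

```python
from collections import Counter

def update_score(scores, pattern, expressions, step=1):
    """Return a copy of `scores` in which the score of `pattern` is raised
    when pleasant expressions outnumber unpleasant ones and lowered in the
    opposite case. Majority voting and the step size are assumptions."""
    counts = Counter(expressions)
    updated = dict(scores)
    if counts["pleasant"] > counts["unpleasant"]:
        updated[pattern] = updated.get(pattern, 0) + step
    elif counts["unpleasant"] > counts["pleasant"]:
        updated[pattern] = updated.get(pattern, 0) - step
    return updated

scores = {"repetition addition": 10}
liked = update_score(scores, "repetition addition",
                     ["pleasant", "pleasant", "unpleasant"])
disliked = update_score(scores, "repetition addition",
                        ["unpleasant", "unpleasant", "neutral"])
print(liked, disliked)
```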

Note that, when users watch the news program at home on a television receiver or a smartphone, they may, for example, operate a remote controller or an application running on the smartphone to vote (for example, good or bad) on the female character's seven-five meter utterances.

Also here, as the context information, other than time information and position information, for example, information obtained from an analysis result of a captured image captured by the camera 111 provided in the information processing apparatus 10 can be used.

In the news program production system to which the present technology is applied, there is no need to recognize the user's speech or to generate a response, so, of the functions of the information processing unit 150 described above, it suffices to include, for example, the functions corresponding to the context acquisition unit 155, the seven-five meter conversion unit 156, the speech synthesis unit 157, and the user feedback collection unit 158.

Further, although speech by the female character has been described above as an example, the output is not limited to speech; the content of the news and weather forecasts and the content of the seven-five meter utterances may be displayed as text information on the screen of the large-screen display or another display.

The second embodiment has been described above.

In the second embodiment, text information intended to be read out by speech synthesis (for example, an utterance between news items) is converted into seven-five meter and output, which makes it possible to produce more human-like speech (for example, speech that conveys a sense of closeness) under the seven-five meter constraint. Further, also in the second embodiment, using the user feedback information and the context information makes the conversion into seven-five meter more appropriate when the seven-five meter constraint is applied.

<3. Third embodiment>

Further, although the news program production system has been described in the second embodiment as a configuration other than the dialogue system, the present technology can also be applied to other configurations, for example, a scene in which commercials (CMs) are played on digital signage. Therefore, the case where the present technology is applied to a digital signage system will be described below as a third embodiment.

(Example of digital signage system)
FIG. 12 shows an example of a digital signage system to which the present technology is applied.

In FIG. 12, the information processing apparatus 10 is configured as a part of a digital signage system, is installed indoors at, for example, a station or a commercial facility, and displays information such as advertisements or guidance.

At this time, for example, the present technology is applied in the interval between one CM and the next to read out text information converted into seven-five meter. More specifically, as shown in FIG. 12, between a car CM played at a certain time and a detergent CM played after it, a seven-five meter utterance whose content is likely to interest a passerby walking near the digital signage is output as speech, for example.

In this way, by enabling more human-like speech through the seven-five meter constraint, it is possible, for example, to leave an impression of the product advertised in a CM and to get passersby interested in that product.

Although speech output has been described here as an example, the output is not limited to speech, and the text information converted into seven-five meter may be displayed on the screen of the digital signage.

Here, as the user feedback information in the case where the present technology is applied to a digital signage system, for example, an image captured by the camera 111 provided in the information processing apparatus 10 can be analyzed, and the score for each candidate generation pattern can be calculated from the facial expressions of passersby viewing the digital signage installed indoors at a station or the like.
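The scoring described above could be sketched as follows. The expression labels, their numeric values, and the use of a simple average are assumptions made for illustration, and the image analysis that would yield the labels is outside the sketch. The name score_patterns is hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a recognized facial expression to a reaction value.
EXPRESSION_VALUE = {"smile": 1.0, "neutral": 0.0, "frown": -1.0}


def score_patterns(observations):
    """Average the reaction value per candidate generation pattern.

    observations: iterable of (pattern_name, expression_label) pairs, one per
    passerby observed while an utterance generated by that pattern was output.
    Unknown expression labels are treated as neutral (0.0).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pattern, expression in observations:
        totals[pattern] += EXPRESSION_VALUE.get(expression, 0.0)
        counts[pattern] += 1
    return {pattern: totals[pattern] / counts[pattern] for pattern in totals}


# Hypothetical observations collected in front of the signage.
observed = [
    ("ending_addition", "smile"),
    ("ending_addition", "neutral"),
    ("particle_removal", "frown"),
]
print(score_patterns(observed))
```

Averaging keeps the score comparable between patterns that were observed a different number of times; a weighted or time-decayed score would be an equally plausible choice.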

In the digital signage system to which the present technology is applied, there is no need to recognize the user's speech or to generate a response. Therefore, of the functions of the information processing unit 150 shown in FIG. 2 (the speech recognition unit 151 to the user feedback collection unit 158), it suffices to include, for example, the functions corresponding to the context acquisition unit 155, the seven-five meter conversion unit 156, the speech synthesis unit 157, and the user feedback collection unit 158.

The third embodiment has been described above.

In the third embodiment, text information intended to be read out by speech synthesis (for example, an utterance output between CMs) is converted into seven-five meter and output, so that the seven-five meter constraint makes more human-like speech possible (for example, speech that arouses interest in a product). Also in the third embodiment, using the user feedback information and the context information makes it possible to convert the text into seven-five meter more appropriately when applying the constraint.

<4. Modified example>

(Example of configuration of dialogue system)
In the above description, the case where the dialogue service is realized by (the information processing unit 150 of) the information processing apparatus 10 executing the dialogue processing has been exemplified. As a configuration for realizing such a dialogue service, however, a configuration such as that shown in FIG. 13 can also be employed, for example.

In FIG. 13, the dialogue system 1 comprises the information processing apparatus 10, which is installed on the local side, such as in a user's home, and functions as a user interface of the dialogue service, and the server 20, which is installed on the cloud side, such as in a data center, and performs processing for realizing the dialogue function of the dialogue service.

In the dialogue system 1, the information processing apparatus 10 and the server 20 are mutually connected via the Internet 30.

The information processing apparatus 10 is, for example, a speaker that can be connected to a network such as a home LAN, and is also called a smart speaker or a home agent. In addition to playing music, this type of speaker can, for example, hold a voice dialogue with the user and perform voice operation of devices such as lighting fixtures and air conditioners.

The information processing apparatus 10 can provide the dialogue service (its user interface) to the user by cooperating with the server 20 via the Internet 30.

That is, the information processing apparatus 10 picks up the voice uttered by the user (the user's utterance), and transmits the voice data to the server 20 via the Internet 30. Further, the information processing apparatus 10 receives processing data transmitted from the server 20 via the Internet 30, and outputs a voice corresponding to the processing data.

The server 20 is a server that provides a cloud-based dialogue service. The server 20 performs speech recognition processing for converting voice data transmitted from the information processing apparatus 10 via the Internet 30 into text information. Further, the server 20 performs processing such as dialogue processing according to the user's intention on the text information, and transmits processing data obtained as a result of the processing to the information processing apparatus 10 via the Internet 30.

Although the dialogue system 1, which generates a response to the user's speech, has been described for the configuration including the local side and the cloud side shown in FIG. 13, the system may instead be configured as a system that reads out text information by speech synthesis, such as the above-described news program production system or digital signage system.

In the above description, the functions of the information processing unit 150 in FIG. 2 (the speech recognition unit 151 to the user feedback collection unit 158) have been described as being incorporated in the information processing apparatus 10, but the functions of the information processing unit 150 may instead be incorporated as functions of the server 20. That is, each of the functions of the information processing unit 150 in FIG. 2 (the speech recognition unit 151 to the user feedback collection unit 158) may be incorporated in either the information processing apparatus 10 or the server 20.

For example, among the functions of the information processing unit 150 in FIG. 2, the speech recognition unit 151 to the response generation unit 154 can be incorporated in the server 20 on the cloud side, and the context acquisition unit 155 to the user feedback collection unit 158 can be incorporated in the information processing apparatus 10 on the local side.

Note that, regardless of which configuration is adopted, databases such as the user feedback information DB 202, the ending list 211, the meaning-invariant word list 212, the onomatopoeia list 213, and the synonym dictionary 214 can be managed by the server 20 on the Internet 30.

As described above, the present technology can have a cloud computing configuration in which one function is shared and processed by a plurality of devices via a network.
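As a sketch of such a division of functions, the following assigns each function of the information processing unit 150 to either the server 20 or the information processing apparatus 10, using the example split given above (151 to 154 on the cloud side, 155 to 158 on the local side). The function name place_units and the representation as a dictionary are hypothetical and chosen only for illustration.

```python
# Functions of the information processing unit 150, keyed by reference sign.
UNITS = {
    151: "speech recognition unit",
    152: "utterance intention understanding unit",
    153: "application/service execution unit",
    154: "response generation unit",
    155: "context acquisition unit",
    156: "seven-five meter conversion unit",
    157: "speech synthesis unit",
    158: "user feedback collection unit",
}

LOCAL = "information processing apparatus 10"
CLOUD = "server 20"


def place_units(cloud_units):
    """Return {reference sign: host}; units not listed in cloud_units stay local."""
    cloud = set(cloud_units)
    unknown = cloud - UNITS.keys()
    if unknown:
        raise ValueError(f"unknown units: {sorted(unknown)}")
    return {sign: (CLOUD if sign in cloud else LOCAL) for sign in UNITS}


# The example split: 151 to 154 on the cloud side, the rest on the local side.
placement = place_units(range(151, 155))
for sign, host in placement.items():
    print(sign, UNITS[sign], "->", host)
```

Passing an empty collection reproduces the configuration in which every function is incorporated in the information processing apparatus 10, and passing all eight signs reproduces the configuration in which every function is incorporated in the server 20.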

(Computer configuration)
The above-described series of processes (for example, the seven-five meter conversion output processing shown in FIGS. 3 and 4) can be performed by hardware or by software. When the series of processes is executed by software, a program constituting the software is installed in the information processing apparatus 10 (computer) having the configurations shown in FIG. 1 and FIG. 2.

The program executed by the information processing apparatus 10 (CPU 101) of FIG. 1 can be provided by being recorded on, for example, a removable recording medium as a package medium or the like. The removable recording medium is composed of, for example, a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

Also, the program can be provided via a wired or wireless transmission medium such as a local area network or digital satellite broadcasting besides the Internet 30.

In the information processing apparatus 10 of FIG. 1, the program can be installed in a recording apparatus such as the hard disk 105 via the information access unit 104 by attaching a removable recording medium to a drive (not shown).

Further, the program can be received by the communication I/F 116 via a wired or wireless transmission medium, and can be installed in a recording device such as the hard disk 105. In addition, the program can be installed in advance in the ROM 102, a recording device, or the like.

Here, in the present specification, the processing performed by the information processing apparatus 10 (CPU 101) of FIG. 1 according to the program does not necessarily need to be performed chronologically in the order described in the flowcharts described above. That is, the processing performed by the information processing apparatus 10 (CPU 101) of FIG. 1 according to the program also includes processing executed in parallel or individually (for example, parallel processing or object-based processing).

Further, the program may be processed by one computer (processor) or may be distributed and processed by a plurality of computers. That is, each step of the seven-five meter conversion output processing shown in FIGS. 3 and 4 can be executed by one device or can be shared and executed by a plurality of devices. Furthermore, in the case where a plurality of processes are included in one step, the plurality of processes included in that step can be executed by one device or can be shared and executed by a plurality of devices.

Note that the embodiments of the present technology are not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present technology.

Further, the present technology can have the following configurations.

(1)
An information processing apparatus, comprising: a processing unit that converts input text information into seven-five meter and outputs the converted text information.
(2)
The information processing apparatus according to (1), wherein the processing unit converts the text information into seven-five meter according to user feedback information obtained by feedback from a user.
(3)
The information processing apparatus according to (1) or (2), wherein the processing unit converts the text information into seven-five meter according to context information on the text information.
(4)
The information processing apparatus according to (3), wherein the processing unit selects, with reference to the user feedback information, a seven-five meter conversion process suited to the context information, and executes the selected seven-five meter conversion process on the text information.
(5)
The information processing apparatus according to (4), wherein
the user feedback information includes information in which past environment information and the user's reaction are scored for each candidate generation pattern for generating seven-five meter candidates,
the context information includes current environment information, and
the processing unit selects a candidate generation pattern for which the score of past environment information that is the same as or similar to the current environment information, among the past environment information, is equal to or higher than a threshold.
(6)
The information processing apparatus according to (5), wherein the processing unit randomly selects one candidate generation pattern when a plurality of candidate generation patterns can be selected.
(7)
The information processing apparatus according to (5) or (6), wherein the candidate generation pattern includes any one of, or a combination of a plurality of, the following kinds of seven-five meter candidate generation:
particle removal, which removes a particle to generate a candidate that fits the seven-five meter without the particle;
unnecessary part deletion, which deletes a semantically unnecessary part to generate a candidate that fits the seven-five meter;
ending addition, which adds a sentence ending to generate a candidate that fits the seven-five meter;
meaning-invariant word addition, which adds a word that does not change the meaning to generate a candidate that fits the seven-five meter;
repetition addition, which repeats a word to generate a candidate that fits the seven-five meter;
onomatopoeia addition, which adds an onomatopoeic word to generate a candidate that fits the seven-five meter; and
synonym replacement, which replaces a word with a synonym to generate a candidate that fits the seven-five meter.
(8)
The information processing apparatus according to any one of (5) to (7), wherein the context information includes information indicating at least one of a time of day, a place, a speaker, people present, and the atmosphere of the place.
(9)
The information processing apparatus according to any one of (1) to (8), wherein the text information is a response to an utterance of a user.
(10)
The information processing apparatus according to any one of (1) to (8), wherein the text information is information intended to be read out by speech synthesis.
(11)
An information processing method of an information processing apparatus, wherein the information processing apparatus converts input text information into seven-five meter and outputs the converted text information.
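Configurations (5) and (6) can be illustrated by the following sketch, which selects a candidate generation pattern from scored feedback records. The record layout, the default exact-match similarity test, and the names select_pattern and is_similar are assumptions made for illustration and are not part of the configurations themselves.

```python
import random


def select_pattern(feedback, current_env, threshold, is_similar=None, rng=random):
    """Pick one candidate generation pattern for the current environment.

    feedback: list of dicts with keys "env", "pattern", and "score"
              (hypothetical layout of the user feedback information).
    is_similar: function (past_env, current_env) -> bool; defaults to equality.
    Returns a pattern name, or None when no pattern reaches the threshold.
    """
    if is_similar is None:
        is_similar = lambda past, current: past == current
    # Keep patterns whose score in a same-or-similar past environment
    # is equal to or higher than the threshold.
    candidates = sorted(
        {
            record["pattern"]
            for record in feedback
            if is_similar(record["env"], current_env)
            and record["score"] >= threshold
        }
    )
    if not candidates:
        return None
    # When several patterns qualify, one is chosen at random.
    return rng.choice(candidates)


# Hypothetical feedback records.
feedback = [
    {"env": "morning", "pattern": "ending_addition", "score": 0.8},
    {"env": "morning", "pattern": "particle_removal", "score": 0.2},
    {"env": "night", "pattern": "repetition_addition", "score": 0.9},
]
print(select_pattern(feedback, "morning", 0.5))
```

Sorting the candidates before the random choice makes the result reproducible when a seeded random.Random instance is passed as rng.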

Reference Signs List

1 dialogue system, 10 information processing apparatus, 20 server, 30 Internet, 100 control unit, 101 CPU, 102 ROM, 103 RAM, 150 information processing unit, 151 speech recognition unit, 152 utterance intention understanding unit, 153 application/service execution unit, 154 response generation unit, 155 context acquisition unit, 156 seven-five meter conversion unit, 157 speech synthesis unit, 158 user feedback collection unit, 201 context information DB, 202 user feedback information DB, 211 ending list, 212 meaning-invariant word list, 213 onomatopoeia list, 214 synonym dictionary

Claims

An information processing apparatus, comprising: a processing unit that converts input text information into seven-five meter and outputs the converted text information.
The information processing apparatus according to claim 1, wherein the processing unit converts the text information into seven-five meter according to user feedback information obtained by feedback from a user.
The information processing apparatus according to claim 2, wherein the processing unit converts the text information into seven-five meter according to context information on the text information.
The information processing apparatus according to claim 3, wherein the processing unit selects, with reference to the user feedback information, a seven-five meter conversion process suited to the context information, and executes the selected seven-five meter conversion process on the text information.
The information processing apparatus according to claim 4, wherein the user feedback information includes information in which past environment information and the user's reaction are scored for each candidate generation pattern for generating seven-five meter candidates, the context information includes current environment information, and the processing unit selects a candidate generation pattern for which the score of past environment information that is the same as or similar to the current environment information, among the past environment information, is equal to or higher than a threshold.
The information processing apparatus according to claim 5, wherein the processing unit randomly selects one candidate generation pattern when a plurality of candidate generation patterns can be selected.
The information processing apparatus according to claim 5, wherein the candidate generation pattern includes any one of, or a combination of a plurality of, the following kinds of seven-five meter candidate generation:
particle removal, which removes a particle to generate a candidate that fits the seven-five meter without the particle;
unnecessary part deletion, which deletes a semantically unnecessary part to generate a candidate that fits the seven-five meter;
ending addition, which adds a sentence ending to generate a candidate that fits the seven-five meter;
meaning-invariant word addition, which adds a word that does not change the meaning to generate a candidate that fits the seven-five meter;
repetition addition, which repeats a word to generate a candidate that fits the seven-five meter;
onomatopoeia addition, which adds an onomatopoeic word to generate a candidate that fits the seven-five meter; and
synonym replacement, which replaces a word with a synonym to generate a candidate that fits the seven-five meter.
The information processing apparatus according to claim 5, wherein the context information includes information indicating at least one of a time of day, a place, a speaker, people present, and the atmosphere of the place.
The information processing apparatus according to claim 1, wherein the text information is a response to an utterance of a user.
The information processing apparatus according to claim 1, wherein the text information is information intended to be read out by speech synthesis.
An information processing method of an information processing apparatus, wherein the information processing apparatus converts input text information into seven-five meter and outputs the converted text information.