WO2021161841A1

WO2021161841A1 - Information processing device and information processing method

Info

Publication number: WO2021161841A1
Application number: PCT/JP2021/003600
Authority: WO
Inventors: 広岩瀬; 真一河野
Original assignee: ソニーグループ株式会社
Priority date: 2020-02-10
Filing date: 2021-02-01
Publication date: 2021-08-19

Abstract

[Problem] To implement smooth text communication between users. [Solution] This information processing device is provided with: a paralinguistic information acquisition unit that, on the basis of a sensing signal obtained by sensing a first user, acquires paralinguistic information of the first user; a candidate generation unit that, on the basis of the paralinguistic information, generates candidates for a message to be transmitted to the first user; and a transmission unit that transmits the candidates for the message to a device of a second user that performs a message exchange with the first user.

Description

Information processing device and information processing method

This disclosure relates to an information processing device and an information processing method.

With the spread of voice recognition, it is expected that there will be more opportunities to quickly input characters by voice utterance in text communication such as SNS, chat, and email.

A voice utterance carries information that is not transcribed, such as the speaker's intention, attitude, and emotion. For smoother communication, it is desirable that the other party understand the content of the utterance, including the speaker's intention as well as the textual information, before replying. However, with current text communication tools that use voice recognition, the content of the utterance may be conveyed to the other party with an intention different from the speaker's, which may cause a discrepancy in communication.

In Patent Document 1 below, a user selects part of a message received from another device; candidates for a reply message are generated based on the selected information and its additional information (position, time, etc.) and presented to the user, who then selects one of them. However, this technique cannot generate a reply message based on the user's intention.

Patent Document 1: International Publication No. 2016/084481

The present disclosure provides an information processing device and an information processing method that enable smooth text communication between users.

The information processing apparatus of the present disclosure includes a para-language information acquisition unit that acquires para-language information of a first user based on a sensing signal obtained by sensing the first user, a candidate generation unit that generates, based on the para-language information, candidates for a message to be transmitted to the first user, and a transmission unit that transmits the message candidates to a device of a second user who exchanges messages with the first user.

The sensing signal includes the voice signal of the first user.
The information processing device includes a voice recognition processing unit that voice-recognizes the voice signal of the first user and acquires the text data of the first message spoken by the first user.
The candidate generation unit generates a candidate for the second message returned by the second user in response to the first message.
The transmission unit may transmit the text data and the candidate for the second message to the device of the second user.

The information processing device may include a receiving unit that receives, from the device of the second user, a reply message including the candidate selected from the message candidates, and a display unit that displays the reply message received by the receiving unit.

The sensing signal includes the voice signal of the first user.
The para-language information acquisition unit may acquire the para-language information based on the acoustic feature information of the voice signal of the first user.

The sensing signal includes an image pickup signal of the first user.
The para-language information acquisition unit may perform image recognition based on the image pickup signal of the first user and acquire the para-language information based on the result of the image recognition.

The information processing device includes a natural language processing unit that estimates the intention of the first user's utterance and the target of the utterance based on the text data.
The candidate generation unit may generate a candidate for the second message based on the intention of the utterance, the target of the utterance, and the para-language information.

The information processing device includes a reply phrase database that stores a plurality of phrases.
The candidate generation unit may specify a phrase corresponding to the intention of the utterance and the para-language information in the reply phrase database, and generate a candidate for the second message based on the specified phrase.

The para-language information may include information on whether or not the first user intends to ask a question.

The para-language information may include information that identifies a word that is emphasized in the text data.

The para-language information may include information that specifies a word delimiter position in the text data.

The para-language information may include information representing at least one of the emotion, urgency, seriousness, frankness, and tension of the first user.

The candidate generation unit may decorate the text data based on the para-language information, and the transmission unit may transmit the decorated text data.

The candidate generation unit may add a question mark to the end of the text data when the para-language information indicates that the first user intends to ask a question.

The para-language information includes information for identifying a word emphasized by the first user in the text data, and the candidate generation unit may change the appearance of the emphasized word in the text data.

The para-language information includes information for specifying a word delimiter position in the text data.
The candidate generation unit may add information for identifying the delimiter between words to the delimiter position in the text data.

The para-language information includes information representing the emotion of the first user, and the candidate generation unit may add information for identifying the emotion to the text data.
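As an illustration of the decoration described above, the following is a minimal sketch covering three of the decorations: the question mark, the emphasized word, and the emotion label. The function name `decorate`, its parameters, and the concrete markers (asterisks around an emphasized word, a bracketed emotion label) are assumptions made for this example; the disclosure does not prescribe any particular notation.

```python
from typing import Iterable, Optional


def decorate(text: str, *, is_question: bool = False,
             emphasized: Iterable[str] = (),
             emotion: Optional[str] = None) -> str:
    """Decorate utterance text; the markers used here are illustrative assumptions."""
    emphasized_set = set(emphasized)
    # Assumption: an emphasized word is shown between asterisks.
    words = [f"*{w}*" if w in emphasized_set else w for w in text.split()]
    decorated = " ".join(words)
    # A question mark is appended when the para-language information indicates a question.
    if is_question and not decorated.endswith("?"):
        decorated += "?"
    # Assumption: the emotion is shown as a bracketed label after the text.
    if emotion:
        decorated += f" [{emotion}]"
    return decorated


print(decorate("I can return at 8 o'clock today", is_question=True))
```

A real implementation could equally change the font, color, or size of the emphasized word, or use an icon for the emotion; plain-text markers are used here only to keep the sketch self-contained.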

The information processing device may include a display unit that displays the decorated text data.

The information processing device includes a translation processing unit that translates the text data and the candidate for the second message into the language used by the second user.
The transmission unit may transmit the text data translated into the language used by the second user and the candidate for the second message translated into the language used by the second user.

The second user may be a human or a computer system.

The information processing method of the present disclosure includes:
acquiring para-language information of a first user based on a sensing signal obtained by sensing the first user;
generating, based on the para-language information, candidates for a message to be transmitted to the first user; and
transmitting the message candidates to a device of a second user who exchanges messages with the first user.

A block diagram showing a configuration example of the information processing system according to the first embodiment of the present disclosure.
A block diagram of the information processing device.
A diagram for explaining a specific example of the present embodiment.
A diagram showing an example of the reply phrase DB.
A block diagram of the receiving device.
A diagram showing an example in which the transmitted message information is displayed on the display unit.
A diagram showing an example in which an audio output button is arranged.
A diagram showing an example in which the selection result information is displayed on the display unit.
A diagram showing an example in which the user 1 continues the dialogue with the user 2.
A diagram for explaining specific example 1.
A diagram showing an example of the reply phrase DB according to specific example 1.
A diagram showing an example in which the transmitted message information is displayed.
A diagram for explaining specific example 2.
A diagram showing an example of the reply phrase DB according to specific example 2.
A diagram showing an example in which the transmitted message information is displayed.
A diagram for explaining specific example 3.
A diagram showing an example of the reply phrase DB according to specific example 3.
A diagram showing an example in which the transmitted message information is displayed.
A flowchart of an example of an operation in which the candidate generation unit generates candidates for a reply message.
A block diagram of the receiving device according to modification 2.
An explanatory diagram of modification 3.
A block diagram of the information processing device according to modification 4.
A diagram showing a display example of a menu for reproducing an audio sample.
A diagram showing an example of the hardware configuration of the information processing apparatus of FIG.

Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. In one or more embodiments set forth in the present disclosure, the elements included in each embodiment can be combined with each other, and the resulting combinations also form part of the embodiments set forth in the present disclosure.

FIG. 1 is a block diagram showing a configuration example of an information processing system according to the first embodiment of the present disclosure. The information processing system of FIG. 1 includes an information processing device 101 mounted on a terminal for user 1 (the speaker) and a receiving device 201 mounted on a terminal for user 2, who exchanges text-based messages with user 1. The information processing device 101 is operated by the user 1, and the receiving device 201 is operated by the user 2. The information processing device 101 and the receiving device 201 are connected to each other via the communication network 301. The user 1 on the transmitting side uses the information processing device 101 to interact with the user 2 on the receiving side by exchanging text-based messages.

The communication network 301 is a wired, wireless, or mixed wired and wireless network. The communication network 301 may be a local area network (LAN) or a wide area network (WAN) such as the Internet. The communication network 301 may be a network of any standard or protocol. For example, the communication network 301 may be a wireless LAN, a 4G or 5G mobile network, or the like.

The terminal on which the information processing device 101 is mounted and the terminal on which the receiving device 201 is mounted may each be any terminal operated by a user, such as a mobile terminal, a personal computer (PC), or a wearable device. Examples of mobile terminals include smartphones, tablet terminals, and mobile phones. Examples of personal computers include desktop PCs and notebook PCs. Examples of wearable devices include AR (Augmented Reality) glasses, MR (Mixed Reality) glasses, and VR (Virtual Reality) head-mounted displays.

The outline of this information processing system will be explained. The information processing device 101 converts a voice-signal message uttered by the user 1 into a text-data message by voice recognition processing. The converted text-data message is called an utterance text or a voice recognition text. Further, the information processing device 101 performs acoustic analysis recognition processing on the voice signal and obtains acoustic feature information (for example, the amount of change in the fundamental frequency (pitch), the utterance frequency of each word, the volume of each word, the utterance speed of each word, the time interval before and after the utterance of a word, the length of a silent section, and the spectrum).

The information processing device 101 acquires para-language information, which is information not included in the utterance text, such as the speaker's intention, attitude, and emotion, as para-language recognition processing, based on the acoustic feature information. The information processing device 101 generates one or more message candidates that the user 2 returns to the user 1 based on the acquired para-language information. The information processing device 101 transmits the utterance text and the candidate of the reply message to the receiving device 201.

The receiving device 201 displays the utterance text of the user 1 and the candidate of the reply message to the user 1. The user 2 confirms the message spoken by the user 1 by viewing the spoken text. User 2 selects a reply to this message from the displayed reply message candidates. The receiving device 201 transmits the reply message selected by the user 2 to the information processing device 101. The information processing device 101 displays the received reply message on the display unit. The user 1 can confirm the message replied by the user 2 by viewing the reply message displayed on the display unit.

In the present embodiment, candidates for the message to be returned to the user 1 are generated based on the para-language information of the user 1, so that para-language information not included in the utterance text of the user 1, such as the speaker's intention, attitude, and emotion, can be reflected in the reply message candidates. For example, when the para-language information indicates that the user 1 intends to ask a question, the utterance text is interpreted as a question, and answers to the question are generated as reply message candidates. On the other hand, when the para-language information indicates that the user 1 does not intend to ask a question, reply message candidates containing an opinion of the user 2, such as a positive or negative opinion on the content of the utterance text, are generated. In this way, the user 2 can send a reply message whose content reflects the para-language information of the user 1. That is, the intention behind the utterance of the user 1 is correctly conveyed to the user 2, and the user 2 can send a reply message after correctly grasping that intention. Hereinafter, the present embodiment will be described in more detail.

FIG. 2 is a block diagram of the information processing device 101. The information processing device 101 includes a voice input unit 111, a voice recognition processing unit 112, a natural language understanding processing unit 113, a candidate generation unit 115, a reply phrase database (DB) 116, an image input unit 117, a transmission / reception unit 119, a para-language information acquisition unit 120, an image output processing unit 121, a display unit 122, an audio output processing unit 131, and an audio output unit 132. The para-language information acquisition unit 120 includes an acoustic analysis recognition processing unit 114 and an image recognition processing unit 118.

Part or all of these elements included in the information processing device 101 are composed of hardware, software, or a combination thereof. The hardware includes a processor such as a CPU or a dedicated circuit as an example. The reply phrase DB 116 is composed of a storage device such as a memory device or a hard disk device. The reply phrase DB 116 may be provided as an external device of the information processing device 101 or as a database server on the communication network. Further, a clock for counting the time may be provided in the information processing device 101. Further, the information processing apparatus 101 may be provided with an operation input unit for inputting various instructions or data.

The voice input unit 111 senses the voice of the message uttered by the user 1 and converts the sensed signal into an electric signal. The converted electrical signal is called an audio signal. The voice input unit 111 is, for example, a microphone. The voice input unit 111 provides the voice signal to the voice recognition processing unit 112 and the acoustic analysis recognition processing unit 114. The user 1 utters a message to be transmitted to the user 2, for example. The voice input unit 111 may start the operation when the user 1 gives an instruction to collect the sound, and may end the operation when the user 1 gives an instruction to end the sound collection. As an example, the sound collection start instruction and the sound collection end instruction may be given by pressing the sound collection button provided on the terminal body or the touch panel.

The image input unit 117 senses the image of the user 1 and obtains an image pickup signal. The image input unit 117 is a sensing device such as a camera. The image may be a still image or a moving image. As an example, the image input unit 117 performs imaging at regular time intervals. The object of imaging may be any part of the body of the user 1, such as the face, upper body, and whole body of the user 1. The image input unit 117 provides the image pickup signal to the image recognition processing unit 118. The operations of the image input unit 117 and the audio input unit 111 may be synchronized. For example, the image input unit 117 may perform the imaging operation only when the audio input unit 111 is operating.

The voice recognition processing unit 112 converts the voice signal into a text data message (spoken text) by performing voice recognition processing on the voice signal input from the voice input unit 111. The utterance text includes the linguistic information of the user 1's utterance. In the voice recognition process, for example, each phoneme included in the voice spoken by the user 1 is identified based on the acoustic model of the phoneme, and a text is generated based on the identified phoneme and the language model. The voice recognition processing unit 112 provides the generated utterance text to the natural language understanding processing unit 113, the candidate generation unit 115, and the transmission / reception unit 119.

The natural language understanding (NLU: Natural Language Understanding) processing unit 113 performs natural language understanding processing on the utterance text (linguistic information) to estimate the intent of the utterance (Intent) and the target of the utterance within that intent (Entity). That is, the intention with which the utterance was made is estimated by categorization, and the word that is the target of the utterance within that intention is estimated by categorization as the Entity. An Entity has a value and a type.

FIG. 3 is a diagram for explaining a specific example of the present embodiment. For example, suppose user 1 utters "I can return at 8 o'clock today". In this case, the voice recognition processing unit 112 generates "I can return at 8 o'clock today" as the utterance text (voice recognition text). By natural language understanding processing, the natural language understanding processing unit 113 estimates "GoHome" (returning home) as the intent of the utterance (Intent) from the utterance text "I can return at 8 o'clock today", and estimates "8 o'clock" (a time) as the target of the utterance (Entity). The type of the Entity is "time" and the value of the Entity is "8 o'clock". The natural language understanding processing unit 113 learns an algorithm for estimating the Intent and the Entity in advance, for example by machine learning.
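The result of the natural language understanding processing in this example can be pictured as a small record holding the Intent and an Entity with a type and a value. The following is a sketch only: the names `Entity`, `NluResult`, and `estimate`, and the keyword rules inside `estimate`, are assumptions that stand in for the machine-learned estimator mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    type: str   # e.g. "time"
    value: str  # e.g. "8 o'clock"


@dataclass(frozen=True)
class NluResult:
    intent: Optional[str]
    entity: Optional[Entity]


def estimate(utterance_text: str) -> NluResult:
    """Toy stand-in for the learned Intent/Entity estimator (assumption)."""
    text = utterance_text.lower()
    # Hypothetical rule: words about returning home map to the "GoHome" Intent.
    is_go_home = re.search(r"\b(return|go home|come home)\b", text) is not None
    intent = "GoHome" if is_go_home else None
    # Hypothetical rule: "<number> o'clock" is treated as an Entity of type "time".
    match = re.search(r"\b(\d{1,2}) o'clock\b", text)
    entity = Entity("time", f"{match.group(1)} o'clock") if match else None
    return NluResult(intent, entity)


result = estimate("I can return at 8 o'clock today")
print(result.intent, result.entity)
```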

The natural language understanding processing unit 113 provides the candidate generation unit 115 with the Intent and Entity estimated by the natural language understanding processing.

The para-language information acquisition unit 120 acquires the para-language information of the user 1 based on the sensing signal sensed by the user 1. The sensing signal includes at least one of the user 1 audio signal and the image pickup signal. The para-language information acquisition unit 120 includes an acoustic analysis recognition processing unit 114 and an image recognition processing unit 118.

The acoustic analysis recognition processing unit 114 generates acoustic feature information of the voice signal by performing acoustic analysis by signal processing or a learned neural network based on the voice signal provided by the voice input unit 111. As an example of acoustic analysis, the amount of change in the fundamental frequency (pitch) of an audio signal may be extracted. In addition, the frequency of utterance of each word included in the audio signal, the volume of each word, the utterance speed of each word, and the time interval before and after the utterance of the word may be extracted. In addition, the time of a silent section (that is, a time section between utterances) included in the audio signal may be measured. In addition, the feature amount (spectrum, squeeze, etc.) of the audio signal may be extracted. The example of acoustic analysis described is only an example, and various other processes are possible.
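One of the features listed above, the time of a silent section, can be measured with simple frame-based signal processing. The sketch below is one possible approach under stated assumptions: the function name `silent_sections`, the 20 ms frame length, and the RMS threshold are illustrative choices, not values given in the disclosure.

```python
import math
from typing import List, Optional, Sequence, Tuple


def silent_sections(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,            # assumed frame length
    rms_threshold: float = 0.01,   # assumed silence threshold
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame RMS stays below the threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    sections: List[Tuple[float, float]] = []
    start: Optional[int] = None  # first sample index of the silent run in progress
    for begin in range(0, len(samples), frame_len):
        frame = samples[begin:begin + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < rms_threshold:
            if start is None:
                start = begin
        elif start is not None:
            sections.append((start / sample_rate, begin / sample_rate))
            start = None
    if start is not None:  # the silence runs to the end of the signal
        sections.append((start / sample_rate, len(samples) / sample_rate))
    return sections


# 0.1 s of sound, 0.06 s of silence, 0.1 s of sound at a 1 kHz sample rate.
signal = [0.5] * 100 + [0.0] * 60 + [0.5] * 100
print(silent_sections(signal, sample_rate=1000))
```

The length of each silent section is the difference between its end and start times.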

The acoustic analysis recognition processing unit 114 performs the para-language recognition process based on the acoustic feature information to acquire the para-language information which is the information not included in the utterance text among the voice signals of the user 1. Paralinguistic information is information such as the speaker's intention, attitude, and emotion that is not transcribed by voice recognition processing.

For example, as shown in FIG. 3 described above, acoustic analysis is performed on the utterance "I can return at 8 o'clock today", and the amount of change in the fundamental frequency is detected. It is then determined whether the fundamental frequency has risen by a certain value or more at the end of the utterance. If it has, that is, if the pitch rises at the end of the utterance, it is determined that the user 1 intends to ask a question. The acoustic analysis recognition processing unit 114 generates para-language information indicating whether or not the speaker (user 1) intends to ask a question. There are various other examples of para-language information, and the details will be described later.
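The question determination described above can be sketched as a comparison between the pitch at the end of the utterance and the pitch of the preceding part. This is a minimal sketch under assumptions: the function name `is_question`, the share of the utterance treated as its "end" (`tail_ratio`), and the 20 Hz threshold are all hypothetical, since the text only specifies "a certain value".

```python
from typing import Sequence


def is_question(pitch_hz: Sequence[float],
                tail_ratio: float = 0.2,           # assumed share treated as the end
                rise_threshold_hz: float = 20.0    # assumed "certain value"
                ) -> bool:
    """Return True when the pitch at the end of the utterance rises by the threshold or more.

    pitch_hz holds one fundamental-frequency estimate per voiced frame.
    """
    if len(pitch_hz) < 2:
        return False
    tail_len = max(1, int(len(pitch_hz) * tail_ratio))
    body, tail = pitch_hz[:-tail_len], pitch_hz[-tail_len:]
    if not body:
        return False
    body_mean = sum(body) / len(body)
    tail_mean = sum(tail) / len(tail)
    return tail_mean - body_mean >= rise_threshold_hz


rising = [120.0] * 8 + [150.0, 170.0]   # pitch rises at the end
flat = [120.0] * 10                     # no rise
print(is_question(rising), is_question(flat))
```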

The acoustic analysis recognition processing unit 114 provides the acquired para-language information to the candidate generation unit 115. The acoustic analysis recognition processing unit 114 may provide the acoustic analysis information to the candidate generation unit 115 together with the para-language information.

The image recognition processing unit 118 extracts para-language information by performing image recognition processing on the image pickup signal input from the image input unit 117. For example, the shape of a person's mouth when asking a question is learned in advance, and whether or not the user 1 intends to ask a question is acquired as para-language information by image recognition from the image pickup signal of the user 1. Alternatively, a gesture of the user 1 tilting his or her head may be recognized by image recognition, and whether or not the user 1 intends to ask a question may be acquired as para-language information. In addition, the shape of the mouth of the user 1 may be recognized by image recognition, and the time between utterances of the user 1 (the non-utterance time) may be acquired as para-language information. The emotion at the time of utterance may also be acquired as para-language information by recognizing the facial expression of the user 1 in the image.

The candidate generation unit 115 generates candidates for a message that the user 2 returns to the user 1, based on the para-language information provided by at least one of the acoustic analysis recognition processing unit 114 and the image recognition processing unit 118, the Intent and Entity provided by the natural language understanding processing unit 113, and the utterance text provided by the voice recognition processing unit 112. That is, candidates for a reply message to the utterance-text message of the user 1 are generated. The candidate generation unit 115 uses the reply phrase DB 116 to generate the reply message candidates.

The reply phrase DB 116 stores a plurality of reply phrases according to the Intent, para-language information, and the like. Among multiple reply phrases, there is also a reply phrase that includes a slot that stores the Entity as a parameter.

The candidate generation unit 115 acquires the reply phrase corresponding to the Intent and the para-language information from the reply phrase DB 116. If a slot exists in the acquired reply phrase, the value of Entity is stored in the slot, and the reply phrase containing the Entity value is used as a candidate for the reply message. If the slot does not exist, the obtained reply phrase is used as a candidate for the reply message.

FIG. 4 shows an example of the reply phrase DB116. The reply phrase DB 116 includes an Intent, para-language information, and a reply phrase. In the example shown, the Intent is "GoHome" and the paralinguistic information is "question" or "non-question". The <> in the reply phrase column represents the slot. In this example, the question reply phrase contains slots, but some or all of the question reply phrases may not contain slots. Further, although the non-question reply phrase does not include a slot, a slot may be included in a part or all of the non-question reply phrase. The number of reply phrases for questions is three, and the number of reply phrases for non-questions is three, but each of them may be less than three or four or more.

As a specific example, an example of generating a reply message candidate when the para-language information indicates that the user 1 intends to ask a question is shown. It is assumed that the values of Intent and Entity provided by the natural language understanding processing unit 113 are “GoHome” and “8 o'clock”. The candidate generation unit 115 acquires the reply phrase corresponding to the “GoHome” and the “question” from the reply phrase DB 116. As a result, three reply phrases "Yeah, I can go back to <>", "I'm going to go back later than <>", and "I'm going to go back earlier than <>" are acquired. The Entity value "8 o'clock" is stored in the slot included in each reply phrase. As a result, three reply message candidates, "Yeah, I can go home at 8 o'clock", "I'm going to go home later than 8 o'clock", and "I'm going to go home earlier than 8 o'clock", can be obtained.

On the other hand, when the para-language information indicates that the user 1 does not intend to ask a question, candidates for a reply message are generated as follows. It is assumed that the values of Intent and Entity provided by the natural language understanding processing unit 113 are “GoHome” and “8 o'clock”. The candidate generation unit 115 acquires the reply phrases corresponding to “GoHome” and “non-question” from the reply phrase DB 116. As a result, three reply phrases, "OK", "Come back sooner", and "You can come back later", are acquired. Since none of these reply phrases contains a slot, the acquired reply phrases are used as the reply message candidates as they are.
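The lookup and slot filling described for the question and non-question cases can be sketched as follows. The dictionary layout, the label strings, and the function name `generate_candidates` are assumptions for illustration; the phrases are the ones given in the text, with `<>` as the slot.

```python
from typing import Dict, List, Optional, Tuple

SLOT = "<>"

# Keyed by (Intent, para-language label); phrases follow the example in the text.
REPLY_PHRASE_DB: Dict[Tuple[str, str], List[str]] = {
    ("GoHome", "question"): [
        "Yeah, I can go back to <>",
        "I'm going to go back later than <>",
        "I'm going to go back earlier than <>",
    ],
    ("GoHome", "non-question"): [
        "OK",
        "Come back sooner",
        "You can come back later",
    ],
}


def generate_candidates(intent: str, para_label: str,
                        entity_value: Optional[str]) -> List[str]:
    """Look up reply phrases and store the Entity value in any slot."""
    candidates: List[str] = []
    for phrase in REPLY_PHRASE_DB.get((intent, para_label), []):
        if SLOT in phrase:
            if entity_value is None:
                continue  # assumption: skip slotted phrases when no Entity is available
            phrase = phrase.replace(SLOT, entity_value)
        candidates.append(phrase)
    return candidates


print(generate_candidates("GoHome", "question", "8 o'clock"))
print(generate_candidates("GoHome", "non-question", "8 o'clock"))
```

Keying the table on the pair of Intent and para-language label keeps the lookup a single dictionary access; a relational table with the same two columns would serve equally well.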

When many reply phrases match the Intent and the para-language information, the number of reply phrases acquired may be limited to an upper limit. The upper limit may be adjustable through a setting by the user 1. When the number of reply phrases is limited in this way, the reply phrases to be acquired may be selected at random. Alternatively, a priority may be set for each reply phrase, and reply phrases up to the upper limit may be selected according to priority. The reply phrases may also be selected in any other way. Each reply phrase may be given a number that identifies it.
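The two selection strategies mentioned above, random selection and selection by priority, can be sketched in one helper. The function name `limit_phrases`, the representation of a phrase as a (text, priority) pair, and the convention that a smaller priority value ranks higher are assumptions for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def limit_phrases(phrases: Sequence[Tuple[str, int]],
                  upper_limit: int,
                  by_priority: bool = True,
                  rng: Optional[random.Random] = None) -> List[Tuple[int, str]]:
    """Pick at most upper_limit phrases and number them from 1.

    phrases holds (text, priority) pairs; a smaller value ranks higher (assumption).
    """
    limit = max(0, upper_limit)
    pool = list(phrases)
    if by_priority:
        chosen = sorted(pool, key=lambda item: item[1])[:limit]
    else:
        chosen = (rng or random).sample(pool, min(limit, len(pool)))
    return [(number, text)
            for number, (text, _priority) in enumerate(chosen, start=1)]


# The priorities below are hypothetical values for illustration.
phrases = [("OK", 2), ("Come back sooner", 1), ("You can come back later", 3)]
print(limit_phrases(phrases, upper_limit=2))
```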

The candidate generation unit 115 provides the acquired reply message candidate to the transmission / reception unit 119. When there are a plurality of reply message candidates, a number for identifying the plurality of candidates may be set. As described above, the number may be preset in the reply phrase DB 116, or may be dynamically determined by the candidate generation unit 115. The method of determination may be random or any other method. The candidate generation unit 115 may provide the utterance text to the transmission / reception unit 119. However, in the present embodiment, the utterance text is provided from the voice recognition processing unit 112 to the transmission / reception unit 119.

The transmission / reception unit 119 generates transmission message information including a reply message candidate and an utterance text, and transmits the generated transmission message information to the receiving device 201. The transmitted message information is packetized according to the communication protocol used in the communication network 301, and is transmitted to the receiving device 201 as packetized data, for example. Information necessary for delivering data to the receiving device 201, for example, the address (IP address, etc.) of the receiving device 201 is preset. When the transmission / reception unit 119 performs wireless communication, the transmission / reception unit 119 may include at least one antenna.
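The transmitted message information can be pictured as a small structured payload that is serialized before being packetized. The sketch below uses JSON purely as an assumption; the field names `utterance_text`, `reply_candidates`, `number`, and `text` are hypothetical, and the disclosure does not specify a wire format.

```python
import json
from typing import List


def build_transmission_message(utterance_text: str, candidates: List[str]) -> bytes:
    """Serialize the utterance text and numbered reply candidates (field names are assumptions)."""
    message = {
        "utterance_text": utterance_text,
        "reply_candidates": [
            {"number": number, "text": text}
            for number, text in enumerate(candidates, start=1)
        ],
    }
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def parse_transmission_message(payload: bytes) -> dict:
    """Inverse of build_transmission_message, as the receiving side might apply it."""
    return json.loads(payload.decode("utf-8"))


payload = build_transmission_message(
    "I can return at 8 o'clock today",
    ["OK", "Come back sooner", "You can come back later"],
)
print(parse_transmission_message(payload)["reply_candidates"][0])
```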

In the present embodiment, the information processing device 101 transmits the transmitted message information directly to the receiving device 201. Alternatively, a server may be interposed between the information processing device 101 and the receiving device 201, with each device communicating with the server. In this case, the server provides a service that mediates the transmission and reception of messages between the information processing device 101 and the receiving device 201. The information processing device 101 transmits the transmitted message information to the server, and the receiving device 201 accesses the server to acquire the transmitted message information addressed to it.
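The server-mediated variant can be sketched as a mailbox per destination: the sending device posts a message, and the receiving device later fetches whatever is addressed to it. The class `RelayServer`, its methods, and the destination string are hypothetical and held in memory only; a real service would sit behind a network protocol and add authentication.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class RelayServer:
    """In-memory stand-in for a server that mediates message exchange (hypothetical)."""

    def __init__(self) -> None:
        self._mailboxes: Dict[str, Deque[dict]] = defaultdict(deque)

    def post(self, destination: str, message: dict) -> None:
        """Called by the sending device to leave a message for a destination."""
        self._mailboxes[destination].append(message)

    def fetch(self, destination: str) -> List[dict]:
        """Called by the receiving device; returns and clears its pending messages."""
        box = self._mailboxes[destination]
        messages = list(box)
        box.clear()
        return messages


server = RelayServer()
server.post("receiving-device-201", {"utterance_text": "I can return at 8 o'clock today"})
print(server.fetch("receiving-device-201"))
```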

FIG. 5 is a block diagram of the receiving device 201. The receiving device 201 includes a voice input unit 211, a voice recognition processing unit 212, a natural language understanding processing unit 213, an image input unit 214, an image recognition processing unit 215, a selection result recognition unit 217, a transmission / reception unit 218, an operation input unit 216, an image output processing unit 221, a display unit 222, an audio output processing unit 231, and an audio output unit 232.

These elements included in the receiving device 201 are composed of hardware, software, or a combination thereof. The hardware includes a processor such as a CPU or a dedicated circuit as an example. Further, a clock for counting the time may be provided in the information processing device 101.

The transmission / reception unit 218 receives the transmission message information transmitted from the information processing device 101. When the transmission / reception unit 218 performs wireless communication, the transmission / reception unit 218 includes at least one antenna. The transmission / reception unit 218 provides the received transmission message information to the image output processing unit 221. Further, the transmission / reception unit 218 provides the received transmission message information to the voice output processing unit 231.

The image output processing unit 221 displays the transmission message information provided by the transmission / reception unit 218 on the display unit 222. The display unit 222 is a display device that displays data, such as a liquid crystal display device, an organic EL display device, or a plasma display device, but is not limited to these examples.

FIGS. 6 (A) and 6 (B) show examples in which the transmitted message information is displayed on the display unit 222. In each of FIGS. 6 (A) and 6 (B), the utterance text included in the transmitted message information and the reply message candidates are displayed. FIG. 6 (A) shows the reply message candidates generated when the para-language information indicates that the user 1 has a question intention. FIG. 6 (B) shows the reply message candidates generated when the para-language information indicates that the user 1 has no question intention. Each candidate is given a number from 1 to 3 set by the information processing device 101. The content of the utterance text (received message) is the same in both FIGS. 6 (A) and 6 (B).

User 2 can grasp the content of the message spoken by user 1 by looking at the displayed utterance text. At this time, user 2 can also infer whether the utterance text is a question or a non-question by looking at the displayed reply message candidates. For example, by looking at the reply message candidates in FIG. 6 (A), user 2 can determine that the utterance text "I can return at 8 o'clock today" is a question that the user 1 is asking the user 2.

The voice output processing unit 231 converts the utterance text included in the transmitted message information provided by the transmission / reception unit 218 into a voice signal, and provides the voice signal to the voice output unit 232. The voice output unit 232 reproduces the voice signal and outputs the voice. Further, the voice output processing unit 231 may convert the reply message candidates included in the transmitted message information into a voice signal and provide the voice signal to the voice output unit 232. The voice output unit 232 reproduces this voice signal and outputs the voice. The voice output of the utterance text may be performed when instructed by the user 2, or may be performed as soon as the transmitted message information is received. The voice output of the reply message candidates may be performed in order immediately after the voice output of the utterance text, or may be performed when instructed by the user 2.

The voice output instruction of the user 2 can be given by the operation input unit 216, the image input unit 214, or the voice input unit 211, which will be described later. For example, when the operation input unit 216 or the display unit 222 includes a touch panel, the user 2 may instruct the voice output by operating the touch panel.

FIG. 7 shows an example in which the voice output button is arranged on the touch panel screen on which the transmitted message information of FIG. 6 (A) is displayed. A voice output button 10 is arranged near the utterance text column. By clicking the voice output button 10, the voice of the spoken text is output. Similarly, for the reply message candidates, the voice output buttons 11 to 13 are arranged near the columns of each candidate. By clicking the voice output buttons 11 to 13, the voice of each candidate is output.

By outputting the utterance text and the reply message candidates as voice, even a user 2 with weak vision can accurately recognize the content of the utterance text and select an appropriate reply message from the reply message candidates.

The user 2 who is presented with a reply message candidate via the display unit 222 and the voice output unit 232 selects a reply message from the presented reply message candidates. The user 2 can use at least one of the operation input unit 216, the voice input unit 211, and the image input unit 214 in order to select the reply message.

The operation input unit 216 is a circuit or an input device for inputting a user's operation signal to the receiving device 201. Examples of the operation input unit 216 include a touch panel, a keyboard, a mouse, and buttons provided on the main body of the device. The user 2 inputs an instruction to select a reply message from the reply message candidates. For example, when the operation input unit 216 or the display unit 222 includes a touch panel, a reply message is selected by operating the touch panel. For example, in FIG. 6 (A) or FIG. 6 (B), the user 2 touches, from among the plurality of reply message candidates, the candidate to be sent as the reply.

The operation input unit 216 provides the selection result recognition unit 217 with information for identifying the selected reply message. The information for identifying the selected reply message may be the number of the selected reply message, if the reply message is numbered. Alternatively, it may be the coordinates in the display area where the selected reply message is arranged on the display unit 222. Other information may be used.

The voice input unit 211 senses the voice of the message uttered by the user 2 and converts the sensed signal into an electric signal (voice signal). The voice input unit 211 is, for example, a microphone. The user 2 utters an instruction to select a reply message from the reply message candidates. For example, when each candidate is assigned a number, the user 2 utters the number of the reply message to be selected. Alternatively, the user 2 may read aloud the text of the reply message to be selected. The voice input unit 211 provides the voice recognition processing unit 212 with a voice signal obtained by converting an utterance, such as an instruction, of the user 2.

The voice input unit 211 may start operating when the user 2 gives an instruction to start sound collection, and may stop operating when the user 2 gives an instruction to end the sound collection or after the input of the instruction is completed. As an example, the sound collection start instruction and the sound collection end instruction may be given by pressing a sound collection button provided on the terminal body or the touch panel.

The voice recognition processing unit 212 converts the voice signal input from the voice input unit 211 into text data by performing voice recognition processing on it. The voice recognition processing unit 212 provides the text obtained by the conversion to the natural language understanding processing unit 213. Alternatively, the text may be provided directly to the selection result recognition unit 217.

The natural language understanding processing unit 213 estimates the utterance intention (Intent) and the target of that intention (Entity) by performing natural language understanding processing on the text provided by the voice recognition processing unit 212. For example, when the user 2 replies "Reply with No. 2", "Reply" is estimated as the Intent and "No. 2" is estimated as the Entity. The type of Entity is, for example, an answer number. The natural language understanding processing unit 213 provides the estimated Intent and Entity to the selection result recognition unit 217.
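As an illustration of the Intent / Entity output described above, the following rule-based sketch handles only the "Reply with No. 2" style of utterance. It is not the natural language understanding method of the embodiment, which is not specified. The function name `understand`, the regular expression, the `answer_number` label, and the choice to store the Entity value as an integer are assumptions made for this example.

```python
import re

# Hypothetical pattern: a reply verb followed by "No. <digits>".
_REPLY_PATTERN = re.compile(r"\breply\b.*?\bno\.?\s*(\d+)", re.IGNORECASE)


def understand(text):
    """Return (intent, entities) for a selection utterance, or (None, [])."""
    match = _REPLY_PATTERN.search(text)
    if match is None:
        return None, []
    # The Entity type is an answer number; the value is the spoken number.
    entity = {"type": "answer_number", "value": int(match.group(1))}
    return "Reply", [entity]


intent, entities = understand("Reply with No. 2")
```

A practical system would more likely use a trained model than a single pattern; the point here is only the shape of the result handed to the selection result recognition unit 217.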

The image input unit 214 senses the image of the user 2 and obtains an image pickup signal. The image input unit 214 is a sensing device such as a camera. The image may be a still image or a moving image. As an example, the image input unit 214 takes images at regular time intervals. The object of imaging may be any part of the body of the user 2, such as the face, upper body, and whole body of the user 2. The image input unit 214 provides the captured image data (imaging signal) to the image recognition processing unit 215.

User 2 may make a gesture to select a reply message from the reply message candidates. For example, when each candidate is assigned a number, the operation of drawing the number of the reply message to be selected with the index finger is performed. In addition, the operation of reading the number by moving the mouth may be performed, or the operation of reading the text of the message to be replied by moving the mouth may be performed (the voice may or may not be spoken).

The image recognition processing unit 215 identifies the gesture of the user 2 by performing image recognition processing on the image pickup signal input from the image input unit 214. For example, the locus of the movement of the index finger of the user 2 is analyzed, and the number specified by the user 2 is specified. As an example, a database in which a user's trajectory and a number are associated with each other is prepared in advance. A number matching the locus specified in the image recognition process is specified as a number instructed by the user 2. Further, the movement of the mouth of the user 2 may be identified, and the number specified by the user may be specified. The image recognition processing unit 215 provides the result of image recognition to the selection result recognition unit 217. For example, when the second reply message is specified, the information for specifying the second message is provided.

The method of specifying the instruction of the user 2 is not limited to the above method. For example, when the information processing device 101 is mounted on the AR glass or the VR head-mounted display, the reply message or the number of the reply message that the user gazes at may be detected by eye tracking.

The selection result recognition unit 217 identifies the reply message selected by the user 2 based on the information input from at least one of the operation input unit 216, the image recognition processing unit 215, and the natural language understanding processing unit 213. The selection result recognition unit 217 generates selection result information including information for identifying the identified reply message, and provides the selection result information to the transmission / reception unit 218.

The information that identifies the reply message may be the text of the reply message itself, or may be the number if the reply message is numbered.

The transmission / reception unit 218 transmits the selection result information provided by the selection result recognition unit 217 to the information processing device 101. For example, the selection result information is packetized according to the communication protocol used in the communication network 301 and transmitted to the information processing device 101 as packet data. Information necessary for delivering data to the information processing device 101, for example, the address of the information processing device 101 (IP address, etc.), may be set in advance in the transmission / reception unit 218, or may be identified from the header of a packet received from the information processing device 101.

When the above-mentioned server intervenes between the information processing device 101 and the receiving device 201, the receiving device 201 transmits the selection result information to the server. In this case, the information processing device 101 accesses the server and acquires the selection result information addressed to the information processing device 101.

In FIG. 2, the transmission / reception unit 119 of the information processing device 101 receives the selection result information transmitted from the reception device 201, and provides the received selection result information to the image output processing unit 121. Further, the transmission / reception unit 119 provides the received selection result information to the voice output processing unit 131. When the selection result information includes the number of the reply message, the reply message corresponding to the number is specified, and the text of the specified reply message is provided to the image output processing unit 121 and the voice output processing unit 131.
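The lookup from a received number back to the reply text can be sketched as follows. This is an illustrative sketch only: the function name `resolve_reply_text` and the dictionary keys `"number"` and `"text"` are hypothetical, and it assumes that the information processing device 101 has kept the list of candidates it transmitted earlier.

```python
def resolve_reply_text(selection_result, sent_candidates):
    """Return the reply text identified by selection result information.

    `selection_result` is a hypothetical dict holding either "number"
    (1-based candidate number) or "text" (the reply message itself).
    `sent_candidates` is the candidate list the sender transmitted earlier.
    """
    if "text" in selection_result:
        return selection_result["text"]
    number = selection_result["number"]
    if not 1 <= number <= len(sent_candidates):
        raise ValueError(f"unknown candidate number: {number}")
    # Candidate numbers start at 1, list indices at 0.
    return sent_candidates[number - 1]
```

Sending only the number keeps the selection result small, at the cost of requiring the sender to remember which candidates belong to which message.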

The image output processing unit 121 displays the selection result information provided by the transmission / reception unit 119 on the display unit 122. The display unit 122 is a display device that displays data, such as a liquid crystal display device, an organic EL display device, or a plasma display device, but is not limited to these examples.

FIG. 8 shows an example in which the selection result information is displayed on the display unit 122. The reply message specified by the selection result information is arranged below the utterance text of the user 1. As a result, the user 1 can confirm the message uttered by the user 1 and the message returned by the user 2. Since the reply message of the user 2 reflects the para-language information of the utterance of the user 1, the user 1 can feel that the intention of his / her utterance is correctly transmitted to the user 2.

The voice output processing unit 131 converts the text of the reply message provided by the transmission / reception unit 119 into a voice signal, and outputs the voice signal as voice from the voice output unit 132. The voice output may be performed according to the instruction of the user 1, or may be performed as soon as the selection result information is received. The instruction of the user 1 may be given by using the operation input unit (touch panel, keyboard, mouse, etc.), or by the image input unit 117 or the voice input unit 111.

User 1 can continue the dialogue with user 2 by continuously speaking after confirming the reply message of user 2.

FIG. 9 shows an example in which the user 1 continues the dialogue with the user 2 after the state of FIG. The utterance message of the user 1 and the reply message of the user 2 are added.

In the present embodiment, the receiving device 201 provided the text generated by the voice recognition processing unit 212 to the natural language understanding processing unit 213 in FIG. 5, but it may be provided to the selection result recognition unit 217. In this case, the natural language understanding processing unit 213 may be omitted from the receiving device 201. The selection result recognition unit 217 identifies the reply message selected by the user 2 based on the text provided by the voice recognition processing unit 212. For example, when the user 2 answers "No. 2", it is determined that "No. 2" is detected by keyword matching from the text and the No. 2 reply message is selected.
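A keyword-matching selection of this kind might look like the sketch below. The function name `select_by_keyword` and the regular expression are assumptions; the fallback that matches a candidate read aloud verbatim reflects the earlier remark that the user 2 may read the text of the reply message instead of its number.

```python
import re


def select_by_keyword(recognized_text, candidates):
    """Return the 1-based number of the selected candidate, or None.

    First look for a "No. N" keyword; failing that, check whether the
    user read out one of the candidate texts verbatim.
    """
    match = re.search(r"\bno\.?\s*(\d+)", recognized_text, re.IGNORECASE)
    if match:
        number = int(match.group(1))
        return number if 1 <= number <= len(candidates) else None
    lowered = recognized_text.lower()
    for number, candidate in enumerate(candidates, start=1):
        if candidate.lower() in lowered:
            return number
    return None
```

For instance, with three candidates on screen, the recognized text "No. 2" selects the second one, and a number outside the displayed range selects nothing.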

In the above description of the present embodiment, an example is shown in which it is determined whether or not the user's utterance text is intended as a question as paralinguistic information, and a reply message candidate is generated according to the result of the determination. In the following, a specific example of generating a reply message candidate will be shown using another example as para-language information.

[Specific example 1]
FIG. 10 is a diagram for explaining the specific example 1. It is assumed that user 1 utters "Tomorrow's meeting is okay at Yokohama Station at 10:30". In this case, the voice recognition processing unit 112 generates "Tomorrow's meeting is okay at Yokohama Station at 10:30" as the utterance text (voice recognition text). The natural language understanding processing unit 113 estimates "appointment" as the Intent of the utterance by performing natural language understanding processing on this utterance text. As Entities, it estimates tomorrow (date), 10:30 (time), and Yokohama Station (place). In each pair, the word outside the parentheses is the Entity value and the word in parentheses is the Entity type.

The acoustic analysis recognition processing unit 114 measures at least one of the utterance frequency, volume, utterance speed, and time interval before and after the utterance as acoustic feature information based on the voice signal of the user 1.
The following words are extracted from the utterance as emphasized words and treated as para-language information:
- Words with a relatively high frequency in the utterance, or words with a rising frequency at the end of the utterance
- Words with a relatively high volume in the utterance
- Words with a relatively slow utterance speed in the utterance
- Words with at least a certain amount of time before and after the word is uttered

As a specific example, when the user 1 utters "10:30" at a relatively high volume, the word at the part where "10:30" is uttered is extracted as the emphasized word. Further, when the user 1 utters "Yokohama Station" at a relatively slow speed, the word at the part where "Yokohama Station" is uttered is extracted as the emphasized word. Specifically, the signal at the relevant location in the voice signal may be converted into text by voice recognition to extract the word. Alternatively, the start time and end time of the emphasized word may be identified, and the text portion located between the start time and the end time in the utterance text may be identified as the emphasized word. Alternatively, the utterance text may be morphologically analyzed and decomposed into words, the frequency, volume, and the like of the signal corresponding to each morpheme (word) may be measured, and a word that meets the above criteria may be identified as a word emphasized by the user 1.
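The per-word approach (measure each word, then compare it with the rest of the utterance) can be sketched as below. This is a sketch under stated assumptions: the `WordFeatures` record, the function `emphasized_words`, and the thresholds `ratio`, `volume_margin_db`, and `min_pause` are hypothetical, per-word measurements are assumed to be available already, and the criterion of rising frequency at the end of the utterance is omitted for brevity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WordFeatures:
    text: str
    pitch_hz: float      # mean fundamental frequency over the word
    volume_db: float     # mean level over the word
    rate: float          # speaking rate, e.g. syllables per second
    pause_before: float  # seconds of silence before the word
    pause_after: float   # seconds of silence after the word


def emphasized_words(words, ratio=1.2, volume_margin_db=6.0, min_pause=0.3):
    """Return the texts of words that meet any emphasis criterion."""
    # "Relatively" high or slow is judged against the utterance average.
    avg_pitch = mean(w.pitch_hz for w in words)
    avg_volume = mean(w.volume_db for w in words)
    avg_rate = mean(w.rate for w in words)
    result = []
    for w in words:
        high_pitch = w.pitch_hz >= avg_pitch * ratio
        loud = w.volume_db >= avg_volume + volume_margin_db
        slow = w.rate <= avg_rate / ratio
        isolated = w.pause_before >= min_pause and w.pause_after >= min_pause
        if high_pitch or loud or slow or isolated:
            result.append(w.text)
    return result
```

With three words of equal pitch and rate where only "10:30" is about 10 dB louder than the others, only "10:30" is returned; the thresholds would need tuning on real speech.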

Further, the acoustic analysis recognition processing unit 114 detects the amount of change in the fundamental frequency by acoustically analyzing the voice signal of the utterance of the user 1 in the same manner as in the specific example (see FIG. 3) described above. At the end of the utterance, it is determined whether the fundamental frequency has increased by a certain value or more. If it has, that is, if the pitch rises at the end of the utterance, it is determined that the speaker intends to ask a question. The acoustic analysis recognition processing unit 114 generates para-language information indicating whether or not the speaker (user 1) intends to ask a question.
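The end-of-utterance pitch check can be illustrated as follows. The sketch assumes that a per-frame fundamental-frequency contour has already been extracted by some pitch tracker; the function name `has_question_intention`, the window length `tail_frames`, and the threshold `min_rise_hz` are hypothetical values chosen for the example, not values from the embodiment.

```python
def has_question_intention(f0_hz, tail_frames=5, min_rise_hz=20.0):
    """Return True if the pitch rises by at least `min_rise_hz` at the end.

    `f0_hz` is a per-frame fundamental-frequency contour; unvoiced frames
    are expected to be None and are skipped.
    """
    voiced = [f for f in f0_hz if f is not None]
    if len(voiced) < 2:
        return False
    # Compare the last voiced frame with the first frame of the tail window.
    tail = voiced[-tail_frames:]
    return tail[-1] - tail[0] >= min_rise_hz
```

The boolean result corresponds to the para-language information on whether the user 1 has a question intention.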

The candidate generation unit 115 generates reply message candidates using the reply phrase DB 116, based on the para-language information provided by at least one of the acoustic analysis recognition processing unit 114 and the image recognition processing unit 118 and on the Intent and Entity provided by the natural language understanding processing unit 113. The para-language information includes the emphasized words and information on whether or not user 1 intends to ask a question.

FIG. 11 shows an example of the reply phrase DB 116 according to the specific example 1. The reply phrase DB 116 includes an Intent, an Entity type of the emphasized word, para-language information on whether or not there is a question intention, and a reply phrase. In the example of the figure, the Intent is "appointment", and the Entity type of the emphasized word is "time", "place", and the like. The <> in the reply phrase column represents the slot.

The candidate generation unit 115 determines the Entity type of the emphasized word. For example, when the emphasized word is "10:30", the Entity type whose Entity value is "10:30" is set as the Entity type of the emphasized word.

The candidate generation unit 115 then reads reply phrases from the reply phrase DB 116 based on the Entity type of the emphasized word, the Intent, and the presence or absence of a question intention.

If the Entity type is "time", the Intent is "appointment", and user 1 has a question intention, three reply phrases are acquired: "Yeah, <> is okay", "Let's make it earlier", and "Let's make it later". The Entity value "10:30" is stored in the slot included in the first phrase, "Yeah, <> is okay". As a result, three reply message candidates are obtained: "Yeah, 10:30 is okay", "Let's make it earlier", and "Let's make it later".

On the other hand, if the Entity type is "place", the Intent is "appointment", and user 1 has a question intention, three reply phrases are acquired: "Yeah, <> is okay", "Let's move closer", and "Let's move farther". The Entity value "Yokohama Station" is stored in the slot included in the first phrase, "Yeah, <> is okay". As a result, three reply message candidates are obtained: "Yeah, Yokohama Station is okay", "Let's move closer", and "Let's move farther".

The above describes the case where the user 1 has a question intention. When the user 1 has no question intention, reply message candidates are obtained in the same manner.
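The lookup and slot filling for specific example 1 can be sketched as a dictionary keyed by Intent, Entity type of the emphasized word, and question intention. The names `REPLY_PHRASE_DB` and `generate_candidates`, the dictionary representation, and the Entity type label "place" are assumptions for illustration; only the rows spelled out in the text are included, and the English wording of the filled phrase follows literal slot substitution.

```python
# Rows keyed by (Intent, Entity type of emphasized word, question intention).
# Only the rows spelled out in the text are filled in.
REPLY_PHRASE_DB = {
    ("appointment", "time", True): [
        "Yeah, <> is okay", "Let's make it earlier", "Let's make it later"],
    ("appointment", "place", True): [
        "Yeah, <> is okay", "Let's move closer", "Let's move farther"],
}


def generate_candidates(intent, entities, emphasized_word, question):
    """Look up reply phrases and fill each slot with the emphasized value."""
    # Entity type of the emphasized word: the type whose value matches it.
    entity_type = next(
        (t for t, v in entities.items() if v == emphasized_word), None)
    phrases = REPLY_PHRASE_DB.get((intent, entity_type, question), [])
    # Phrases without a slot are returned unchanged.
    return [p.replace("<>", emphasized_word) for p in phrases]


entities = {"date": "tomorrow", "time": "10:30", "place": "Yokohama Station"}
candidates = generate_candidates("appointment", entities, "10:30", True)
```

Rows for the case without a question intention would be added to the same table with `False` as the third key element.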

The candidate generation unit 115 provides the acquired reply message candidate to the transmission / reception unit 119. The transmission / reception unit 119 transmits transmission message information including a reply message candidate and an utterance text to the receiving device 201. The display unit 222 of the receiving device 201 displays the transmitted message information received from the information processing device 101. Subsequent operations are the same as the above-mentioned operations.

FIGS. 12 (A) and 12 (B) show display examples of the transmitted message information according to the specific example 1. FIG. 12 (A) shows an example of the reply message candidates generated when the Entity type of the emphasized word is "time" and the user 1 intends to ask a question. FIG. 12 (B) shows an example of the reply message candidates generated when the Entity type of the emphasized word is "place" and the user 1 intends to ask a question. The content of the utterance text is the same in both FIGS. 12 (A) and 12 (B). The user 2 on the receiving side selects a message to be used for the reply from the three candidates by using at least one of the operation input unit 216, the image input unit 214, and the voice input unit 211.

[Specific example 2]
FIG. 13 is a diagram for explaining a specific example 2. It is assumed that user 1 utters "What is good for dinner, hamburger curry ramen". In this case, "What is good for dinner, hamburger curry ramen" is output from the voice recognition processing unit 112 as the utterance text (voice recognition text). The natural language understanding processing unit 113 estimates "MenuSelect" (menu selection) as the Intent of the utterance by performing natural language understanding processing on this utterance text. It estimates "hamburger", "curry", and "ramen" as Entity values. Although the estimation of the Entity type is omitted here, examples of types include "meat", "roux", and "noodle".

The acoustic analysis recognition processing unit 114 measures the duration of the silent sections of the utterance as acoustic feature information based on the voice signal of the utterance of the user 1. The acoustic analysis recognition processing unit 114 identifies each silent section whose length is equal to or greater than a threshold value after the start of the utterance. The portion sandwiched between two silent sections is converted into text, and the obtained text is identified as a word (item) intended by the user. Alternatively, the start time and end time of the portion sandwiched between the silent sections may be identified, and the text portion located between the start time and the end time in the utterance text may be identified as a word intended by the user. Alternatively, the utterance text may be morphologically analyzed and decomposed into words, and the word corresponding to the portion sandwiched between the silent sections may be identified as the word intended by the user 1. The word intended by the user may also be identified by other methods. The acoustic analysis recognition processing unit 114 provides the candidate generation unit 115 with information for identifying the identified word (item) as para-language information.

Specifically, for example, assume that in the utterance text "What is good for dinner, hamburger curry ramen" of user 1, a silent section with a length equal to or longer than the threshold is detected between "What is good for dinner" and "hamburger", between "hamburger" and "curry ramen", and after "curry ramen". In this case, "hamburger" and "curry ramen" are extracted as words sandwiched between silent sections.

Alternatively, it is assumed that a silent section having a length longer than the threshold is detected between "what is good for dinner" and "hamburger curry", between "hamburger curry" and "ramen", and after "ramen". In this case, "hamburger curry" and "ramen" are extracted as words sandwiched between silent sections.

Alternatively, assume that a silent section with a length equal to or longer than the threshold is detected between "what is good for dinner" and "hamburger", between "hamburger" and "curry", between "curry" and "ramen", and after "ramen". In this case, "hamburger", "curry", and "ramen" are extracted as words sandwiched between silent sections.
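The three cases above differ only in where the long silences fall. The following sketch groups time-stamped words by the silences between them and keeps the groups that have a long silence on both sides. It assumes word-level start and end times are available from the recognizer; the function name `items_between_silences`, the tuple layout, and the 0.4 second threshold are hypothetical.

```python
def items_between_silences(words, utterance_end, min_silence=0.4):
    """Return word groups that have a long silence on both sides.

    `words` is a list of (text, start_sec, end_sec) in spoken order;
    `utterance_end` is when recording stopped.
    """
    groups = []  # each entry: [texts, silence_before_flag]
    for i, (text, start, _end) in enumerate(words):
        gap = start - words[i - 1][2] if i > 0 else 0.0
        if i == 0 or gap >= min_silence:
            # A long silence (or the utterance start) opens a new group.
            groups.append([[text], i > 0])
        else:
            groups[-1][0].append(text)
    items = []
    for g_index, (texts, silence_before) in enumerate(groups):
        if g_index < len(groups) - 1:
            silence_after = True  # groups are split only at long silences
        else:
            silence_after = utterance_end - words[-1][2] >= min_silence
        if silence_before and silence_after:
            items.append(" ".join(texts))
    return items
```

The opening phrase "What is good for dinner" is dropped because it is preceded by the start of the utterance rather than by a silent section.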

The candidate generation unit 115 generates reply message candidates using the reply phrase DB 116, based on the Intent and Entity values provided by the natural language understanding processing unit 113 and the para-language information provided by at least one of the acoustic analysis recognition processing unit 114 and the image recognition processing unit 118.

FIG. 14 shows an example of the reply phrase DB 116 according to the specific example 2. The reply phrase DB 116 includes an Intent, para-language information or Entity type, and a reply phrase. In the example of the figure, the Intent is "MenuSelect", and the para-language information or the Entity type is "any", meaning that any value is accepted. The <> in the reply phrase column represents the slot. In the example shown in the figure, the reply phrase consists only of the slot, but it may include text other than the slot, such as "<> is good".

When the para-language information indicates "hamburger" and "curry ramen", the candidate generation unit 115 determines that the user 1 has intentionally spoken "hamburger" and "curry ramen" as items. Therefore, among the three Entity values "hamburger", "curry", and "ramen", "curry" and "ramen" are combined to form "curry ramen". The candidate generation unit 115 reads the reply phrase "<>" corresponding to MenuSelect from the reply phrase DB 116, stores "hamburger" and "curry ramen" in the slot respectively, and obtains two reply message candidates, "hamburger" and "curry ramen".

When the para-language information indicates "hamburger curry" and "ramen", it is determined that the user 1 has intentionally spoken "hamburger curry" and "ramen" as items. Therefore, among the three Entity values "hamburger", "curry", and "ramen", "hamburger" and "curry" are combined to form "hamburger curry". The candidate generation unit 115 reads the reply phrase "<>" corresponding to MenuSelect from the reply phrase DB 116, stores "hamburger curry" and "ramen" in the slot respectively, and obtains two reply message candidates, "hamburger curry" and "ramen".

When the para-language information indicates "hamburger", "curry", and "ramen", it is determined that the user 1 has intentionally spoken "hamburger", "curry", and "ramen" as items. These three correspond to the three Entity values "hamburger", "curry", and "ramen". The candidate generation unit 115 reads the reply phrase "<>" corresponding to MenuSelect from the reply phrase DB 116, stores "hamburger", "curry", and "ramen" in the slot respectively, and obtains three reply message candidates, "hamburger", "curry", and "ramen".

In the first two examples of the above three examples, an example of combining two Entity values was shown, but there are cases where the Entity values are separated. For example, suppose that the Entity values are "hamburger curry" and "ramen", and the para-language information indicates "hamburger", "curry", and "ramen". In this case, the Entity value "hamburger curry" is separated into "hamburger" and "curry". Then, three reply message candidates, "hamburger", "curry", and "ramen", are obtained.

In some cases, both separation and combination are performed. For example, suppose that the Entity value is "hamburger curry" and "ramen", and the para-language information indicates "hamburger" and "curry ramen". In this case, the Entity value "hamburger curry" is separated into "hamburger" and "curry", and the separated "curry" is combined in front of the Entity value "ramen" to form "curry ramen". Then, two reply message candidates, "hamburger" and "curry ramen", are obtained.
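Combination and separation can both be treated as one regrouping step: flatten the Entity values into words, then re-join them along the item boundaries indicated by the para-language information. The sketch below assumes whitespace-separated words, which holds for the English rendering used here; for Japanese text a morphological analyzer would be needed instead. The function name `regroup_entities` is hypothetical.

```python
def regroup_entities(entity_values, paralanguage_items):
    """Regroup Entity values so they match the intended items.

    Entity values are split into words and re-joined along the item
    boundaries given by the para-language information. If the two word
    sequences differ, the Entity values are returned unchanged.
    """
    entity_words = [w for value in entity_values for w in value.split()]
    item_sizes = [len(item.split()) for item in paralanguage_items]
    item_words = [w for item in paralanguage_items for w in item.split()]
    if entity_words != item_words:
        return list(entity_values)
    regrouped, pos = [], 0
    for size in item_sizes:
        regrouped.append(" ".join(entity_words[pos:pos + size]))
        pos += size
    return regrouped
```

Each regrouped value would then be stored in the "<>" slot of the MenuSelect reply phrase to form one reply message candidate.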

The candidate generation unit 115 provides the acquired reply message candidate to the transmission / reception unit 119. The transmission / reception unit 119 transmits transmission message information including a reply message candidate and an utterance text to the receiving device 201. The display unit 222 of the receiving device 201 displays the transmitted message information received from the information processing device 101.

FIGS. 15 (A), 15 (B), and 15 (C) show display examples of the transmitted message information according to the specific example 2. FIG. 15 (A) shows an example of the reply message candidates when it is determined that the user 1 intentionally uttered the three items "hamburger", "curry", and "ramen". FIG. 15 (B) shows an example of the reply message candidates when it is determined that the user 1 intentionally uttered the two items "hamburger curry" and "ramen". FIG. 15 (C) shows an example of the reply message candidates when it is determined that the user 1 intentionally uttered the two items "hamburger" and "curry ramen". The utterance text is the same in all of FIGS. 15 (A) to 15 (C). The user 2 on the receiving side selects a message to be used for the reply from the three or two candidates by using at least one of the operation input unit 216, the image input unit 214, and the voice input unit 211.

[Specific example 3]
FIG. 16 is a diagram for explaining a specific example 3. Suppose user 1 utters "Today I will be back by 8 o'clock". In this case, the voice recognition processing unit 112 outputs "Today I will be back by 8 o'clock" as the utterance text (voice recognition text). The natural language understanding processing unit 113 estimates "GoHome" (homecoming) as the Intent of the utterance by performing natural language understanding processing on this utterance text. It estimates "8 o'clock" as the Entity value and "time" as the Entity type.

The acoustic analysis recognition processing unit 114 acoustically analyzes the audio signal of the utterance of the user 1 and calculates a feature amount such as a frequency spectrum or a degree of sharpness as acoustic feature information. Based on the calculated feature amount, information representing the emotion of the user 1 is acquired as para-language information (referred to as para-language information 1). Using teacher data including voice features and user emotions, a model for estimating user emotions from voice features is generated in advance by a method such as machine learning. The emotion of the user 1 is estimated based on the generated model and the calculated feature amount.

Further, the acoustic analysis recognition processing unit 114 detects the amount of change in the fundamental frequency by acoustically analyzing the voice signal of the utterance of the user 1 in the same manner as in the specific example (see FIG. 3) described above. At the end of the utterance, it is determined whether the fundamental frequency has increased by a certain value or more. If it has, that is, if the pitch rises at the end of the utterance, it is determined that the speaker intends to ask a question. The acoustic analysis recognition processing unit 114 generates para-language information (referred to as para-language information 2) indicating whether or not the speaker (user 1) intends to ask a question.

The candidate generation unit 115 generates reply message candidates using the reply phrase DB 116, based on the Intent and Entity provided by the natural language understanding processing unit 113 and the para-language information provided by at least one of the acoustic analysis recognition processing unit 114 and the image recognition processing unit 118. The para-language information includes the emotion of the user 1 and information on whether or not the user 1 intends to ask a question.

FIG. 17 shows an example of the reply phrase DB 116 according to the specific example 3. The reply phrase DB 116 includes an Intent, para-language information 1, para-language information 2, and a reply phrase. In the example of the figure, Intent is “GoHome”, para-language information 1 is emotion (joy, normality, anger), and para-language information 2 is the presence or absence of question intention.

When the para-language information 1 indicates "joy" and the para-language information 2 indicates that the user 1 has an intention to ask a question, the candidate generation unit 115 reads out "I can go home!" and "I can't do it" as reply phrases. Each read reply phrase is used as a candidate for a reply message.

When the para-language information 1 indicates "normal" and the para-language information 2 indicates that the user 1 has an intention to ask a question, the candidate generation unit 115 reads out "Yeah, I can return by <>" and "It seems to be later than <>" as reply phrases. Then, the Entity value "8 o'clock" is stored in the slot of each phrase, and "Yeah, I can return by 8 o'clock" and "It seems to be later than 8 o'clock" are obtained as candidates for the reply message.

When the para-language information 1 indicates "anger" and the para-language information 2 indicates that the user 1 has an intention to ask a question, the candidate generation unit 115 reads out "Yes, I will return by <>" and "I'm sorry, it will be late" as reply phrases. Then, the Entity value "8 o'clock" is stored in the slot of the former reply phrase. As a result, "Yes, I'll be back by 8 o'clock" and "I'm sorry, I'll be late" are obtained as candidates for the reply message.

The above describes examples in which the para-language information 2 indicates that the user 1 has an intention to ask a question. When it indicates that the user 1 has no intention of asking a question, candidates for a reply message are obtained in the same manner.
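
The lookup and slot filling of Specific Example 3 can be sketched as a dictionary keyed by Intent, para-language information 1 (emotion), and para-language information 2 (question intention). The table layout, the function name, and the slot marker `<TIME>` are assumptions for illustration; the phrases are the ones quoted in the example.

```python
# Hypothetical in-memory form of the reply phrase DB 116.
# Key: (Intent, emotion, has_question_intention)
REPLY_PHRASE_DB = {
    ("GoHome", "joy", True): [
        "I can go home!",
        "I can't do it",
    ],
    ("GoHome", "normal", True): [
        "Yeah, I can return by <TIME>",
        "It seems to be later than <TIME>",
    ],
    ("GoHome", "anger", True): [
        "Yes, I will return by <TIME>",
        "I'm sorry, it will be late",
    ],
}

SLOT = "<TIME>"  # assumed slot marker, not taken from the disclosure


def generate_candidates(intent, emotion, is_question, entity_value, db=REPLY_PHRASE_DB):
    """Read the matching reply phrases and store the Entity value in any slot."""
    phrases = db.get((intent, emotion, is_question), [])
    return [phrase.replace(SLOT, entity_value) for phrase in phrases]


candidates = generate_candidates("GoHome", "normal", True, "8 o'clock")
```

Phrases without a slot pass through unchanged, which matches the handling of the "joy" row and of the second "anger" phrase.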

The candidate generation unit 115 provides the reply message candidate to the transmission / reception unit 119. The transmission / reception unit 119 transmits transmission message information including a reply message candidate and an utterance text to the receiving device 201. The display unit 222 of the receiving device 201 displays the received transmission message information.

FIGS. 18(A), 18(B), and 18(C) show examples of displaying the transmitted message information. FIG. 18(A) shows an example of reply message candidates generated when the emotion of the user 1 is "joy" and the user 1 has an intention to ask a question. FIG. 18(B) shows an example of reply message candidates generated when the emotion of the user 1 is "normal" and the user 1 has an intention to ask a question. FIG. 18(C) shows an example of reply message candidates generated when the emotion of the user 1 is "anger" and the user 1 has an intention to ask a question. In all of FIGS. 18(A) to 18(C), the utterance text is the same. The user 2 on the receiving side selects a message to be used for the reply from the two candidates by using at least one of the operation input unit 216, the image input unit 214, and the voice input unit 211.

In Specific Example 3, the user's emotion is identified as para-language information, but the user's urgency, frankness, severity, tension, and the like may be identified instead, and reply message candidates may be generated according to the identified urgency and the like.

For example, the degree of urgency can be determined from the speech speed of the user. The higher the degree of urgency, the more hurried the user is. Therefore, short, concise phrases and phrases of normal length are prepared in the reply phrase DB. If the urgency is high, the short, concise phrases are used to generate candidates for the reply message. If the urgency is low, the phrases of normal length are used to generate candidates for the reply message. The urgency is high when, for example, the utterance speed is equal to or higher than a threshold value, and low when the utterance speed is less than the threshold value.

Also, the degree of frankness of the utterance can be determined from at least one of the vowel length at the end of a word (equivalent to the long vowel mark "ー") and the volume increase (equivalent to the exclamation mark "!"). The higher the degree of frankness, the closer the relationship can be said to be. Therefore, phrases using casual language and phrases using polite language are prepared in the reply phrase DB. If the degree of frankness is high, the casual phrases are used to generate candidates for the reply message. If the degree of frankness is low, the polite phrases are used to generate candidates for the reply message. The degree of frankness is high when the vowel length is above a threshold or the volume increase is above a threshold, and low when the vowel length is below the threshold or the volume increase is below the threshold.

Also, the severity can be judged from whether or not laughter is detected. If laughter is detected, the severity can be said to be low. Therefore, phrases using casual language and phrases using polite language are prepared in the reply phrase DB. If laughter is detected, a casual phrase is used to generate a candidate for the reply message. If no laughter is detected, a polite phrase is used to generate a candidate for the reply message.

In addition, the degree of tension of the speaker can be determined from at least one of the pitch, speed, and volume of the utterance and the time between utterances. The degree of tension can be said to be high when the pitch is high, the speaking speed is fast, the volume is low, and the time between utterances is long. Therefore, phrases with a gentle, casual tone and phrases with a strict tone are stored in the reply phrase DB. When the degree of tension is high, candidates for the reply message are generated using phrases with a gentle, casual tone so that the speaker (user 1) relaxes. When the degree of tension is low, candidates for the reply message are generated using phrases with a strict tone so that the speaker (user 1) feels more tension. The degree of tension is high when, for example, the pitch is equal to or more than a threshold value, the speaking speed is equal to or more than a threshold value, the volume is less than a threshold value, or the time between utterances is equal to or more than a threshold value. The degree of tension is low when, for example, the pitch is below the threshold value, the speaking speed is below the threshold value, the volume is above the threshold value, or the time between utterances is below the threshold value.
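
The four threshold rules above can be summarized in a short sketch. All field names, units, and threshold values below are hypothetical, and the mapping to a phrase style keeps each rule separate, as in the description, rather than combining them.

```python
from dataclasses import dataclass


@dataclass
class VoiceFeatures:
    speech_rate: float             # e.g. syllables per second (assumed unit)
    final_vowel_length: float      # seconds, vowel at the end of a word
    final_volume_increase: float   # dB rise at the end of a word
    laughter_detected: bool
    pitch: float                   # Hz
    volume: float                  # dB
    gap_between_utterances: float  # seconds


@dataclass
class Thresholds:
    # Illustrative values only; the disclosure does not specify any numbers.
    speech_rate: float = 6.0
    final_vowel_length: float = 0.3
    final_volume_increase: float = 6.0
    pitch: float = 220.0
    volume: float = 50.0
    gap_between_utterances: float = 2.0


def classify(f, t=Thresholds()):
    """Apply the threshold rules for urgency, casual speech, severity, and tension."""
    return {
        "urgency_high": f.speech_rate >= t.speech_rate,
        "casual_high": (f.final_vowel_length >= t.final_vowel_length
                        or f.final_volume_increase >= t.final_volume_increase),
        "severity_low": f.laughter_detected,
        "tension_high": (f.pitch >= t.pitch
                         or f.speech_rate >= t.speech_rate
                         or f.volume < t.volume
                         or f.gap_between_utterances >= t.gap_between_utterances),
    }


def phrase_style(levels):
    """Map each judged level to the kind of reply phrase to read from the DB."""
    return {
        "length": "short" if levels["urgency_high"] else "normal",
        "register_from_word_ending": "casual" if levels["casual_high"] else "polite",
        "register_from_laughter": "casual" if levels["severity_low"] else "polite",
        "tone": "gentle" if levels["tension_high"] else "strict",
    }


features = VoiceFeatures(7.0, 0.1, 1.0, True, 180.0, 60.0, 0.5)
style = phrase_style(classify(features))
```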

A model for estimating the urgency and the like from voice feature amounts may be generated in advance by machine learning, using teacher data that includes voice feature amounts and the user's urgency and the like. Examples of voice feature amounts include utterance speed, vowel length and volume increase at the end of a word, presence or absence of detected laughter, utterance pitch, volume, and time between utterances. In this case, the urgency, frankness, severity, tension, and the like of the user 1 are estimated based on the generated model and the feature amounts calculated from the voice signal of the user 1.

FIG. 19 is a flowchart of an example of an operation in which the candidate generation unit 115 of the information processing device 101 generates a candidate for a reply message. Based on the para-language information supplied from the acoustic analysis recognition processing unit 114, it is determined whether or not the utterance intention of the user 1 is a question (S11).

When the utterance intention of the user 1 is a question (YES in S11), it is determined, based on the Intent provided by the natural language understanding processing unit 113, whether the question is intended to select an item from a plurality of items (item selection) (S12).

In the case of item selection (YES in S12), the utterance text is separated at the item delimiter position based on the para-language information. Each word separated by a delimiter position is specified as an item intended by the user 1 (S13). A message containing each of the specified items is generated as a candidate for a reply message (S13). After that, the candidate of the reply message is transmitted to the receiving device 201 together with the utterance text via the transmission / reception unit 119.

If it is determined in step S12 that the intention of user 1 is not item selection (NO in S12), it is determined whether there are a plurality of Entity that can be the target of the answer to the question (S14). For example, if the number of Entity provided by the natural language understanding processing unit 113 is a plurality, it is determined that there are a plurality of Entity that can be the target of the answer. If the number of Entity provided by the natural language understanding processing unit 113 is one, it is determined that the number of Entity that can be the target of the answer is singular.

When it is determined that there are a plurality of Entity to be answered (YES in S14), the word emphasized by the user 1 is specified based on the para-language information (S15). Further, based on the specified word, the Entity to be the target of the answer to the question is specified among the plurality of Entity (S16). The reply phrase for the case where the utterance intention of the user 1 is a question and which corresponds to the type of the specified Entity is specified in the reply phrase DB 116, and candidate reply messages are generated based on the specified reply phrase (S16).

When it is determined in step S14 that the number of Entity that can be the target of the answer is singular (NO in S14), the reply phrase for the case where the utterance intention of the user 1 is a question and which corresponds to the type of the single Entity is likewise acquired from the reply phrase DB 116 (S16).

On the other hand, when it is determined in step S11 that the utterance intention of the user 1 is not a question, the reply phrase when the utterance intention is a non-question intention is acquired from the reply phrase DB 116 (S17).

It is determined whether the reply phrase acquired in step S16 or step S17 has variations depending on emotion, urgency, frankness, severity, or tension (S18). For example, if the reply phrase DB 116 has a column for emotion, urgency, frankness, severity, or tension, it is determined that the phrase has variations depending on that item. If there are variations of the phrase (YES in S18), the reply phrases are narrowed down to those that match the emotion, urgency, frankness, severity, or tension indicated by the para-language information. If there is no variation of the reply phrase (NO in S18), the reply phrase acquired in step S16 or step S17 is used as it is.

It is determined whether the reply phrase contains a slot (S20). If a slot is included, the Entity value is stored in the slot included in the reply phrase, and the result is used as a candidate for the reply message (S21). When a plurality of Entity values are provided by the natural language understanding processing unit 113, the stored Entity value is the one specified in step S15. When one Entity value is provided by the natural language understanding processing unit 113, the stored Entity value is that one Entity value. On the other hand, if the reply phrase does not include a slot, the reply phrase is used as a candidate for the reply message as it is. The reply message candidate is transmitted to the receiving device 201 together with the utterance text via the transmission / reception unit 119.
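
The flow of FIG. 19 can be traced with the following sketch, in which comments mark the corresponding steps. The data shapes (Entity as a dict with `type` and `value`, DB rows as dicts with `phrase` and an optional `emotion`), the slot marker `<SLOT>`, and the decision to use each separated item itself as the S13 message are assumptions for illustration. How a slot is filled on the non-question path is also an assumption, since step S15 is not executed there.

```python
SLOT = "<SLOT>"  # assumed slot marker


def generate_reply_candidates(is_question, is_item_selection, items, entities,
                              emphasized_word, phrase_db, emotion):
    """Sketch of steps S11 to S21 for the candidate generation unit."""
    target = None
    if is_question:                                    # S11: question intention?
        if is_item_selection:                          # S12: item selection?
            return list(items)                         # S13: one candidate per item
        if len(entities) > 1:                          # S14: plural Entity?
            # S15/S16: pick the Entity whose value is the emphasized word
            target = next((e for e in entities if e["value"] == emphasized_word), None)
        elif entities:
            target = entities[0]
        key = ("question", target["type"] if target else None)  # S16
    else:
        key = ("non_question", None)                   # S17
        if len(entities) == 1:
            target = entities[0]

    rows = phrase_db.get(key, [])
    # S18: variations exist if the rows carry an emotion column
    if any(row.get("emotion") is not None for row in rows):
        rows = [row for row in rows if row.get("emotion") == emotion]

    candidates = []
    for row in rows:
        phrase = row["phrase"]
        if SLOT in phrase:                             # S20: slot present?
            if target is None:
                continue                               # nothing to store in the slot
            phrase = phrase.replace(SLOT, target["value"])  # S21
        candidates.append(phrase)
    return candidates


phrase_db = {
    ("question", "TIME"): [
        {"phrase": "Yeah, I can return by <SLOT>", "emotion": "normal"},
        {"phrase": "I'm sorry, it will be late", "emotion": "anger"},
    ],
    ("non_question", None): [{"phrase": "OK"}],
}
entities = [{"type": "TIME", "value": "8 o'clock"},
            {"type": "PLACE", "value": "Yokohama Station"}]
result = generate_reply_candidates(True, False, [], entities, "8 o'clock",
                                   phrase_db, "normal")
```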

(Effect of this embodiment)
According to the present embodiment, candidates for a reply message that capture the situation, attitude, or emotion of the speaker (user 1) are presented to the dialogue partner (user 2), so that the user 2 can send the user 1 a reply message that captures that situation, attitude, or emotion. As a result, the discrepancy in conveying intention that may occur with voice-recognized text alone is reduced, and smooth text communication becomes possible.

According to the present embodiment, since the dialogue partner is presented with candidates including the reply content expected by the speaker, the dialogue partner can reply quickly and with a low load simply by selecting a candidate.

This embodiment is effective as a dialogue support tool for the hearing impaired. When the voice utterance of a hearing person is converted into text and presented to a hearing-impaired person, mutual communication can be enhanced by presenting candidates for a reply message based on the para-language information of the hearing person. In particular, hearing-impaired people who have difficulty speaking can reply simply by selecting a candidate for the reply message, which enables a low-load and reliable response.

(Modification 1)
In the above-described embodiment, candidates for a reply message to be returned to the user 1 are created based on the para-language information of the user 1 and the utterance text, and the utterance text and the candidates for the reply message are transmitted to the user 2. In this modification, message candidates to be transmitted to the user 1 are created based on the para-language information of the user 1, and the created message candidates are transmitted to the user 2. The user 2 selects a message to be transmitted to the user 1 from the received candidates, and transmits the selected message.

For example, suppose that the user 1 does not speak for a while and performs an action indicating boredom, such as yawning. The image recognition processing unit 118 generates, from the image pickup signal of the user 1, para-language information indicating that the user 1 is bored. Based on the para-language information, the candidate generation unit 115 generates message candidates such as "Bored, huh?" and "Is it boring?" as candidates for the message to be transmitted to the user 1. The generated candidates are transmitted to the user 2. Since the user 1 has not spoken, no utterance text is transmitted. The user 2 selects one message from the presented candidates and sends it to the user 1. The user 1 sees the message received from the user 2 and takes an action such as resuming the dialogue. In this way, even when the user 1 is not speaking, a message reflecting the para-language information of the user 1 can be transmitted to the user 1, which promotes smooth communication between the user 1 and the user 2.

(Modification 2)
In the above-described embodiment, the user 2 replies to the user 1 by selecting one candidate from the candidates of the reply message. However, instead of selecting a candidate, the user 2 may think of the reply content himself / herself and directly compose and send a reply message with that content. The user 2 can indirectly infer the intention of the user 1 from the presented candidates of the reply message. Therefore, even when the user 2 directly composes and sends the reply message to the user 1, the response can take the intention of the user 1 into account.

For example, in the case of FIG. 6A described above, it can be inferred from the presented candidates that the user 1 intends to ask a question. Therefore, the user 2 can compose and send a reply message that takes the presented candidates into account, for example, "It may be later than 8 o'clock, but I will return as soon as possible."

FIG. 20 is a block diagram of the receiving device 201 according to the second modification. A path for outputting text from the voice recognition processing unit 212 to the transmission / reception unit 218 has been added. A reply message is created by inputting voice using the voice input unit 211 and converting the input voice signal into text by the voice recognition processing unit 212. The reply message is provided from the voice recognition processing unit 212 to the transmission / reception unit 218. A reply message is transmitted from the transmission / reception unit 218 to the information processing device 101. The user 2 may manually create the text of the reply message using the operation input unit 216.

(Modification 3)
The utterance text generated by the voice recognition processing unit 112 may be decorated according to the para-language information. The text decoration is performed by, for example, the candidate generation unit 115. For example, a question mark "?" may be added to the end of text that is determined to have a question intention because the pitch rises at the end of the utterance.

FIG. 21 is an explanatory diagram of Modification 3. As an example, when the utterance text is "I can return at 8 o'clock today" and it is determined that there is a question intention, a "?" is added at the end as shown in FIG. 21(A). As another example, the color of the entire text may be changed. An arrow pointing to the upper right may also be added at the end.

The appearance of the word emphasized by the user 1 in the utterance text may be changed. For example, the word may be made bold, enlarged in font size, or colored. As an example, if the utterance text is "Tomorrow's meeting is good at Yokohama Station at 10:30" and "10:30" is emphasized, the font size of "10:30" is increased as shown in FIG. 21(B).

Also, information that identifies the delimiter position, such as a comma ",", a dot "・", or a slash "/", may be added at the item delimiter position in the utterance text. As an example, if the utterance text is "What is good hamburger curry ramen for dinner" and "hamburger curry" and "ramen" are separated, the "/" symbol is added between "hamburger curry" and "ramen" as shown in FIG. 21(C).

Also, information that identifies the user's emotion indicated by the para-language information, such as a symbol or a picture (a pictogram such as a face, a stamp, etc.), may be added to the utterance text. As an example, when the utterance text is "I will be back by 8 o'clock today" and the user's emotion is judged to be anger, an emoticon expressing anger is added to the end of the utterance text as shown in FIG. 21(D). In this example, it is also determined that the user intends a question, and a question mark is added.

If the vowel length at the end of the word is long and the degree of frankness is high, the long vowel mark "ー" may be added (FIG. 21(E)). When the volume increase at the end of the word is large and the degree of frankness is high, an exclamation mark "!" may be added (FIG. 21(F)). When laughter is detected, information identifying that the user is laughing, for example, an emoticon such as a smile or a stamp, may be added (FIG. 21(G)).
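
The decorations of FIG. 21(A) to 21(G) can be sketched as a single function that rewrites the utterance text. The parameter names, the `<b>` markup standing in for an enlarged or bold font, the ASCII emoticons, and the order in which suffixes are appended are all assumptions for illustration.

```python
# Hypothetical marks; the disclosure mentions emoticons and stamps without fixing them.
EMOTION_MARKS = {"anger": " >:(", "joy": " :D"}
LAUGHTER_MARK = " :)"


def decorate(text, *, is_question=False, emphasized=None, items=None,
             emotion=None, long_final_vowel=False, final_volume_rise=False,
             laughter=False):
    """Decorate utterance text according to para-language information."""
    if items:
        # FIG. 21(C): mark item delimiter positions with "/"
        text = text.replace(" ".join(items), "/".join(items))
    if emphasized:
        # FIG. 21(B): change the appearance of the emphasized word
        text = text.replace(emphasized, f"<b>{emphasized}</b>")
    if long_final_vowel:
        text += "ー"    # FIG. 21(E): long vowel mark
    if final_volume_rise:
        text += "!"     # FIG. 21(F)
    if is_question:
        text += "?"     # FIG. 21(A)
    if emotion in EMOTION_MARKS:
        text += EMOTION_MARKS[emotion]  # FIG. 21(D)
    if laughter:
        text += LAUGHTER_MARK           # FIG. 21(G)
    return text


decorated = decorate("I will be back by 8 o'clock today",
                     is_question=True, emotion="anger")
```

A display layer would typically translate the emphasis markup into a font size or color change rather than showing the tags.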

By decorating the utterance text in this way, the degree of understanding of the utterance text of the receiving user 2 can be improved. Further, the decorated utterance text may be presented to the user 1 (speaker) via at least one of the display unit 122 and the voice output unit 132. As a result, the para-language information is fed back to the user 1, and the effect of inducing the user 1 to speak using the para-language information can be expected.

(Modification 4)
The utterance text of the user 1 (speaker) and the candidate text of the reply message may be translated into the language used by the receiving user 2, and the translated utterance text and the translated candidate text may be presented to the receiving user 2. As a result, the effects of the above-described embodiment and various modifications can be obtained even when text communication using voice recognition is performed between different languages. The utterance text may be translated after being decorated with the para-language information as described above. In particular, question marks are effective because adding them changes the linguistic translation result (the subject and syntax in an English translation).

FIG. 22 is a block diagram of the information processing device 101 according to the modified example 4. Translation processing unit 141 has been added. The translation processing unit 141 acquires the utterance text from the voice recognition processing unit 112 and translates the utterance text. The translated utterance text is provided to the transmission / reception unit 119. The translation processing unit 141 translates the candidate of the reply message and provides the translated candidate to the transmission / reception unit 119. The translation source language and the translation destination language are set in advance in the application or OS (Operating System) that realizes the processing of the present embodiment, and the language can be changed by the user 1. The transmission / reception unit 119 transmits the translated utterance text and the translated candidate to the receiving device 201. The transmission / reception unit 119 provides the translation processing unit 141 with the reply message specified by the selection result information received from the reception device 201, and the translation processing unit 141 translates the reply message into the original language. The transmission / reception unit 119 receives the reply message translated into the original language from the translation processing unit 141 and provides it to at least one of the image output processing unit 121 and the voice output processing unit 131.

According to this modification, the utterance text of the speaker and the candidates of the reply message are translated into the language used by the receiving user and presented. Thus, even a user who cannot understand the language used by the speaker can understand the speaker's intention and return an appropriate reply that captures the speaker's situation, attitude, and emotion.

(Modification 5)
In the initial state, the reply phrase DB stores, for example, phrases corresponding to Intent and para-language information. When a user freely composes and returns a message while communicating with another user, the reply phrase DB may be updated with the returned message. For example, one of the reply phrases corresponding to the candidates sent to the other user may be replaced with the message returned from the other user. Alternatively, the returned message may be added to the reply phrase DB as a new reply phrase.
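
The two update options of this modification (replace an existing phrase or add a new one) can be sketched as follows. The dictionary layout keyed by Intent and para-language information, and the function and parameter names, are assumptions for illustration.

```python
def update_reply_phrase_db(db, key, returned_message, replace_index=None):
    """Update the reply phrase DB with a freely composed reply.

    db: dict mapping (Intent, para-language info) keys to lists of phrases.
    replace_index: if given, the phrase at that position is overwritten;
    otherwise the returned message is appended as a new reply phrase.
    """
    phrases = db.setdefault(key, [])
    if replace_index is not None and 0 <= replace_index < len(phrases):
        phrases[replace_index] = returned_message
    elif returned_message not in phrases:
        phrases.append(returned_message)
    return db


db = {("GoHome", "joy", True): ["I can go home!", "I can't do it"]}
update_reply_phrase_db(db, ("GoHome", "joy", True), "On my way soon!")
update_reply_phrase_db(db, ("GoHome", "joy", True), "Not today, sorry", replace_index=1)
```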

(Modification 6)
The candidate generation unit 115 may generate candidates for a reply message according to the attribute information of the individual user 1 (speaker). For example, reply phrases for different ages are stored in the reply phrase DB 116, and the reply phrase to be used is selected according to the age of the speaker. For example, in the case of children or the elderly, reply phrases in plain language are used. In addition, the reply phrase to be used may be selected according to gender. Further, reply phrases in a plurality of dialects may be stored, and the reply phrase to be used may be selected according to the place of residence or the place of origin of the speaker. Here, the reply phrase is selected according to the attribute information of the speaker, but it may instead be selected according to the attribute information of the user 2 on the receiving side.

(Modification 7)
In the above-described embodiment and each modification, the user 2 on the receiving side is a human being, but the user 2 may be a computer system such as a voice agent or a chatbot instead of a human being. In this case, the system selects the candidate to be used for the reply from the reply candidates generated based on the para-language information.

As a result, it becomes possible for the user 1 to communicate with the system in consideration of the para-language information of the user 1, and the response from the system can also reflect the para-language information of the user 1. Therefore, smooth voice communication between the person and the system becomes possible.

(Modification 8)
In the above-described embodiment, an example is shown in which the user 1 makes an utterance intended to ask a question, an utterance emphasizing a specific word, an utterance enumerating a plurality of items, an utterance including emotions, and the like. When the user 1 actually makes an utterance in each case, the para-language information of the user 1 can be acquired more accurately by making an utterance that matches the operation algorithm of the para-language information acquisition unit 120. Therefore, a voice sample of the utterance in each case may be prepared so that the user 1 can learn a more appropriate utterance by listening to the voice sample.

FIGS. 23(A) to 23(D) show display examples of a menu for playing back voice samples. FIG. 23(A) shows an example of a screen for playing back a voice sample of an utterance intended to ask a question and a voice sample of an utterance not intended to ask a question. The user 1 can play a voice sample by clicking or touching its button. FIG. 23(B) shows an example of a screen for playing back, for an utterance including a time and a place, a voice sample that emphasizes the time and a voice sample that emphasizes the place. FIG. 23(C) shows an example of a screen for playing back a voice sample of an utterance that enumerates three items and a voice sample of an utterance that enumerates two items. FIG. 23(D) shows an example of voice samples for speaking with feelings of joy, normality, and anger.

By listening to the voice samples shown in FIGS. 23(A) to 23(D) before actually speaking, the user 1 can convey his / her para-language information more appropriately, so that it is reflected in the candidates for the reply message and in the decoration of the utterance text.

(Modification 9)
In the above-described embodiment, the audio signal and the video signal are used as the sensing signal for acquiring the para-language information, but other signals may be used as long as the signal is sensed from the user 1. For example, a wearable device may be used to measure the body temperature, blood pressure, heart rate, body movement, etc. of the user 1, and paralinguistic information may be acquired based on the measured information. For example, when the heart rate is high and the blood pressure is high, the paralinguistic information that the tension of the user 1 is high is acquired.

(Hardware configuration)
FIG. 24 shows an example of the hardware configuration of the information processing device 101. The receiving device 201 of FIG. 1 is also composed of the same hardware as the information processing device 101. The information processing device 101 of FIG. 2 is composed of a computer device 400. The computer device 400 includes a CPU 401, an input interface 402, a display device 403, a communication device 404, a main storage device 405, and an external storage device 406, which are connected to each other by a bus 407. As an example, the computer device 400 is configured as a smartphone, a tablet, a desktop PC (Personal Computer), or a notebook PC.

The CPU (Central Processing Unit) 401 executes an information processing program, which is a computer program, on the main storage device 405. The information processing program is a program that realizes each of the above-mentioned functional configurations of the information processing device 101. The information processing program may be realized not by one program but by a combination of a plurality of programs and scripts. Each functional configuration is realized by the CPU 401 executing the information processing program.

The input interface 402 is a circuit for inputting operation signals from input devices such as a keyboard, mouse, and touch panel to the information processing device 101.

The display device 403 displays the data output from the information processing device 101. The display device 403 is, for example, an LCD (liquid crystal display), an organic electroluminescence display, a CRT (cathode ray tube), or a PDP (plasma display), but is not limited thereto. The data output from the computer device 400 can be displayed on the display device 403.

The communication device 404 is a circuit for the information processing device 101 to communicate with the external device wirelessly or by wire. Data can be input from an external device via the communication device 404. The data input from the external device can be stored in the main storage device 405 or the external storage device 406.

The main storage device 405 stores an information processing program, data necessary for executing the information processing program, data generated by executing the information processing program, and the like. The information processing program is loaded into and executed on the main storage device 405. The main storage device 405 is, for example, RAM, DRAM, or SRAM, but is not limited thereto. The reply phrase DB of FIG. 2 may be built on the main storage device 405.

The external storage device 406 stores an information processing program, data necessary for executing the information processing program, data generated by executing the information processing program, and the like. These information processing programs and data are read out to the main storage device 405 when the information processing program is executed. The external storage device 406 is, for example, a hard disk, an optical disk, a flash memory, or a magnetic tape, but is not limited thereto. The reply phrase DB of FIG. 2 may be built on the external storage device 406.

The information processing program may be installed in the computer device 400 in advance, or may be stored in a storage medium such as a CD-ROM. Further, the information processing program may be uploaded on the Internet.

Further, the information processing device 101 may be configured by a single computer device 400, or may be configured as a system composed of a plurality of computer devices 400 connected to each other.

Note that the above-described embodiment shows an example for embodying the present disclosure, and the present disclosure can be implemented in various other forms. For example, various modifications, substitutions, omissions, or combinations thereof are possible without departing from the gist of the present disclosure. The forms in which such modifications, substitutions, omissions, etc. are made are also included in the scope of the invention described in the claims and the equivalent scope thereof, as are included in the scope of the present disclosure.

Further, the effects of the present disclosure described in the present specification are merely examples, and other effects may be obtained.

The present disclosure may also have the following structure.
[Item 1]
An information processing device comprising:
a para-language information acquisition unit that acquires para-language information of a first user based on a sensing signal obtained by sensing the first user;
a candidate generation unit that generates candidates for a message to be transmitted to the first user based on the para-language information; and
a transmission unit that transmits the message candidates to a device of a second user who exchanges messages with the first user.
[Item 2]
The sensing signal includes the voice signal of the first user.
A voice recognition processing unit for voice-recognizing the voice signal of the first user and acquiring the text data of the first message spoken by the first user is provided.
The candidate generation unit generates a candidate for the second message returned by the second user in response to the first message.
The information processing device according to item 1, wherein the transmission unit transmits the text data and the candidate for the second message to the device of the second user.
[Item 3]
The information processing device according to item 1 or 2, further comprising: a receiving unit that receives, from the device of the second user, a reply message including a candidate selected from the message candidates; and a display unit that displays the reply message received by the receiving unit.
[Item 4]
The information processing device according to any one of items 1 to 3, wherein the sensing signal includes a voice signal of the first user, and the para-language information acquisition unit acquires the para-language information based on acoustic feature information of the voice signal of the first user.
[Item 5]
The information processing device according to any one of items 1 to 4, wherein the sensing signal includes an image pickup signal of the first user, and the para-language information acquisition unit performs image recognition based on the image pickup signal of the first user and acquires the para-language information based on a result of the image recognition.
[Item 6]
The information processing device according to item 2, further comprising a natural language processing unit that estimates, based on the text data, an intention of the first user's utterance and a target of the utterance, wherein the candidate generation unit generates the candidate for the second message based on the intention of the utterance, the target of the utterance, and the para-language information.
[Item 7]
The information processing device according to item 6, further comprising a reply phrase database that stores a plurality of phrases, wherein the candidate generation unit identifies, in the reply phrase database, a phrase corresponding to the intention of the utterance and the para-language information, and generates the candidate for the second message based on the identified phrase.
[Item 8]
The information processing device according to item 2, wherein the para-language information includes information on whether or not the first user intends to ask a question.
[Item 9]
The information processing device according to item 2, wherein the para-language information includes information for identifying a word emphasized in the text data.
[Item 10]
The information processing device according to item 2, wherein the para-language information includes information for specifying a word delimiter position in the text data.
[Item 11]
The information processing device according to any one of items 1 to 10, wherein the para-language information includes information representing at least one of the emotion, urgency, severity, frankness, and tension of the first user.
[Item 12]
The information processing device according to item 2, wherein the candidate generation unit decorates the text data based on the para-language information, and the transmission unit transmits the decorated text data.
[Item 13]
The information processing device according to item 12, wherein the candidate generation unit adds a question mark to the end of the text data when the para-language information indicates that the first user intends to ask a question.
[Item 14]
The information processing device according to item 12 or 13, wherein the para-language information includes information for identifying a word emphasized by the first user in the text data, and the candidate generation unit changes the appearance of the emphasized word in the text data.
[Item 15]
The information processing device according to any one of items 12 to 14, wherein the para-language information includes information for specifying a word delimiter position in the text data, and the candidate generation unit adds information for identifying a delimiter between words to the delimiter position in the text data.
[Item 16]
The information processing device according to any one of items 12 to 15, wherein the para-language information includes information representing the emotion of the first user, and the candidate generation unit adds information for identifying the emotion to the text data.
[Item 17]
The information processing apparatus according to any one of items 12 to 16, further comprising a display unit for displaying the decorated text data.
[Item 18]
The information processing device according to item 2, further comprising a translation processing unit that translates the text data and the candidate for the second message into a language used by the second user, wherein the transmission unit transmits the translated text data and the translated candidate for the second message.
[Item 19]
The information processing device according to any one of items 1 to 18, wherein the second user is a human or computer system.
[Item 20]
An information processing method comprising: acquiring para-language information of a first user based on a sensing signal obtained by sensing the first user; generating, based on the para-language information, candidates for a message to be transmitted to the first user; and transmitting the message candidates to a device of a second user who exchanges messages with the first user.
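
As a non-limiting illustration of the text decoration described in Items 12 to 16 and the reply phrase lookup described in Item 7, the processing of the candidate generation unit might be sketched as follows. The sketch is written in Python, and every name in it (ParaLanguageInfo, decorate_text, REPLY_PHRASES, generate_candidates), the asterisk and vertical-bar markers, the bracketed emotion tag, and the sample phrases are assumptions introduced here for illustration only; none of them is taken from the present disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ParaLanguageInfo:
    """Hypothetical container for para-language information (illustrative only)."""
    is_question: bool = False                                      # cf. Item 8
    emphasized_words: Set[str] = field(default_factory=set)        # cf. Item 9
    delimiter_positions: List[int] = field(default_factory=list)   # cf. Item 10 (character offsets)
    emotion: Optional[str] = None                                  # cf. Item 11


def decorate_text(text: str, info: ParaLanguageInfo) -> str:
    """Decorate recognized text based on para-language information."""
    # Item 15: mark word delimiters. Offsets refer to the undecorated text,
    # so insert from the back to keep the earlier offsets valid.
    for pos in sorted(info.delimiter_positions, reverse=True):
        text = text[:pos] + "|" + text[pos:]
    # Item 14: change the appearance of emphasized words (assumed marker: asterisks).
    for word in sorted(info.emphasized_words):
        text = text.replace(word, f"*{word}*")
    # Item 13: append a question mark when a question is intended.
    if info.is_question and not text.endswith("?"):
        text += "?"
    # Item 16: append information identifying the emotion (assumed format: a tag).
    if info.emotion is not None:
        text += f" [{info.emotion}]"
    return text


# Hypothetical reply phrase database keyed by (utterance intention, emotion).
REPLY_PHRASES: Dict[Tuple[str, str], List[str]] = {
    ("invitation", "joy"): ["Sounds great!", "Sorry, I can't make it."],
    ("invitation", "neutral"): ["OK.", "Let me check my schedule."],
}


def generate_candidates(intention: str, info: ParaLanguageInfo) -> List[str]:
    """Item 7: look up phrases matching the intention and the para-language information."""
    key = (intention, info.emotion or "neutral")
    return list(REPLY_PHRASES.get(key, []))


info = ParaLanguageInfo(is_question=True, emphasized_words={"tomorrow"}, emotion="joy")
print(decorate_text("you are coming tomorrow", info))  # you are coming *tomorrow*? [joy]
print(generate_candidates("invitation", info))
```

In this sketch the delimiter offsets are applied before the other decorations because they are assumed to refer to character positions in the undecorated text data. An actual implementation could equally operate on tokens produced by the voice recognition processing unit.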

1: User (speaker), 2: User (dialogue partner), 101: Information processing device, 201: Receiver device, 111: Voice input unit, 112: Speech recognition processing unit, 113: Natural language understanding processing unit, 114: Speech analysis recognition processing unit, 115: Candidate generation unit, 116: Reply phrase database (DB), 117: Image input unit, 118: Image recognition processing unit, 119: Transmission/reception unit, 121: Image output processing unit, 122: Display unit, 131: Speech output processing unit, 132: Audio output unit, 141: Translation processing unit, 211: Voice input unit, 212: Speech recognition processing unit, 213: Natural language understanding processing unit, 214: Image input unit, 215: Image recognition processing unit, 216: Operation input unit, 217: Selection result recognition unit, 218: Transmission/reception unit, 221: Image output processing unit, 222: Display unit, 231: Voice output processing unit, 232: Voice output unit, 400: Computer device, 401: CPU, 402: Input interface, 403: Display device, 404: Communication device, 405: Main memory device, 406: External storage device, 407: Bus

Claims

An information processing device comprising: a para-language information acquisition unit that acquires para-language information of a first user based on a sensing signal obtained by sensing the first user; a candidate generation unit that generates, based on the para-language information, candidates for a message to be transmitted to the first user; and a transmission unit that transmits the message candidates to a device of a second user who exchanges messages with the first user.
The information processing device according to claim 1, wherein the sensing signal includes a voice signal of the first user, the information processing device further comprises a voice recognition processing unit that performs voice recognition on the voice signal of the first user and acquires text data of a first message spoken by the first user, the candidate generation unit generates a candidate for a second message with which the second user replies to the first message, and the transmission unit transmits the text data and the candidate for the second message to the device of the second user.
The information processing device according to claim 1, further comprising: a receiving unit that receives, from the device of the second user, a reply message including a candidate selected from the message candidates; and a display unit that displays the reply message received by the receiving unit.
The information processing device according to claim 1, wherein the sensing signal includes a voice signal of the first user, and the para-language information acquisition unit acquires the para-language information based on acoustic feature information of the voice signal of the first user.
The information processing device according to claim 1, wherein the sensing signal includes an image pickup signal of the first user, and the para-language information acquisition unit performs image recognition based on the image pickup signal of the first user and acquires the para-language information based on a result of the image recognition.
The information processing device according to claim 2, further comprising a natural language processing unit that estimates, based on the text data, an intention of the first user's utterance and a target of the utterance, wherein the candidate generation unit generates the candidate for the second message based on the intention of the utterance, the target of the utterance, and the para-language information.
The information processing device according to claim 6, further comprising a reply phrase database that stores a plurality of phrases, wherein the candidate generation unit identifies, in the reply phrase database, a phrase corresponding to the intention of the utterance and the para-language information, and generates the candidate for the second message based on the identified phrase.
The information processing device according to claim 2, wherein the para-language information includes information on whether or not the first user intends to ask a question.
The information processing device according to claim 2, wherein the para-language information includes information for identifying a word emphasized in the text data.
The information processing device according to claim 2, wherein the para-language information includes information for specifying a word delimiter position in the text data.
The information processing device according to claim 1, wherein the para-language information includes information representing at least one of the emotion, urgency, severity, frankness, and tension of the first user.
The information processing device according to claim 2, wherein the candidate generation unit decorates the text data based on the para-language information, and the transmission unit transmits the decorated text data.
The information processing device according to claim 12, wherein the candidate generation unit adds a question mark to the end of the text data when the para-language information indicates that the first user intends to ask a question.
The paralinguistic information includes information for identifying a word emphasized by the first user in the text data, and the candidate generation unit changes the appearance of the emphasized word in the text data. The information processing device described in.
The information processing device according to claim 12, wherein the para-language information includes information for specifying a word delimiter position in the text data, and the candidate generation unit adds information for identifying a delimiter between words to the delimiter position in the text data.
The information processing device according to claim 12, wherein the para-language information includes information representing the emotion of the first user, and the candidate generation unit adds information for identifying the emotion to the text data.
The information processing apparatus according to claim 12, further comprising a display unit for displaying the decorated text data.
The information processing device according to claim 2, further comprising a translation processing unit that translates the text data and the candidate for the second message into a language used by the second user, wherein the transmission unit transmits the translated text data and the translated candidate for the second message.
The information processing device according to claim 1, wherein the second user is a human or computer system.
An information processing method comprising: acquiring para-language information of a first user based on a sensing signal obtained by sensing the first user; generating, based on the para-language information, candidates for a message to be transmitted to the first user; and transmitting the message candidates to a device of a second user who exchanges messages with the first user.