WO2021166504A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program

Info

Publication number
WO2021166504A1
WO2021166504A1 (PCT application PCT/JP2021/001072)
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
user
response
proxy
information processing
Prior art date
Application number
PCT/JP2021/001072
Other languages
English (en)
Japanese (ja)
Inventor
千明 宮崎
沙也 鈴木
礼夢 肥田
正則 井上
Original Assignee
ソニーグループ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ソニーグループ株式会社
Publication of WO2021166504A1

Classifications

    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer (under G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements)
    • G10L 13/00 - Speech synthesis; text-to-speech systems
    • G10L 13/02 - Methods for producing synthetic speech; speech synthesisers
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates (under G10L 15/08 - Speech classification or search)
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This disclosure relates to an information processing device, an information processing method, and a program. More specifically, the present invention relates to an information processing device that speaks on behalf of the user, an information processing method, and a program.
  • A voice dialogue system analyzes user utterances input through a microphone and performs various processes and responses based on the analysis results.
  • One example of such a voice dialogue system is a car navigation device mounted on a vehicle.
  • The car navigation device analyzes utterances from a user such as the driver, for example utterances designating a destination, displays a route to the destination, and makes announcements (system utterances) for driving guidance.
  • A car navigation device that executes processing according to a user's utterance is described in, for example, Patent Document 1 (US Pat. No. 5,274,560).
  • However, when a user speaks to a voice dialogue system such as a car navigation device while driving, the driver's concentration on driving may be reduced and a dangerous situation may arise.
  • The present disclosure has been made in view of the above problems and provides an information processing device, an information processing method, and a program that speak on behalf of the user even when the user does not speak to a voice dialogue device, such as a car navigation device, that analyzes and processes user utterances.
  • A first aspect of the present disclosure is an information processing device having a data processing unit that inputs a device utterance output from a user utterance partner device, which is the user's dialogue partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance.
  • The data processing unit has a response necessity determination unit that determines whether a user proxy utterance is necessary, and a response generation unit that generates the user proxy utterance when the response necessity determination unit determines that one is necessary.
  • A second aspect of the present disclosure is an information processing method executed in an information processing device. The information processing device has a data processing unit that inputs the device utterance output from the user utterance partner device, which is the user's dialogue partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance. The data processing unit executes a response necessity determination process for determining whether a user proxy utterance is necessary, and a response generation process for generating the user proxy utterance when the response necessity determination process determines that one is necessary.
  • A third aspect of the present disclosure is a program that causes an information processing device to execute information processing. The information processing device has a data processing unit that inputs the device utterance output from the user utterance partner device, which is the user's dialogue partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance. The program causes the data processing unit to execute the response necessity determination process for determining whether a user proxy utterance is necessary, and the response generation process for generating the user proxy utterance when the response necessity determination process determines that one is necessary.
  • The program of the present disclosure can be provided, for example, by a storage medium or a communication medium that supplies the program in a computer-readable format to an information processing device or computer system capable of executing various program codes.
  • In this specification, a "system" is a logical set of a plurality of devices, and the constituent devices are not limited to those in the same housing.
  • According to the present disclosure, a device and a method are realized that generate and output a response utterance on behalf of the user in response to a device utterance output by an interactive device.
  • Specifically, the device utterance from the user utterance partner device, which is the user's dialogue partner, is input, and a user proxy utterance is generated and output on behalf of the user.
  • The device has a response necessity determination unit that determines whether a user proxy utterance is necessary, and a response generation unit that generates the user proxy utterance when it is determined that one is necessary.
  • The response generation unit generates and outputs a proxy utterance that reflects the user's intention by referring to, for example, user action history information.
  • The response necessity determination unit determines that a user proxy utterance is necessary when no user utterance is made within a predetermined threshold time from the completion timing of the device utterance.
  • As described above, a voice dialogue system is a system that analyzes user utterances input through a microphone and performs various processes and responses based on the analysis results.
  • One example of such a voice dialogue system is a car navigation device mounted on a vehicle.
  • The car navigation device analyzes utterances from a user such as the driver, for example utterances designating a destination, displays a route to the destination, and makes announcements (system utterances) for driving guidance.
  • However, when the user speaks to a voice dialogue system such as a car navigation device, the user may feel a mental burden and the dialogue with the system may be interrupted.
  • In addition, the driver's concentration on driving may be reduced and a dangerous situation may occur.
  • FIG. 1 shows the car navigation device as the user utterance partner device 10.
  • User 1 is, for example, a vehicle driver.
  • the user 1 makes the following user utterance in step S11.
  • User utterance "Tell me the route to Tokyo Tower”
  • In response, the user utterance partner device 10, which is a car navigation device, makes the following device utterance in step S12.
  • Device utterance "There are three candidates for the route to Tokyo Tower. Which one do you want?"
  • the user utterance partner device 10 is not limited to the car navigation device. Various interactive devices can be the user utterance partner device 10.
  • FIG. 2 shows an example in which the user utterance partner device 10 is an English conversation lesson device.
  • In this example, the user utterance partner device 10, which is an English conversation lesson device, makes the following device utterance in step S21.
  • Device utterance "Have you ever been to New York?"
  • If the user 1 cannot hear or understand the device utterance, no user utterance is made in step S22. As a result, the dialogue with the user utterance partner device 10, which is the English conversation lesson device, is interrupted, and the English conversation lesson device cannot proceed with its processing.
  • FIG. 3 shows an example in which the user utterance partner device 10 is used as a character dialogue device.
  • the character dialogue device is a device that enables various dialogues such as daily conversations between the user and the character.
  • the user utterance partner device 10 which is a character dialogue device, performs the following device utterance in step S31.
  • Device utterance "Where did you go today?"
  • If the user 1 cannot hear the device utterance, is wondering what to answer, or has forgotten the name of the place, no user utterance is made in step S32. As a result, the dialogue with the user utterance partner device 10, which is the character dialogue device, is interrupted.
  • In such cases, the user 1 cannot speak to the user utterance partner device 10, which is a voice dialogue system, in a timely manner.
  • As a result, the dialogue between the user 1 and the user utterance partner device 10 is interrupted, and the processing on the user utterance partner device 10 side is delayed.
  • The present disclosure prevents such a situation from occurring. That is, a user proxy utterance device that makes utterances on behalf of the user 1 is provided. Specific examples of the process executed by the user proxy utterance device of the present disclosure will be described with reference to FIG. 4 and subsequent figures.
  • FIG. 4 is a diagram showing the car navigation device described above with reference to FIG. 1 as the user utterance partner device 10.
  • FIG. 4 further shows the user proxy utterance device 20.
  • In FIG. 4, a smartphone is shown as the user proxy utterance device 20.
  • However, the user proxy utterance device 20 may be a device other than a smartphone, for example a PC or a tablet terminal. It can also be realized as another dedicated information processing device, or may be configured integrally with the user utterance partner device 10 such as a car navigation device.
  • When a smartphone is used, an application program that executes the user proxy utterance process is installed and used on the smartphone.
  • A processing example of the user proxy utterance device 20 (smartphone) shown in FIG. 4 will be described.
  • the user 1 is, for example, a driver of a vehicle.
  • the user 1 makes the following user utterance in step S11.
  • User utterance "Tell me the route to Tokyo Tower”
  • In response, the user utterance partner device 10, which is a car navigation device, makes the following device utterance in step S12.
  • Device utterance "There are three candidates for the route to Tokyo Tower. Which one do you want?"
  • The user proxy utterance device 20 (smartphone) shown in FIG. 4 makes the following user proxy utterance in place of the user 1 in step S13.
  • User proxy device utterance "Select the fastest route"
  • The user proxy device utterance made by the user proxy utterance device 20 is input to the user utterance partner device 10, that is, the car navigation device.
  • The car navigation device interprets this user proxy device utterance as a user utterance and executes processing according to it. That is, in the example of FIG. 4, the process of selecting the route that arrives earliest from the three routes to Tokyo Tower is performed.
  • In this way, because the user proxy utterance device 20 makes the proxy utterance in place of the user 1, the car navigation device can proceed with its processing according to the proxy utterance without delay.
  • FIG. 5 shows an example in which the user utterance partner device 10 described above with reference to FIG. 2 is an English conversation lesson device.
  • The user utterance partner device 10, which is an English conversation lesson device, makes the following device utterance in step S21.
  • Device utterance "Have you ever been to New York?"
  • In this example, the user 1 cannot hear or understand the device utterance and cannot immediately respond.
  • The user proxy utterance device 20 (smartphone) shown in FIG. 5 therefore makes the following user proxy utterance in place of the user 1 in step S22.
  • User proxy device utterance "Yes, I went to New York last summer."
  • The user proxy device utterance made by the user proxy utterance device 20 is input to the user utterance partner device 10, that is, the English conversation lesson device.
  • The English conversation lesson device interprets this user proxy device utterance as a user utterance and executes processing according to it. That is, in the example of FIG. 5, it can move on to the next device utterance.
  • FIG. 6 shows an example in which the user utterance partner device 10 described above with reference to FIG. 3 is a character dialogue device.
  • The user utterance partner device 10, which is a character dialogue device, makes the following device utterance in step S31.
  • Device utterance "Where did you go today?"
  • The user proxy utterance device 20 (smartphone) shown in FIG. 6 makes the following user proxy utterance in place of the user 1 in step S32.
  • User proxy device utterance "I went to the museum"
  • The user proxy device utterance made by the user proxy utterance device 20 is input to the user utterance partner device 10, that is, the character dialogue device.
  • The character dialogue device interprets this user proxy device utterance as a user utterance and executes processing according to it. That is, in the example of FIG. 6, it can move on to the next device utterance.
  • FIG. 7 is a diagram showing an example of the dialogue sequence executed by the user 1, the user utterance partner device 10 (car navigation device), and the user proxy utterance device 20 (smartphone) when the user utterance partner device 10 is the car navigation device described above with reference to FIG. 4. The utterance sequence from utterance No. 1 to No. 9 is shown.
  • Utterance No. 1 is a user utterance, and the user utters "navigate from here to Odaiba".
  • In response, the user utterance partner device (car navigation device) makes the following utterance.
  • This utterance No. 2 (user utterance partner device (car navigation device)) is an utterance including a question from the car navigation device to the user.
  • Here, the user proxy utterance device (smartphone) takes the place of the user and makes the following response.
  • Thereafter, a dialogue is carried out between the user utterance partner device (car navigation device) and the user proxy utterance device (smartphone).
  • The user does not participate in the dialogue from utterance No. 4 to utterance No. 8, and simply listens to the dialogue between the user utterance partner device (car navigation device) and the user proxy utterance device (smartphone).
  • While listening to the dialogue between the user utterance partner device (car navigation device) and the user proxy utterance device (smartphone), the user can, whenever an utterance does not match his or her intention, join the dialogue and convey that intention to the user utterance partner device (car navigation device).
  • As described above, the user proxy utterance device 20 (smartphone) carries on the dialogue with the user utterance partner device 10 in place of the user 1.
  • For this purpose, the storage unit of the user proxy utterance device 20 stores an input utterance correspondence response database that associates "sample input utterances" with "response utterances" that can be used for dialogue with the user utterance partner device 10.
  • The user proxy utterance device 20 selects and outputs the proxy utterance from the information registered in the input utterance correspondence response database.
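  • A minimal sketch of how such a database lookup might work is shown below. The entries, the word-overlap matching, and all names are illustrative assumptions; the disclosure does not specify a matching algorithm.

```python
from typing import Optional

# Hypothetical input utterance correspondence response database:
# "sample input utterances" are matched against the partner-device utterance text
# and the associated "response utterance" is returned as the proxy utterance.
RESPONSE_DB = [
    # (sample input utterance, response utterance) -- illustrative entries only
    ("Which route do you want?", "Select the fastest route"),
    ("Let's talk with me.", "Hello. What is your name?"),
]

def word_overlap(a: str, b: str) -> int:
    """Count words shared by two utterances (a deliberately naive similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def select_proxy_utterance(device_utterance: str) -> Optional[str]:
    """Return the response whose sample input utterance best matches the input text."""
    best = max(RESPONSE_DB, key=lambda entry: word_overlap(entry[0], device_utterance))
    return best[1] if word_overlap(best[0], device_utterance) > 0 else None

print(select_proxy_utterance("There are three candidates. Which one do you want?"))
```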
  • The utterances made by the user proxy utterance device 20 include proxy utterances that convey the user's intention to the user utterance partner device 10 (car navigation device) in place of the user 1.
  • In order to make a proxy utterance that reflects the intention of the user 1, the user proxy utterance device 20 (smartphone) needs to perform a process of estimating the intention of the user 1.
  • For this purpose, the user proxy utterance device 20 refers to, for example, the user action history information stored in the storage unit of the user proxy utterance device 20 (smartphone).
  • A specific example of the user action history information stored in the storage unit of the user proxy utterance device 20 will be described later.
  • History information of the past actions of the user 1 is stored in the storage unit of the user proxy utterance device 20 (smartphone); for example, places frequently visited by the user 1 and roads the user tends to use are recorded.
  • The user proxy utterance device 20 refers to this user action history information, estimates the user's intention, determines the proxy utterance for the user 1, and outputs it.
  • The user proxy utterance device 20 may also estimate the user's intention and determine the proxy utterance for the user 1 by referring not only to the user action history information but also to other information such as user profile information and information acquired from an external server such as an SNS server. Specific examples of these processes will be described later.
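  • Purely as an illustration, the following sketch shows one way intention could be estimated from stored action history and profile records; the data layout and field names are assumptions, not taken from the disclosure.

```python
# Hypothetical user action history / profile records used to bias proxy utterances;
# the field names and values are illustrative assumptions only.
USER_ACTION_HISTORY = {
    "frequent_destinations": {"Odaiba": "Odaiba Kaihin Koen Station"},
    "prefers_expressway": True,
}
USER_PROFILE = {"name": "Taro"}

def resolve_place(spoken_place: str) -> str:
    """Expand an ambiguous place name using the user's visit history."""
    return USER_ACTION_HISTORY["frequent_destinations"].get(spoken_place, spoken_place)

def route_preference() -> str:
    """Estimate the route preference from past driving behaviour."""
    return "use the expressway" if USER_ACTION_HISTORY["prefers_expressway"] else "use ordinary roads"

print(resolve_place("Odaiba"))   # -> "Odaiba Kaihin Koen Station"
print(route_preference())        # -> "use the expressway"
print(USER_PROFILE["name"])      # name usable for an utterance such as No. 4 in FIG. 9
```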
  • Utterance No. 3 is a user proxy utterance made after the user proxy utterance device 20 (smartphone) refers to the user action history information and estimates that when the user 1 says "Odaiba", the user 1 means "Odaiba Kaihin Koen Station".
  • Similarly, the user proxy utterance device 20 (smartphone) refers to the user action history information, confirms that the user 1 frequently uses the expressway, judges on that basis that the user 1 wants to use the expressway, and makes utterance No. 7 above on behalf of the user.
  • The user 1 can also listen to the dialogue carried out between the user utterance partner device 10 (car navigation device) and the user proxy utterance device 20 (smartphone), and, when an utterance does not match his or her intention, can join the dialogue at any time and convey that intention to the user utterance partner device 10 (car navigation device).
  • FIG. 8 is a diagram showing an example of the dialogue sequence executed by the user 1, the user utterance partner device 10 (English conversation lesson device), and the user proxy utterance device 20 (smartphone) when the user utterance partner device 10 is the English conversation lesson device described above with reference to FIG. 5. The utterance sequence from utterance No. 1 to No. 4 is shown.
  • Utterance No. 1 is an utterance of the user utterance partner device (English conversation lesson device); the English conversation lesson device utters "How was your holiday?".
  • This utterance No. 1 (user utterance partner device (English conversation lesson device)) is an utterance including a question to the user from the English conversation lesson device.
  • The user utterance partner device (English conversation lesson device) recognizes the utterance from the user proxy utterance device (smartphone) as a response from the user and, in response, makes the following further utterance.
  • This utterance No. 3 (User utterance partner device (English conversation lesson device)) is also an utterance including a question to the user from the English conversation lesson device.
  • In this way, the user can join the dialogue at any time he or she wants to speak while listening to the dialogue carried out between the user utterance partner device (English conversation lesson device) and the user proxy utterance device (smartphone).
  • In the dialogue sequence of FIG. 8, the user proxy utterance device 20 conveys information about the user's past behavior to the user utterance partner device 10 in place of the user 1.
  • To do this, the user proxy utterance device 20 needs to acquire information about the past actions of the user 1.
  • The user action history information is stored in the storage unit of the user proxy utterance device 20 (smartphone), and the data processing unit of the user proxy utterance device 20 (smartphone) refers to the user action history information stored in the storage unit and makes utterances according to the reference result. A specific example of the user action history information stored in the storage unit of the user proxy utterance device 20 (smartphone) will be described later.
  • FIG. 9 is a diagram showing an example of the dialogue sequence executed by the user 1, the user utterance partner device 10 (character dialogue device), and the user proxy utterance device 20 (smartphone) when the user utterance partner device 10 is the character dialogue device described above with reference to FIG. 6. The utterance sequence from utterance No. 1 to No. 6 is shown.
  • Utterance No. 1 is an utterance of the user utterance partner device (character dialogue device); the character dialogue device utters "Hello. Let's talk with me." Since the user was unsure how to answer this utterance and felt a mental burden, the user did not answer and left the response to the user proxy utterance device (smartphone). When the user does not respond and a predetermined time elapses, the user proxy utterance device (smartphone) takes the place of the user and makes the following utterance.
  • The user utterance partner device recognizes the utterance from the user proxy utterance device (smartphone) as a response from the user and, in response, makes the following further utterance.
  • This utterance No. 3 (user utterance partner device (character dialogue device)) is an utterance including a question from the character dialogue device to the user.
  • the user proxy utterance device (smartphone) makes the following utterances on behalf of the user.
  • Utterance No. 4 (user utterance) Taro
  • The user utterance partner device again recognizes the utterance from the user proxy utterance device (smartphone) as a response from the user and, in response, makes the following further utterance.
  • This utterance No. 5 (user utterance partner device (character dialogue device)) is also an utterance including a question from the character dialogue device to the user.
  • In this way, the user can join the dialogue at any time he or she wants to speak while listening to the dialogue carried out between the user utterance partner device (character dialogue device) and the user proxy utterance device (smartphone).
  • In utterance No. 2, the user proxy utterance device 20 (smartphone) makes an utterance asking the user utterance partner device 10 (character dialogue device) for its name. In utterance No. 4, the user proxy utterance device 20 (smartphone) answers with the user's name in place of the user 1.
  • In order to make the name-asking utterance No. 2 to the user utterance partner device 10 (character dialogue device), the user proxy utterance device 20 (smartphone) requires a function of uttering a response to the utterance from the user utterance partner device 10 (character dialogue device).
  • For this purpose, the storage unit of the user proxy utterance device 20 stores an input utterance correspondence response database in which "sample input utterances" and "response utterances" that can be used for dialogue with the user utterance partner device 10 are associated with each other.
  • The user proxy utterance device 20 refers to the information registered in the input utterance correspondence response database and makes utterance No. 2 of FIG. 9, that is, the utterance directed to the user utterance partner device 10 (character dialogue device).
  • A specific example of the input utterance correspondence response database will be described later.
  • In order to speak the user's name to the user utterance partner device 10 (character dialogue device) in utterance No. 4, the user proxy utterance device 20 (smartphone) must know the user's name.
  • The data processing unit of the user proxy utterance device 20 (smartphone) acquires the user's name from the user profile information stored in the storage unit of the user proxy utterance device 20 (smartphone) and, in utterance No. 4, makes the utterance that tells the user's name.
  • As described with reference to FIGS. 4 to 9, if the user proxy utterance device 20 of the present disclosure is used, most of the dialogue with the user utterance partner device 10 can be left to the user proxy utterance device 20, and the user can speak and join the dialogue at any time in situations where he or she actively wants to participate.
  • Situations in which the user wants to actively participate in the dialogue include, for example, a situation in which an utterance of the user proxy utterance device 20 needs to be corrected, a situation in which the user 1 does not feel a mental burden in speaking himself or herself, and a situation in which there is a topic that the user 1 wants to explain personally.
  • FIG. 10 is a diagram showing a configuration example of the user proxy utterance device 20 which is the information processing device of the present disclosure.
  • As shown in FIG. 10, the user proxy utterance device 20 has a voice input unit (microphone) 21, a data processing unit 22, a voice output unit (speaker) 23, a communication unit 24, a storage unit 25, and an image input unit (camera).
  • the voice input unit (microphone) 21 inputs the user utterance voice 51 emitted from the user 1 and the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device.
  • the voice data input by the voice input unit (microphone) 21 is input to the data processing unit 22.
  • The data processing unit 22 analyzes the input voice, determines whether or not the user proxy utterance device 20 should speak, and, if it determines that an utterance should be made, generates the utterance and outputs it to the voice output unit (speaker) 23.
  • The voice output unit (speaker) 23 outputs the utterance generated by the data processing unit 22 as the user proxy utterance 53.
  • The communication unit 24 performs communication for acquiring, from an external server or the user utterance partner device 10, the information that the data processing unit 22 needs in order to determine whether to make an utterance and to generate the utterance.
  • the storage unit 25 records information necessary for determining whether or not the data processing unit 22 needs to execute an utterance and for generating an utterance.
  • the user behavior history information described above, the input utterance correspondence response database, the user profile information, and the like are recorded.
  • The image input unit (camera) captures, for example, a face image or an eye image of the user 1.
  • the captured image is used, for example, for the line-of-sight direction analysis of the user 1, and is used for determination processing of whether or not the user utterance is an utterance made toward the user utterance partner device 10.
  • FIG. 11 is a diagram showing a detailed configuration of the data processing unit 22 and the storage unit 24 of the user proxy utterance device 20.
  • As shown in FIG. 11, the data processing unit 22 of the user proxy utterance device 20 includes an utterance detection unit 101, a voice recognition unit 102, a response necessity determination unit 103, a response generation unit 104, and a voice synthesis unit 105.
  • The storage unit 24 of the user proxy utterance device 20 stores the input utterance correspondence response database 121, the user action history information 122, and the user profile information 123.
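  • The components listed above could be wired together roughly as in the following Python sketch; the class and method names are illustrative assumptions, and the recognition and synthesis steps are left as stubs.

```python
# Rough sketch of the data processing pipeline of the user proxy utterance device:
# utterance detection -> voice recognition (with utterance subject tag) ->
# response necessity determination -> response generation -> voice synthesis.
from dataclasses import dataclass

@dataclass
class TaggedUtterance:
    text: str
    subject: str  # "user" or "partner_device" (utterance subject identification tag)

class UserProxyUtteranceDevice:
    def __init__(self, response_db, action_history, profile):
        self.response_db = response_db        # input utterance correspondence response DB 121
        self.action_history = action_history  # user action history information 122
        self.profile = profile                # user profile information 123

    def recognize(self, audio) -> TaggedUtterance:
        """Stub: ASR plus utterance subject identification."""
        raise NotImplementedError

    def response_needed(self, utterance: TaggedUtterance, silence_sec: float) -> bool:
        """Respond only to partner-device utterances left unanswered by the user."""
        return utterance.subject == "partner_device" and silence_sec >= 2.0

    def generate_response(self, utterance: TaggedUtterance) -> str:
        """Look up a proxy utterance; real logic would also use history and profile."""
        return self.response_db.get(utterance.text, "Yes")

    def synthesize(self, text: str) -> None:
        """Stub: TTS output through the speaker."""
        raise NotImplementedError
```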
  • the communication unit 25 executes communication with the user utterance partner device 10 and the external server 150.
  • The external server 150 is, for example, a server that stores information that the user proxy utterance device 20 can use to understand input utterances, information the user proxy utterance device 20 needs in order to make utterances, and the like.
  • Specifically, it is composed of various databases such as a knowledge database storing general knowledge, a scenario database storing dialogue sequence information, and an SNS (Social Networking Service) server.
  • the travel history information of the vehicle equipped with the car navigation device may be recorded in the storage unit of the car navigation device.
  • the data processing unit 22 can acquire the travel history information via the communication unit 25, analyze the user behavior, and use it as reference information at the time of utterance generation.
  • The utterance detection unit 101 inputs, via the voice input unit (microphone) 21, the user utterance voice 51 emitted by the user 1 and the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device.
  • When the utterance detection unit 101 detects that voice data has been input from the voice input unit (microphone) 21, it outputs the input voice data to the voice recognition unit 102.
  • The voice recognition unit 102 executes a text (utterance text) generation process based on the voice data input from the utterance detection unit 101.
  • the voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function, and converts voice data into text (utterance text) data composed of a plurality of words.
  • the utterance text generated by the voice recognition unit 102 is output to the response necessity determination unit 103.
  • Further, the voice recognition unit 102 determines whether the voice data input from the utterance detection unit 101 is (a) the user utterance voice 51 emitted by the user 1 or (b) the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device, attaches the utterance subject identifier (utterance subject identification tag) indicating the result as attribute information to the utterance text, and outputs both utterance texts to the response necessity determination unit 103.
  • Alternatively, an utterance subject identification unit for identifying the utterance subject may be provided separately from the voice recognition unit 102, and the utterance subject identifier (utterance subject identification tag) generated by the utterance subject identification unit may be output to the response necessity determination unit 103 together with the utterance text.
  • The process of determining whether the voice data input from the utterance detection unit 101 is the user utterance voice 51 or the user utterance partner device output voice 52 can be performed by analyzing the voice frequencies contained in the voice data. Since the output voice 52 of a user utterance partner device such as a car navigation device is output from a loudspeaker, it consists only of frequency data in a predetermined range determined by the characteristics of the loudspeaker, which differs from the frequencies contained in human speech.
  • The voice recognition unit 102 therefore analyzes the frequency characteristics of the voice data input from the utterance detection unit 101, determines whether the input voice data is the user utterance voice 51 or the user utterance partner device output voice 52, associates the utterance subject identifier (utterance subject identification tag) indicating the result with the utterance text, and outputs it to the response necessity determination unit 103.
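  • A simplified illustration of such a frequency-based determination is sketched below; the band limits and threshold are arbitrary assumptions chosen only to convey the idea that loudspeaker output occupies a narrower band than live human speech.

```python
# Naive sketch: tag an utterance as device output or user speech by checking how much
# of its spectral energy lies outside an assumed loudspeaker passband.
import numpy as np

def tag_utterance_subject(samples: np.ndarray, sample_rate: int,
                          band=(300.0, 3400.0), out_of_band_ratio=0.1) -> str:
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = spectrum.sum() + 1e-12
    outside = spectrum[(freqs < band[0]) | (freqs > band[1])].sum()
    # Mostly band-limited energy -> assume loudspeaker (partner device) output.
    return "user" if outside / total > out_of_band_ratio else "partner_device"
```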
  • The response necessity determination unit 103 inputs from the voice recognition unit 102 the utterance text in which the utterance subject identifier (utterance subject identification tag) is set.
  • When the input utterance text was generated based on the user utterance partner device output voice 52, the response necessity determination unit 103 determines whether or not the user proxy utterance device 20 needs to make a response utterance to that utterance text.
  • As the process of determining whether a response utterance is necessary, the response necessity determination unit 103 performs, for example, the following processing: if no utterance of the user 1 is input within a predetermined threshold time (for example, 2 seconds) from the utterance completion timing of the user utterance partner device 10, it determines that the user proxy utterance device 20 should respond in place of the user 1.
  • When the response necessity determination unit 103 determines that the user proxy utterance device 20 should respond, it requests the subsequent response generation unit 104 to generate a response.
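  • The timing rule described above can be summarised in a few lines; the sketch below assumes utterance events carry a timestamp, which is an assumption about the internal data format.

```python
import time
from typing import Optional

THRESHOLD_SEC = 2.0  # example threshold taken from the description above

def proxy_response_needed(device_utterance_end: float,
                          last_user_utterance: Optional[float],
                          now: Optional[float] = None) -> bool:
    """True if no user utterance followed the device utterance within the threshold."""
    now = time.monotonic() if now is None else now
    user_answered = (last_user_utterance is not None
                     and last_user_utterance > device_utterance_end)
    return (not user_answered) and (now - device_utterance_end >= THRESHOLD_SEC)

# Example: device finished speaking at t=10.0 s, no user utterance by t=12.5 s.
print(proxy_response_needed(device_utterance_end=10.0, last_user_utterance=None, now=12.5))  # True
```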
  • When a response generation request is input from the response necessity determination unit 103, the response generation unit 104 generates the utterance text corresponding to the utterance to be output from the user proxy utterance device 20.
  • For this utterance text generation, the various information stored in the storage unit 24, that is, the input utterance correspondence response database 121, the user action history information 122, and the user profile information 123, can be used. It is also possible to use information acquired via the communication unit 25, that is, information obtained from the external server 150 or the user utterance partner device 10. Specific examples of these processes will be described later.
  • the utterance text generated by the response generation unit 104 is input to the speech synthesis unit 105.
  • The voice synthesis unit 105 generates a synthetic voice based on the utterance text generated by the response generation unit 104. That is, it executes a voice synthesis process (TTS: Text To Speech), and the generated synthetic voice is output as the user proxy utterance 53 shown in the figure via the voice output unit (speaker) 23.
  • In the configuration shown in the figures, the user utterance partner device 10 and the user proxy utterance device 20 are configured as separate devices, but it is also possible to configure both as one device.
  • For example, a user proxy utterance execution unit that makes the user's proxy utterances may be provided inside the car navigation device.
  • As described above, the utterance detection unit 101 inputs, via the voice input unit (microphone) 21, the user utterance voice 51 emitted by the user 1 and the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device, and outputs the input voice data to the voice recognition unit 102.
  • The input, output, and execution processes of the utterance detection unit 101 are as follows.
  • The utterance voice of the user utterance partner device 10 or the user 1 is extracted from voice data containing various noise sounds, and the extracted utterance voice data is generated as the output data.
  • An existing voice section detection program can be used for the voice section detection process. For example, existing open source software that is allowed to be freely used or modified may be used.
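  • Since the text only requires an existing voice section detection program, the sketch below uses a very simple energy-threshold detector as a stand-in; a real implementation would typically use an open-source VAD library instead.

```python
# Minimal energy-based voice section (activity) detection sketch.
import numpy as np

def detect_voice_sections(samples: np.ndarray, sample_rate: int,
                          frame_ms: int = 30, energy_threshold: float = 0.01):
    """Return (start_sec, end_sec) pairs of regions whose RMS energy exceeds a threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    sections, start = [], None
    for i in range(0, len(samples) - frame_len, frame_len):
        rms = np.sqrt(np.mean(samples[i:i + frame_len] ** 2))
        t = i / sample_rate
        if rms >= energy_threshold and start is None:
            start = t                      # speech section begins
        elif rms < energy_threshold and start is not None:
            sections.append((start, t))    # speech section ends
            start = None
    if start is not None:
        sections.append((start, len(samples) / sample_rate))
    return sections
```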
  • As described above, the voice recognition unit 102 executes a text (utterance text) generation process based on the voice data input from the utterance detection unit 101.
  • the voice recognition unit 102 has, for example, an ASR (Automatic Speech Recognition) function, and converts voice data into text (utterance text) data composed of one or a plurality of words.
  • Further, the voice recognition unit 102 determines whether the voice data input from the utterance detection unit 101 is (a) the user utterance voice 51 emitted by the user 1 or (b) the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device, attaches the utterance subject identifier (utterance subject identification tag) indicating the result as attribute information to the utterance text, and outputs both utterance texts to the response necessity determination unit 103.
  • The input, output, and execution processes of the voice recognition unit 102 are as follows.
  • (a) Input: utterance voice (the utterance voice of the user utterance partner device 10 or the utterance voice of the user 1)
  • (b) Output: utterance text to which the utterance subject identifier (utterance subject identification tag) is added (the utterance text of the user utterance partner device 10 or the utterance text of the user himself or herself)
  • (c) Processing: automatic conversion of the utterance voice into text
  • For the speech recognition, for example, a program that executes the above-mentioned ASR (Automatic Speech Recognition) function is used. Open source software may be used.
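  • As one possible realization, the sketch below uses the open-source SpeechRecognition package with a cloud recognizer; the package choice and the recognizer are assumptions on my part, since the disclosure only requires a program that executes the ASR function.

```python
# Hedged example: convert an utterance recorded in a WAV file into utterance text.
import speech_recognition as sr

def transcribe(wav_path: str, language: str = "ja-JP") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)   # read the entire audio file
    # Uses the Google Web Speech API backend provided by the package.
    return recognizer.recognize_google(audio, language=language)

# Example usage (file name is hypothetical):
# print(transcribe("device_utterance.wav"))
```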
  • As described above, the voice recognition unit 102 determines whether the voice data input from the utterance detection unit 101 is (a) the user utterance voice 51 emitted by the user 1 or (b) the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device.
  • Since the output voice 52 of a user utterance partner device such as a car navigation device is output from a loudspeaker, it consists only of frequency data in a predetermined range determined by the characteristics of the loudspeaker, which differs from the frequencies contained in human speech.
  • The voice recognition unit 102 analyzes the frequency characteristics of the voice data input from the utterance detection unit 101 and determines whether the input voice data is the user utterance voice 51 or the user utterance partner device output voice 52.
  • Then, the utterance subject identifier (utterance subject identification tag) indicating the determination result is associated with the utterance text and output to the response necessity determination unit 103.
  • As described above, the response necessity determination unit 103 inputs the utterance text in which the utterance subject identifier (utterance subject identification tag) generated by the voice recognition unit 102 is set, and, when the tag indicates that the utterance text is based on the user utterance partner device output voice 52, determines whether or not the user proxy utterance device 20 needs to make a response utterance, that is, whether the user proxy utterance device 20 should respond in place of the user 1.
  • When the response necessity determination unit 103 determines that the user proxy utterance device 20 should respond, it requests the subsequent response generation unit 104 to generate a response.
  • The input, output, and execution processes of the response necessity determination unit 103 are as follows.
  • The utterance text in which the utterance subject identifier (utterance subject identification tag) generated by the voice recognition unit 102 is set is input.
  • The same utterance text is also output to the response generation unit 104.
  • As its processing, the response necessity determination unit 103 performs, for example, the following.
  • If no user utterance is made within the predetermined threshold time from the utterance completion timing of the user utterance partner device 10, it determines that the user proxy utterance device 20 should respond in place of the user 1.
  • The utterance completion timing of the user utterance partner device 10 and the user utterance detection timing required in this determination process are judged based on the times at which the utterance texts with the utterance subject identifier (utterance subject identification tag) set are input from the voice recognition unit 102 to the response necessity determination unit 103. Alternatively, the utterance detection times at which the utterances are detected by the utterance detection unit 101 may be used.
  • There are a plurality of processing modes for the response necessity determination process that determines whether or not the user proxy utterance device 20 should respond in place of the user 1, and any one of them, or a combination of a plurality of the processing examples described below, can be executed.
  • For example, the face image of the user 1 is photographed using the image input unit (camera) 26 provided in the user proxy utterance device 20, and the line-of-sight information of the user 1 is analyzed from the photographed face image.
  • Alternatively, an image of the user 1 may be captured and the gesture (signal) of the user 1 analyzed from the captured image to detect a "gesture indicating that the user wants to speak" or a "gesture indicating that the user does not want to speak", and whether or not a response by the user proxy utterance device 20 is necessary may be determined based on the gesture analysis result.
  • First, (Processing Example 1) will be described.
  • In (Processing Example 1), the necessity of a response by the user proxy utterance device 20 is determined based on whether a user utterance is detected within a specified threshold time from the utterance completion timing of the user utterance partner device 10.
  • The processing sequence of this (Processing Example 1) will be described.
  • Step S101 First, in step S101, the response necessity determination unit 103 of the user proxy utterance device 20 inputs the utterance text of the user utterance partner device 10 from the voice recognition unit 102.
  • As described above, the voice recognition unit 102 determines whether the voice data input from the utterance detection unit 101 is (a) the user utterance voice 51 emitted by the user 1 or (b) the user utterance partner device output voice 52 output from the user utterance partner device 10 such as a car navigation device, and the utterance text in which the utterance subject identifier (utterance subject identification tag) indicating this identification result is set as additional information is input to the response necessity determination unit 103.
  • When the response necessity determination unit 103 refers to the utterance subject identifier (utterance subject identification tag) and confirms that the input text from the voice recognition unit 102 is the utterance text of the user utterance partner device 10, it executes the processing of step S102 and subsequent steps.
  • Step S102 Next, in step S102, the response necessity determination unit 103 determines whether or not a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • The utterance completion timing of the user utterance partner device 10 and the user utterance detection timing required in this determination process are judged based on the times at which the utterance texts with the utterance subject identifier (utterance subject identification tag) set are input from the voice recognition unit 102 to the response necessity determination unit 103. Alternatively, the utterance detection times in the utterance detection unit 101 may be used.
  • If a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S102 is Yes, and the process proceeds to step S103.
  • On the other hand, if no user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S102 is No, and the process proceeds to step S104.
  • Step S103 In step S102, when the user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the process of step S103 is executed.
  • In step S103, the response necessity determination unit 103 generates a response necessity identification value (0) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10, is "unnecessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 does not execute the response utterance generation process for output from the user proxy utterance device 20.
  • Step S104 On the other hand, in step S102, if the user utterance is not detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the process of step S104 is executed.
  • In step S104, the response necessity determination unit 103 generates a response necessity identification value (1) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10, is "necessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 executes the response utterance generation process for output from the user proxy utterance device 20.
  • As a further refinement, it may be checked whether the user utterance detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10 was made to the user utterance partner device 10 such as a car navigation device.
  • Only when the detected user utterance was made to the user utterance partner device 10 is the response necessity identification value (0), indicating that an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10 is "unnecessary", generated and output to the response generation unit 104.
  • Otherwise, the response necessity identification value (1), indicating that an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10 is "necessary", is generated and output to the response generation unit 104.
  • the determination as to whether or not the user utterance is made to the user utterance partner device 10 such as the car navigation device can be executed based on, for example, the semantic analysis result of the user utterance.
  • By semantic analysis of the user utterance, it can be determined whether or not the user utterance was made as a response to the immediately preceding utterance of the user utterance partner device 10. For example, the text of the user utterance can be analyzed to determine whether or not it is an operation command to the car navigation device (that is, a response to the utterance of the user utterance partner device 10).
  • Alternatively, the line-of-sight direction of the user 1 may be analyzed from the face image of the user 1 captured by the image input unit (camera) 26 mounted on the user proxy utterance device 20; if the line of sight of the user 1 is directed toward the user utterance partner device 10, it is determined that the user utterance was made to the user utterance partner device 10, and if the line of sight of the user 1 is not directed toward the user utterance partner device 10, it is determined that the user utterance was not made to the user utterance partner device 10.
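  • The two cues mentioned here (semantic analysis of the utterance text and line-of-sight analysis) could be combined roughly as follows; the keyword list and the gaze-analysis stub are illustrative assumptions.

```python
# Sketch: decide whether a detected user utterance was addressed to the partner device.
NAV_COMMAND_KEYWORDS = ("route", "destination", "navigate", "expressway")  # assumed examples

def is_nav_command(utterance_text: str) -> bool:
    """Very rough semantic check: does the text look like a car-navigation command?"""
    text = utterance_text.lower()
    return any(keyword in text for keyword in NAV_COMMAND_KEYWORDS)

def gaze_toward_device(face_image) -> bool:
    """Stub for line-of-sight analysis from the camera image."""
    raise NotImplementedError

def addressed_to_partner_device(utterance_text: str, face_image=None) -> bool:
    if is_nav_command(utterance_text):
        return True
    if face_image is not None:
        return gaze_toward_device(face_image)
    return False
```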
  • Steps S121 to S122 The processes of steps S121 to S122 are the same as the processes of steps S101 to S102 of the flow described with reference to FIG. 12.
  • That is, the response necessity determination unit 103 of the user proxy utterance device 20 inputs the utterance text of the user utterance partner device 10 from the voice recognition unit 102 in step S121.
  • Next, in step S122, the following processing is executed.
  • In step S122, the response necessity determination unit 103 determines whether or not a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • If the user utterance text is input within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S122 is Yes, and the process proceeds to step S123. On the other hand, if the user utterance text is not input within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S122 is No, and the process proceeds to step S125.
  • Step S123 In step S122, when the user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the process of step S123 is executed.
  • In step S123, the response necessity determination unit 103 determines whether or not the user utterance detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10 was made to the user utterance partner device 10 such as a car navigation device. For example, the determination is made using the semantic analysis result of the user utterance and the analysis result of the line-of-sight information obtained from the user's face image.
  • If it is determined that the detected user utterance was made to the user utterance partner device 10, the process proceeds to step S124. On the other hand, if it is determined that the detected user utterance was not made to the user utterance partner device 10, the process proceeds to step S125.
  • Step S124 When it is determined in step S122 that a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, and it is further determined in step S123 that the user utterance was made to the user utterance partner device 10, the process proceeds to step S124.
  • In step S124, the response necessity determination unit 103 generates a response necessity identification value (0) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10, is "unnecessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 does not execute the response utterance generation process for output from the user proxy utterance device 20.
  • Step S125 On the other hand, when no user utterance is detected in step S122 within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, or when it is determined in step S123 that the user utterance detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10 was not made to the user utterance partner device 10, the process proceeds to step S125.
  • In step S125, the response necessity determination unit 103 generates a response necessity identification value (1) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in response to the output utterance from the user utterance partner device 10, is "necessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 executes the response utterance generation process for output from the user proxy utterance device 20.
  • Step S141 The process of step S141 is the same process as the process of step S101 of the flow described with reference to FIG. 14
  • That is, the response necessity determination unit 103 of the user proxy utterance device 20 inputs the utterance text of the user utterance partner device 10 from the voice recognition unit 102 in step S141.
  • Next, in step S142, the following processing is executed.
  • In step S142, the response necessity determination unit 103 determines whether or not the utterance immediately preceding the input utterance text of the user utterance partner device 10 was made by the user proxy utterance device 20.
  • If the immediately preceding utterance was made by the user proxy utterance device 20, the determination in step S142 is Yes, and the process proceeds to step S144. If it was not, the determination in step S142 is No, and the process proceeds to step S143. The latter case includes two cases: the case where there is no utterance immediately before the input utterance text of the user utterance partner device 10, and the case where the utterance immediately before the input utterance text of the user utterance partner device 10 is a user utterance by the user 1.
  • Step S143 If it is determined in step S142 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was not made by the user proxy utterance device 20, the process proceeds to step S143.
  • In step S143, the response necessity determination unit 103 determines whether or not a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • If the user utterance text is input within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S143 is Yes, and the process proceeds to step S145. On the other hand, if the user utterance text is not input within the specified threshold time (N seconds), the determination in step S143 is No, and the process proceeds to step S144.
  • Step S144 The process of step S144 is executed in either of the following cases: (a) it is determined in step S142 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was made by the user proxy utterance device 20, or (b) it is determined in step S142 that it was not, and it is further determined in step S143 that no user utterance was detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • In step S144, the response necessity determination unit 103 generates a response necessity identification value (1) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in reply to the output utterance of the user utterance partner device 10, is "necessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 executes the response utterance generation process for output from the user proxy utterance device 20.
  • Step S145 On the other hand, when it is determined in step S142 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was not made by the user proxy utterance device 20, and it is further determined in step S143 that a user utterance was detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the response necessity determination unit 103 executes the process of step S145.
  • In step S145, the response necessity determination unit 103 generates a response necessity identification value (0) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in reply to the output utterance of the user utterance partner device 10, is "unnecessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 does not execute the response utterance generation process for output from the user proxy utterance device 20.
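  • As a complement to the sketch above, the check of step S142, whether the proxy device itself produced the immediately preceding utterance, can be kept in a small piece of dialogue state. The class below is a hypothetical illustration; the attribute and method names are assumptions.

    class DialogueState:
        """Tracks who produced the utterance immediately before the current device utterance."""

        def __init__(self):
            self.last_speaker = None  # "proxy", "user", or None at the start of a dialogue sequence

        def on_utterance(self, speaker):
            self.last_speaker = speaker

        def proxy_response_needed(self, user_answered_in_time):
            # Step S142: once the proxy has answered (it spoke immediately before),
            # it keeps answering the partner device (step S144, value 1).
            if self.last_speaker == "proxy":
                return 1
            # Steps S143 to S145: otherwise speak only if the user did not answer in time.
            return 0 if user_answered_in_time else 1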
  • Next, a processing example (processing example 2 + 3) in which (processing example 2) described with reference to FIG. 13 and (processing example 3) described with reference to FIG. 14 are executed together will be described.
  • Step S151 The process of step S151 is the same as the process of step S101 of the flow described above.
  • In step S151, the response necessity determination unit 103 of the user proxy utterance device 20 receives the utterance text of the user utterance partner device 10 from the voice recognition unit 102.
  • The response necessity determination unit 103 refers to the utterance subject identifier (utterance subject identification tag), confirms that the input text from the voice recognition unit 102 is the utterance text of the user utterance partner device 10, and then executes the processing of step S152 and subsequent steps.
  • Step S152 In step S152, the response necessity determination unit 103 determines whether the utterance immediately preceding the input utterance text of the user utterance partner device 10 was made by the user proxy utterance device 20.
  • If the preceding utterance was made by the user proxy utterance device 20, the determination in step S152 is Yes, and the process proceeds to step S155. If it was not, the determination in step S152 is No, and the process proceeds to step S153. The latter case covers two situations: there is no utterance immediately preceding the input utterance text of the user utterance partner device 10, or the immediately preceding utterance is a user utterance made by the user 1.
  • Step S153 If it is determined in step S152 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was not made by the user proxy utterance device 20, the process proceeds to step S153.
  • In step S153, the response necessity determination unit 103 determines whether or not a user utterance is detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • If the user utterance text is input within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10, the determination in step S153 is Yes, and the process proceeds to step S154. On the other hand, if the user utterance text is not input within the specified threshold time (N seconds), the determination in step S153 is No, and the process proceeds to step S155.
  • Step S154 The process of step S154 is executed when it is determined in step S152 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was not made by the user proxy utterance device 20, and it is further determined in step S153 that a user utterance was detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • In step S154, the response necessity determination unit 103 determines whether the user utterance detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10 was made to the user utterance partner device 10, such as a car navigation device. For example, this determination is made using the semantic analysis result of the user utterance and the line-of-sight analysis result obtained from an image of the user's face.
  • If it is determined that the detected user utterance was made to the user utterance partner device 10, the process proceeds to step S156. On the other hand, if it is determined that the detected user utterance was not made to the user utterance partner device 10, the process proceeds to step S155.
  • Step S155 The process of step S155 is executed in any of the following cases.
  • (A) It is determined in step S152 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was made by the user proxy utterance device 20.
  • (B) It is determined in step S152 that the preceding utterance was not made by the user proxy utterance device 20, and it is further determined in step S153 that no user utterance was detected within the specified threshold time (N seconds) from the utterance completion timing of the user utterance partner device 10.
  • (C) It is determined in step S152 that the preceding utterance was not made by the user proxy utterance device 20, it is determined in step S153 that a user utterance was detected within the specified threshold time (N seconds), and it is further determined in step S154 that the detected user utterance was not made to the user utterance partner device 10.
  • In step S155, the response necessity determination unit 103 generates a response necessity identification value (1) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in reply to the output utterance of the user utterance partner device 10, is "necessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 executes the response utterance generation process for output from the user proxy utterance device 20.
  • Step S156 On the other hand, when it is determined in step S152 that the utterance immediately preceding the input utterance text of the user utterance partner device 10 was not made by the user proxy utterance device 20, it is determined in step S153 that a user utterance was detected within the specified threshold time (N seconds), and it is further determined in step S154 that the detected user utterance was made to the user utterance partner device 10, the response necessity determination unit 103 executes the process of step S156.
  • In step S156, the response necessity determination unit 103 generates a response necessity identification value (0) indicating that a response by the user proxy utterance device 20, that is, an utterance from the user proxy utterance device 20 in reply to the output utterance of the user utterance partner device 10, is "unnecessary", and outputs it to the response generation unit 104.
  • In this case, the response generation unit 104 does not execute the response utterance generation process for output from the user proxy utterance device 20.
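  • Combining the two previous sketches gives one possible reading of steps S152 to S156 as a single decision function. Parameter names and the default threshold are assumptions; the addressee judgment is assumed to come from the semantic and line-of-sight analysis mentioned above.

    def combined_response_needed(last_speaker, partner_done_at, user_utterance_at,
                                 addressed_to_partner, threshold_seconds=5.0):
        """Return 1 (proxy utterance "necessary") or 0 ("unnecessary") for processing example 2 + 3."""
        # Step S152, case (A): the proxy spoke immediately before, so it keeps handling the dialogue.
        if last_speaker == "proxy":
            return 1
        # Step S153, case (B): no user utterance within the threshold time.
        if user_utterance_at is None or (user_utterance_at - partner_done_at) > threshold_seconds:
            return 1
        # Step S154, case (C): the user spoke in time, but not to the partner device.
        if not addressed_to_partner:
            return 1
        # Step S156: the user answered the partner device themselves; stay silent.
        return 0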
  • Note that the response necessity determination unit 103 may be configured to output a signal notifying the user 1 that the user proxy utterance device 20 is about to speak (respond). Specifically, the signal may be, for example, a blinking LED lamp, a sound effect, or an utterance such as "I will answer".
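  • A notification of this kind could be wired up as follows. The device facade and its methods (blink_led, play_sound_effect, speak) are hypothetical placeholders for whatever hardware interface the product actually exposes.

    def notify_before_proxy_utterance(device):
        """Give the user 1 a cue that the proxy device is about to answer."""
        device.blink_led(times=3)           # e.g. a blinking LED lamp
        device.play_sound_effect("chime")   # or a short sound effect
        device.speak("I will answer.")      # or an explicit announcement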
  • Next, the processing of the response generation unit 104 will be described. When a response generation request is input from the response necessity determination unit 103 as described above, the response generation unit 104 generates the utterance text of the utterance to be output from the user proxy utterance device 20.
  • For this generation process, various information stored in the storage unit 24, that is, the input utterance correspondence response database 121, the user action history information 122, and the user profile information 123, can be used. Information that can be acquired via the communication unit 25, that is, information acquired from the external server 150 or from the user utterance partner device 10, can also be used.
  • the input, output, and execution processes of the response generation unit 104 are as follows.
  • As input, the utterance text in which the utterance subject identifier (utterance subject identification tag) generated by the voice recognition unit 102 has been set is also entered.
  • When the response generation unit 104 generates the utterance text of the utterance to be output from the user proxy utterance device 20, it uses various information stored in the storage unit 24, that is, the input utterance correspondence response database 121, the user action history information 122, and the user profile information 123, as well as information acquired from the external server 150 and the user utterance partner device 10.
  • The input utterance correspondence response database 121 is a database that stores a large number of entries associating (A) a sample input utterance with (B) a response utterance.
  • The sample input utterance is sample utterance text data assuming an utterance of the user utterance partner device 10.
  • The response utterance is the utterance text data to be output from the user proxy utterance device 20 in reply to each such sample input utterance.
  • In addition, the user action history information 122, the user profile information 123, and information acquired from the external server 150 and the user utterance partner device 10 are used.
  • A specific example of the user action history information 122 is shown in FIG. 17. As shown in FIG. 17, the user action history information 122 stores the past action history of the user 1 as a large number of entries associating (a) a date and time with (b) an action.
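  • As a rough illustration of the two data stores just described, the records could be modelled as follows. The field names and the sample rows are assumptions; only the (A)/(B) and (a)/(b) pairings come from the specification.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ResponseEntry:
        """One entry of the input utterance correspondence response database 121."""
        sample_input_utterance: str   # (A) an expected utterance of the partner device
        response_utterance: str       # (B) the utterance the proxy device outputs for it

    @dataclass
    class ActionHistoryRecord:
        """One entry of the user action history information 122."""
        timestamp: datetime           # (a) date and time
        action: str                   # (b) action

    response_db = [
        ResponseEntry("Certainly. May I decide the destination?",
                      "Please decide it."),
        ResponseEntry("Certainly. Is Odaiba Station OK as the destination?",
                      "Please make it Odaiba Kaihin Koen Station."),
    ]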
  • When the response generation unit 104 of the user proxy utterance device 20 cannot determine a single utterance text to be output from the user proxy utterance device 20 by referring only to the registration data of the input utterance correspondence response database 121, it refers to the user action history information 122 and other information, estimates the intention of the user 1, and determines a proxy utterance to be made on behalf of the user 1.
  • The other information, besides the user action history information 122, includes, for example, the user profile information 123, information acquired from various external servers such as a knowledge database, a scenario database storing dialogue sequence information, and an SNS (Social Networking Service) server, and data accumulated in the user utterance partner device 10.
  • The response generation unit 104 acquires and refers to this information, estimates the intention of the user 1, and determines a proxy utterance to be made on behalf of the user 1.
  • the user profile information 123 records the gender, age, hobbies, family structure, etc. of the user 1.
  • the external server 150 includes various servers such as a knowledge database, a scenario database storing dialogue sequence information, and an SNS (Social Networking Service) server.
  • In the SNS server, utterances, images, and the like posted by the user 1 using the SNS are registered.
  • The response generation unit 104 of the user proxy utterance device 20 estimates the user's intention by analyzing, for example, the utterances or posted images of the user 1 registered in the SNS server.
  • For example, suppose the input utterance text, that is, the utterance text of the user utterance partner device 10, is the following.
  • Partner device utterance text: "Certainly. Is Odaiba Station OK as the destination?"
  • As described above, the response generation unit 104 uses the input utterance correspondence response database 121, the user action history information 122, and the user profile information 123 in order to generate an utterance suitable as a response to this utterance of the user utterance partner device 10. Further, if necessary, information that can be acquired via the communication unit 25, that is, information acquired from the external server 150 or the user utterance partner device 10, is used.
  • First, the "(A) sample input utterance" most similar to this utterance text is selected from the input utterance correspondence response database 121.
  • the "similarity" used as the determination index value when selecting "(A) sample input utterance” in the input utterance correspondence response database 121 having the highest degree of similarity to the utterance text of the user utterance partner device 10 is, for example, You can use the degree of duplication of words and phrases contained in the utterance. For example, morphological analysis can be performed to determine the degree of similarity based on the number of common morphemes.
  • For the partner device utterance text "Certainly. Is Odaiba Station OK as the destination?", the utterance text is first divided into morpheme-level words: "kashikomari / mashi / ta / . / mokuteki / chi / wa / Odaiba / eki / de / yoroshii / desu / ka" (13 morphemes (words)).
  • Sample input utterance of entry 122: "Certainly. May I decide the destination?"
  • Sample input utterance of entry 201: "Certainly. Is Odaiba Station OK as the destination?"
  • Entry 122: "kashikomari / mashi / ta / . / mokuteki / chi / o / kime / te / mo / yoi / desu / ka" (13 morphemes (words))
  • Entry 201: "kashikomari / mashi / ta / . / mokutekichi / wa / Odaiba / eki / de / yoroshii / desu / ka" (12 morphemes (words))
  • The number of morphemes common to the partner device utterance text and entry 122 is 8 (kashikomari / mashi / ta / . / mokuteki / chi / desu / ka).
  • The number of morphemes common to the partner device utterance text and entry 201 is 11 (kashikomari / mashi / ta / . / wa / Odaiba / eki / de / yoroshii / desu / ka). Entry 201, which has the larger number of common morphemes, is therefore selected as the most similar sample input utterance.
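  • The common-morpheme comparison above can be reproduced with a few lines of Python. A real system would obtain the morpheme lists from a morphological analyzer such as MeCab; here the romanized, pre-tokenized lists stand in for that step, and counting distinct shared morphemes is one possible interpretation of "number of common morphemes".

    def common_morpheme_count(a, b):
        """Count the morphemes shared by two utterances given as morpheme lists."""
        return len(set(a) & set(b))

    partner = ["kashikomari", "mashi", "ta", ".", "mokuteki", "chi", "wa",
               "Odaiba", "eki", "de", "yoroshii", "desu", "ka"]          # 13 morphemes
    entry_122 = ["kashikomari", "mashi", "ta", ".", "mokuteki", "chi", "o",
                 "kime", "te", "mo", "yoi", "desu", "ka"]                # 13 morphemes
    entry_201 = ["kashikomari", "mashi", "ta", ".", "mokutekichi", "wa",
                 "Odaiba", "eki", "de", "yoroshii", "desu", "ka"]        # 12 morphemes

    print(common_morpheme_count(partner, entry_122))  # 8
    print(common_morpheme_count(partner, entry_201))  # 11 -> entry 201 is selected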
  • Note that the response generation unit 104 may also perform a more flexible similarity calculation that not only compares the numbers of common morphemes, words, and phrases as described above, but also takes into account words and phrases that are written differently yet are semantically close, such as two different words both meaning "destination". As a specific process in this case, for example, similarity analysis using distributed representations of morphemes (words) or phrases can be performed.
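  • One hedged sketch of such a flexible calculation averages distributed representations and compares them with cosine similarity. The embed function is a placeholder for any word-embedding model and is an assumption, not something specified in this document.

    import numpy as np

    def utterance_vector(morphemes, embed):
        """Average the distributed representations (word vectors) of the morphemes."""
        return np.mean([embed(m) for m in morphemes], axis=0)

    def cosine_similarity(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def flexible_similarity(a, b, embed):
        # Semantically close words with different notations (e.g. two different
        # words for "destination") map to nearby vectors, so they still raise the score.
        return cosine_similarity(utterance_vector(a, embed), utterance_vector(b, embed))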
  • When the candidates cannot be narrowed down to one, the response generation unit 104 executes one of the following processes: it may output a candidate randomly selected from the remaining candidates, or it may output an utterance requesting a response from the user 1.
  • Further, the generated utterance text may be converted into wording suited to the user himself or herself. For example, expressions frequently used by the user 1, such as first-person pronouns (for example, "watashi", "boku", "ore") and sentence-ending expressions (for example, "kashira", "kana", "dayo"), are applied according to the user 1.
  • These user-specific expressions are acquired by referring to, for example, the user profile information 123 and the registration data of the SNS server or the like constituting the external server 150.
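  • A toy version of this wording conversion is shown below. The profile dictionary, its keys, and the replacement rules are assumptions; a real conversion would need proper morphological handling rather than plain string replacement.

    def apply_user_style(text, profile):
        """Rewrite a generated utterance with the user's habitual expressions."""
        # Swap a neutral first-person pronoun for the one the user favours
        # (e.g. "watashi", "boku", "ore").
        text = text.replace("watashi", profile.get("first_person", "watashi"))
        # Adjust the sentence ending (e.g. "kashira", "kana", "dayo").
        ending = profile.get("sentence_ending")
        if ending and not text.endswith(ending):
            text = text.rstrip("。.") + ending
        return text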
  • FIG. 18 is a diagram showing a detailed configuration example of the response generation unit 104. As shown in FIG. 18, the response generation unit 104 includes a proxy utterance selection unit 151 and a proxy utterance expression conversion unit 152.
  • The proxy utterance selection unit 151 receives the "utterance text of the user utterance partner device 10" 201 shown in FIG. 18 together with the response generation request from the response necessity determination unit 103.
  • The proxy utterance selection unit 151 then selects, for this input text, the utterance text corresponding to the utterance to be output from the user proxy utterance device 20.
  • In this selection process, the utterance text is selected from the input utterance correspondence response database 121 in the storage unit 24 described above, that is, the database in which corresponding data of (A) sample input utterances and (B) response utterances are registered.
  • The proxy utterance selection unit 151 selects the "(A) sample input utterance" of the input utterance correspondence response database 121 that has the highest similarity to the utterance text of the user utterance partner device 10.
  • As the similarity, the degree of overlap of the words and phrases contained in the utterances can be used.
  • For example, morphological analysis is performed and the similarity is determined from the number of common morphemes.
  • When a single entry cannot be determined, the proxy utterance selection unit 151 executes one of the following processes: (A) randomly select one of the candidates, or (B) use other information, that is, the user action history information 122, the user profile information 123, and information acquired from the external server 150 and the user utterance partner device 10, to make the selection.
  • The utterance selected by the proxy utterance selection unit 151 is input to the proxy utterance expression conversion unit 152.
  • The proxy utterance expression conversion unit 152 converts the utterance selected by the proxy utterance selection unit 151 into the user's own wording. For example, as described above, expressions frequently used by the user 1, such as first-person pronouns (for example, "watashi", "boku", "ore") and sentence-ending expressions (for example, "kashira", "kana", "dayo"), are applied to change the utterance expression according to the user 1.
  • These user-specific expressions are acquired by referring to, for example, the user profile information 123 and the registration data of the SNS server or the like constituting the external server 150.
  • The utterance text generated by the proxy utterance expression conversion unit 152, that is, the "utterance text of the user proxy utterance device" 202 shown in FIG. 18, is input to the voice synthesis unit 105.
  • Next, the processing sequence of the response generation unit 104 will be described. Step S201 First, in step S201, the response generation unit 104 receives the "utterance text of the user utterance partner device 10" together with the response generation request from the response necessity determination unit 103.
  • Step S202 The response generation unit 104 accesses the input utterance correspondence response database 121, executes a search process based on the utterance text of the output utterance of the user utterance partner device 10, and selects, from the sample input utterances registered in the database, the entry in which the sample input utterance with the highest similarity is registered.
  • As the similarity, for example, the degree of overlap of the words and phrases contained in the utterances can be used.
  • For example, morphological analysis is performed and the similarity is determined from the number of common morphemes.
  • Step S203 The response generation unit 104 determines whether an entry with a sufficiently high similarity was selected in the entry selection process of step S202.
  • If the selection of an entry with high similarity fails, the determination in step S203 is No, and the process ends. In this case, no utterance is made from the user proxy utterance device 20. On the other hand, if the selection of an entry with high similarity succeeds, the determination in step S203 is Yes, and the process proceeds to step S204.
  • Step S204 If an entry with high similarity is successfully selected from the input utterance correspondence response database 121 in step S202, the process of step S204 is executed.
  • In step S204, the response generation unit 104 attempts to select, based on the user action history registered in the user action history information 122 and other information, a single entry in which a response that can be judged optimal is registered, and selects it if such a selection is possible.
  • The other information includes the user profile information 123, information acquired from various external servers such as a knowledge database, a scenario database storing dialogue sequence information, and an SNS (Social Networking Service) server, and data accumulated in the user utterance partner device 10.
  • Step S205 The response generation unit 104 determines whether a single entry for the proxy utterance to be output by the user proxy utterance device 20 has been successfully selected.
  • If one entry registered in the input utterance correspondence response database 121 is successfully selected, the process proceeds to step S207. If the entries registered in the input utterance correspondence response database 121 cannot be narrowed down and a plurality of entries remain selected, the process proceeds to step S206.
  • Step S206 If the entries registered in the input utterance correspondence response database 121 cannot be narrowed down in step S205 and a plurality of entries remain selected, the response generation unit 104 randomly selects one of the selected entries in step S206.
  • Step S207 The response generation unit 104 selects the response utterance registered in the single selected entry and outputs it to the voice synthesis unit 105.
  • If a single entry registered in the input utterance correspondence response database 121 was successfully selected in step S205, the response utterance registered in that entry is selected and output to the voice synthesis unit 105.
  • If a single entry could not be selected in step S205, the response utterance registered in the entry randomly selected in step S206 is selected and output to the voice synthesis unit 105.
  • In the flow described above, when the response generation unit 104 fails to select a single entry registered in the input utterance correspondence response database 121 in step S205, it randomly selects one entry from the plurality of entries selected in step S206. Alternatively, in such a case, the proxy utterance may be stopped and a user utterance awaited.
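  • The selection flow of steps S201 to S207 might look like the following sketch. It reuses the ResponseEntry structure sketched earlier; the similarity and pick_best callbacks, the minimum-similarity cut-off, and all names are assumptions made for illustration.

    import random

    def generate_proxy_utterance(device_utterance, response_db, similarity,
                                 pick_best=None, min_similarity=1):
        """Return a response utterance text, or None when no proxy utterance should be output."""
        # Steps S202/S203: score every entry and give up if nothing is similar enough.
        scored = [(similarity(device_utterance, e.sample_input_utterance), e) for e in response_db]
        best_score = max(score for score, _ in scored)
        if best_score < min_similarity:
            return None
        candidates = [e for score, e in scored if score == best_score]
        # Step S204: try to narrow the candidates down with action history / profile information.
        if len(candidates) > 1 and pick_best is not None:
            chosen = pick_best(candidates)
            if chosen is not None:
                candidates = [chosen]
        # Steps S205/S206: fall back to a random choice if several entries remain.
        entry = candidates[0] if len(candidates) == 1 else random.choice(candidates)
        # Step S207: the registered response utterance is handed to the voice synthesis unit.
        return entry.response_utterance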
  • FIG. 20 shows a processing sequence for executing a process of outputting an utterance that requests an utterance from the user.
  • In the flow shown in FIG. 20, the processing of steps S201 to S205 and the processing of step S207 are the same as the processing of the corresponding steps of the flow described with reference to FIG. 19.
  • The process of step S206 of the flow shown in FIG. 19 is replaced with the process of step S221.
  • The process of step S221 is described below.
  • Step S221 When the entries registered in the input utterance correspondence response database 121 cannot be narrowed down in step S205 and a plurality of entries remain selected, the response generation unit 104 generates and outputs an utterance requesting the user to speak in step S221. Such processing may also be performed.
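  • The step S221 variant replaces the random fallback with a request to the user, for example as in the fragment below; the wording of the request and the function name are illustrative assumptions.

    import random

    def resolve_multiple_candidates(candidates, ask_user=True):
        """Either pick the single remaining entry or hand the turn back to the user."""
        if len(candidates) == 1:
            return candidates[0].response_utterance
        if ask_user:
            # Step S221: output an utterance requesting a response from the user 1.
            return "I am not sure how to answer. Could you reply yourself?"
        # Step S206 behaviour as a fallback.
        return random.choice(candidates).response_utterance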
  • Next, the processing of the voice synthesis unit 105 will be described. The voice synthesis unit 105 generates synthetic speech based on the utterance text generated by the response generation unit 104. That is, voice synthesis processing (TTS: Text To Speech) is executed, and the generated synthetic speech is output as the user proxy utterance device output voice 53 shown in the figure via the voice output unit (speaker) 23.
  • the input, output, and execution processes of the voice synthesis unit 105 are as follows.
  • a synthetic voice is generated based on the utterance text generated by the response generation unit 104. That is, voice synthesis processing (TTS: Text To Speech) is executed to generate synthetic voice.
  • the generated synthetic voice is output via the voice output unit (speaker) 23.
  • For example, suppose the utterance text generated by the response generation unit 104 as the utterance content to be output by the user proxy utterance device 20 is the following text: "Please make it Odaiba Kaihin Koen Station".
  • the voice synthesis unit 105 converts this utterance text into voice.
  • For the speech synthesis, a speech synthesis (TTS: Text To Speech) execution program can be used. It may also be executed using open-source software.
  • During synthetic speech generation, the voice synthesis unit 105 may switch the speech synthesis model according to the user 1 so that the output voice resembles the user's own speech. For example, a model may be selected according to the attributes (age, gender, etc.) of the user 1, or a model similar to the user's own voice quality may be selected from among a number of speech synthesis models.
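  • As a concrete but non-authoritative example of the synthesis step, the fragment below uses pyttsx3, one open-source TTS package; the voice-matching logic is only a stand-in for selecting a synthesis model that resembles the user's own voice, and the preferred voice name is an assumption.

    import pyttsx3  # one example of an open-source TTS backend

    def speak_proxy_utterance(text, preferred_voice_name=None):
        """Synthesize the proxy utterance and play it through the speaker."""
        engine = pyttsx3.init()
        if preferred_voice_name:
            for voice in engine.getProperty("voices"):
                if preferred_voice_name.lower() in voice.name.lower():
                    engine.setProperty("voice", voice.id)
                    break
        engine.say(text)
        engine.runAndWait()

    speak_proxy_utterance("Please make it Odaiba Kaihin Koen Station.")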
  • Further, if an utterance from the user 1 is input while the voice is being output, the voice output may be stopped.
  • Alternatively, when an utterance from the user 1 is input during voice output, it may be determined whether the user utterance is directed to the user proxy utterance device 20, and the utterance of the user proxy utterance device 20 may be stopped only when it is determined that the user utterance is directed to the user proxy utterance device 20.
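  • The conditional interruption described here could be handled along the lines of the sketch below. The playback handle and the addressed_to_proxy flag are hypothetical; the flag is assumed to come from the same semantic and line-of-sight analysis used for the response necessity determination.

    def handle_barge_in(playback, user_utterance, addressed_to_proxy):
        """Stop the proxy's speech only when the user is talking to the proxy device."""
        if user_utterance is None:
            return False              # nothing detected, keep speaking
        if addressed_to_proxy:
            playback.stop()           # the user addressed the proxy device: yield immediately
            return True
        return False                  # the user is talking to someone else: keep speaking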
  • the CPU (Central Processing Unit) 301 functions as a control unit or a data processing unit that executes various processes according to a program stored in the ROM (Read Only Memory) 302 or the storage unit 308. For example, the process according to the sequence described in the above-described embodiment is executed.
  • the RAM (Random Access Memory) 303 stores programs and data executed by the CPU 301. These CPU 301, ROM 302, and RAM 303 are connected to each other by a bus 304.
  • The CPU 301 is connected to the input/output interface 305 via the bus 304, and the input/output interface 305 is connected to an input unit 306 consisting of various switches, a keyboard, a mouse, a microphone, a sensor, and the like, and to an output unit 307 consisting of a display, a speaker, and the like.
  • the CPU 301 executes various processes in response to a command input from the input unit 306, and outputs the process results to, for example, the output unit 307.
  • the storage unit 308 connected to the input / output interface 305 is composed of, for example, a hard disk or the like, and stores programs executed by the CPU 301 and various data.
  • the communication unit 309 functions as a transmission / reception unit for Wi-Fi communication, Bluetooth (registered trademark) (BT) communication, and other data communication via a network such as the Internet or a local area network, and communicates with an external device.
  • the drive 310 connected to the input / output interface 305 drives a removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory such as a memory card, and records or reads data.
  • the technology disclosed in the present specification can have the following configuration.
  • (1) An information processing device having a data processing unit that inputs a device utterance output from a user utterance partner device, which is the user's utterance partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance,
  • wherein the data processing unit has a response necessity determination unit that determines the necessity of the user proxy utterance, and
  • a response generation unit that generates the user proxy utterance when the response necessity determination unit determines that a user proxy utterance is necessary.
  • (2) The information processing device according to (1), wherein the response necessity determination unit determines that a user proxy utterance is necessary when a user utterance is not made within a predetermined threshold time from the utterance completion timing of the device utterance.
  • (3) The information processing device according to (1) or (2), wherein the response necessity determination unit determines that the user proxy utterance is unnecessary when a user utterance is made within a predetermined threshold time from the utterance completion timing of the device utterance.
  • (4) The information processing device according to any one of (1) to (3), wherein, when a user utterance is made within the predetermined threshold time from the utterance completion timing of the device utterance, the response necessity determination unit determines whether the user utterance is an utterance made to the user utterance partner device, determines that the user proxy utterance is necessary when it is determined that the user utterance is not an utterance made to the user utterance partner device, and
  • determines that the user proxy utterance is unnecessary when it is determined that the user utterance is an utterance made to the user utterance partner device.
  • (5) The information processing device according to (4), wherein the response necessity determination unit determines whether or not the user utterance is an utterance made to the user utterance partner device by using the semantic analysis result of the user utterance.
  • (6) The information processing device according to (4), wherein the response necessity determination unit determines whether or not the user utterance is an utterance made to the user utterance partner device by using the analysis result of the user's line-of-sight direction.
  • (7) The information processing device according to any one of (1) to (6), wherein, in the case where the proxy utterance has not been executed even once after the start of the dialogue sequence with the user utterance partner device, the response necessity determination unit determines the necessity of the user proxy utterance based on whether or not a user utterance is made within the predetermined threshold time from the utterance completion timing of the device utterance.
  • (8) The information processing device according to (7), wherein the response necessity determination unit determines that all subsequent device utterances require a user proxy utterance when the proxy utterance has been executed one or more times after the start of the dialogue sequence with the user utterance partner device.
  • (9) The information processing device according to any one of (1) to (8), further having a voice recognition unit that identifies whether an input utterance to the information processing device is a user utterance uttered by the user or a device utterance output by the user utterance partner device and generates an utterance subject identifier as the identification result,
  • wherein the response necessity determination unit determines the necessity of a user proxy utterance when it is confirmed, based on the utterance subject identifier input from the voice recognition unit, that the input utterance to the information processing device is a device utterance output by the user utterance partner device.
  • (10) The information processing device according to any one of (1) to (9), wherein the response generation unit selects, from an input utterance correspondence response database that stores a large number of entries associating sample input utterances with response utterances, the entry whose sample input utterance has the highest similarity to the device utterance, and
  • sets the response utterance of the selected entry as the user proxy utterance.
  • (11) The information processing device according to (10), wherein, when selecting the entry whose sample input utterance has the highest similarity to the device utterance, the response generation unit determines the similarity by comparing the device utterance with each sample input utterance at the level of morphemes, words, or phrases.
  • (12) The information processing device according to any one of (1) to (11), wherein the response generation unit estimates the user's intention by referring to the user action history information and generates a user proxy utterance that reflects the user's intention.
  • (13) The information processing device according to any one of (1) to (12), wherein the response generation unit estimates the user's intention by referring to the user profile information or the registration information of an external server and generates a user proxy utterance that reflects the user's intention.
  • (14) The information processing device according to any one of (1) to (13), wherein the response generation unit generates a user proxy utterance to which expressions frequently used by the user are applied.
  • (15) An information processing method executed in an information processing device, the information processing device having a data processing unit that inputs a device utterance output from a user utterance partner device, which is the user's utterance partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance,
  • wherein the data processing unit executes a response necessity determination process for determining the necessity of the user proxy utterance, and a response generation process for generating the user proxy utterance when it is determined in the response necessity determination process that a user proxy utterance is necessary.
  • (16) A program that causes an information processing device to execute information processing, the information processing device having a data processing unit that inputs a device utterance output from a user utterance partner device, which is the user's utterance partner, and generates and outputs a user proxy utterance in place of the user in response to the device utterance,
  • wherein the program causes the data processing unit to execute a response necessity determination process for determining the necessity of the user proxy utterance, and a response generation process for generating the user proxy utterance when it is determined in the response necessity determination process that a user proxy utterance is necessary.
  • the series of processes described in the specification can be executed by hardware, software, or a composite configuration of both.
  • When the processing is executed by software, a program recording the processing sequence can be installed in the memory of a computer built into dedicated hardware and executed, or it can be installed and executed on a general-purpose computer capable of executing various kinds of processing.
  • the program can be pre-recorded on a recording medium.
  • the various processes described in the specification are not only executed in chronological order according to the description, but may also be executed in parallel or individually as required by the processing capacity of the device that executes the processes.
  • the system is a logical set configuration of a plurality of devices, and the devices having each configuration are not limited to those in the same housing.
  • A device and a method for generating and outputting a response utterance on behalf of the user in response to a device utterance output by an interactive device are realized.
  • Specifically, a device utterance from the user utterance partner device, which is the user's utterance partner, is input, and a user proxy utterance is generated and output on behalf of the user. The device has a response necessity determination unit that determines the necessity of a user proxy utterance, and a response generation unit that generates the user proxy utterance when it is determined that a user proxy utterance is necessary.
  • the response generation unit generates and outputs a proxy utterance that reflects the user's intention by referring to, for example, the user action history information.
  • the response necessity determination unit determines that the user proxy utterance is necessary when the user utterance is not performed within the predetermined threshold time from the utterance completion timing of the device utterance.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Navigation (AREA)

Abstract

The present invention relates to a device and a method for generating and outputting, in place of a user, a response utterance to a device utterance output by an interactive device. The device of the present invention receives an input of a device utterance from a user utterance partner device serving as the user's utterance partner, and generates and outputs a user proxy utterance in place of the user. The device comprises a response necessity determination unit that determines whether or not the user proxy utterance is necessary, and a response generation unit that generates the user proxy utterance when the user proxy utterance is determined to be necessary. The response generation unit generates and outputs the proxy utterance reflecting the user's intention by referring, for example, to the user's behavior history information. The response necessity determination unit determines that the user proxy utterance is necessary when a user utterance is not made within a previously prescribed threshold time from the utterance completion timing of the device utterance.
PCT/JP2021/001072 2020-02-20 2021-01-14 Dispositif de traitement d'informations, procédé de traitement d'informations et programme WO2021166504A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020026870A JP2021131472A (ja) 2020-02-20 2020-02-20 情報処理装置、および情報処理方法、並びにプログラム
JP2020-026870 2020-02-20

Publications (1)

Publication Number Publication Date
WO2021166504A1 true WO2021166504A1 (fr) 2021-08-26

Family

ID=77392121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/001072 WO2021166504A1 (fr) 2020-02-20 2021-01-14 Dispositif de traitement d'informations, procédé de traitement d'informations et programme

Country Status (2)

Country Link
JP (1) JP2021131472A (fr)
WO (1) WO2021166504A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000048038A (ja) * 1998-07-29 2000-02-18 Nec Corp 自然言語対話システム及び自然言語対話プログラム記録媒体
JP2000105596A (ja) * 1998-07-27 2000-04-11 Canon Inc 情報処理装置及びその方法、及びそのプログラムを記憶した記憶媒体
JP2004301980A (ja) * 2003-03-31 2004-10-28 Mitsubishi Electric Corp 音声対話装置及び音声対話代行装置並びにそれらのプログラム
WO2015128960A1 (fr) * 2014-02-26 2015-09-03 三菱電機株式会社 Appareil de commande et procede de commande embarques dans un vehicule
WO2016051519A1 (fr) * 2014-09-30 2016-04-07 三菱電機株式会社 Système de reconnaissance vocale
WO2018163647A1 (fr) * 2017-03-10 2018-09-13 日本電信電話株式会社 Procédé de dialogue, système de dialogue, dispositif de dialogue et programme

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000105596A (ja) * 1998-07-27 2000-04-11 Canon Inc 情報処理装置及びその方法、及びそのプログラムを記憶した記憶媒体
JP2000048038A (ja) * 1998-07-29 2000-02-18 Nec Corp 自然言語対話システム及び自然言語対話プログラム記録媒体
JP2004301980A (ja) * 2003-03-31 2004-10-28 Mitsubishi Electric Corp 音声対話装置及び音声対話代行装置並びにそれらのプログラム
WO2015128960A1 (fr) * 2014-02-26 2015-09-03 三菱電機株式会社 Appareil de commande et procede de commande embarques dans un vehicule
WO2016051519A1 (fr) * 2014-09-30 2016-04-07 三菱電機株式会社 Système de reconnaissance vocale
WO2018163647A1 (fr) * 2017-03-10 2018-09-13 日本電信電話株式会社 Procédé de dialogue, système de dialogue, dispositif de dialogue et programme

Also Published As

Publication number Publication date
JP2021131472A (ja) 2021-09-09

Similar Documents

Publication Publication Date Title
US10978094B2 (en) Method of and system for real time feedback in an incremental speech input interface
US11237793B1 (en) Latency reduction for content playback
US11720326B2 (en) Audio output control
US11423885B2 (en) Utilizing pre-event and post-event input streams to engage an automated assistant
US10649727B1 (en) Wake word detection configuration
KR101857648B1 (ko) 지능형 디지털 어시스턴트에 의한 사용자 트레이닝
US11687526B1 (en) Identifying user content
US11574637B1 (en) Spoken language understanding models
JP2016122183A (ja) 音声合成における同綴異音異義語の曖昧さの解消
US10672379B1 (en) Systems and methods for selecting a recipient device for communications
US20200357399A1 (en) Communicating announcements
US20200219487A1 (en) Information processing apparatus and information processing method
JP7347217B2 (ja) 情報処理装置、情報処理システム、および情報処理方法、並びにプログラム
US11756544B2 (en) Selectively providing enhanced clarification prompts in automated assistant interactions
US20200327888A1 (en) Dialogue system, electronic apparatus and method for controlling the dialogue system
JP2021148974A (ja) 音声対話装置、音声対話システム、プログラムおよび音声対話方法
WO2021166504A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations et programme
US11024303B1 (en) Communicating announcements
US10841411B1 (en) Systems and methods for establishing a communications session
US11893984B1 (en) Speech processing system
US11076018B1 (en) Account association for voice-enabled devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21756284

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21756284

Country of ref document: EP

Kind code of ref document: A1