WO2021024466A1 - Voice interaction device, voice interaction method, and program recording medium - Google Patents


Info

Publication number
WO2021024466A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voice signal
servers
signal
wakeup word
Prior art date
Application number
PCT/JP2019/031423
Other languages
French (fr)
Japanese (ja)
Inventor
小谷 亮
Original Assignee
Mitsubishi Electric Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2019/031423 priority Critical patent/WO2021024466A1/en
Priority to JP2021537527A priority patent/JP7224470B2/en
Publication of WO2021024466A1 publication Critical patent/WO2021024466A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems

Definitions

  • the present invention relates to a voice dialogue device, a voice dialogue method, and a program recording medium.
  • In recent years, voice dialogue systems that enable voice dialogue with humans have been attracting attention.
  • the voice dialogue system transmits voice data to a server via a network, and the server performs voice recognition processing and voice synthesis processing.
  • Such systems enable the provision of services called personal assistants, AI (Artificial Intelligence) assistants, or virtual assistants. Examples of such systems or services include Amazon Echo (registered trademark), Google Home (registered trademark), Siri (registered trademark), and Alexa (registered trademark).
  • the server of these voice dialogue systems starts voice recognition processing based on the wakeup word included in the input voice.
  • The wakeup word is a phrase that is registered in advance and that triggers the start of the voice recognition process.
  • The wakeup word usually varies from system to system. For example, "Alexa" in Amazon's Echo, "Siri" in Apple's Siri, and "OK, Google" in Google's Google Home are known as wakeup words.
  • The present invention has been made to solve the above problem, and an object of the present invention is to provide a voice dialogue device that enables a user to make inquiries to a plurality of servers at one time.
  • the voice dialogue device transmits a voice signal to a server that performs voice recognition processing on the voice spoken by the user.
  • the voice dialogue device includes a voice signal acquisition unit and a wakeup word division unit.
  • the voice signal acquisition unit acquires an input voice signal corresponding to the voice.
  • The wakeup word division unit transmits a voice signal based on the input voice signal to the plurality of servers when the input voice signal contains a universal wakeup word indicating a plurality of servers that perform voice recognition processing.
  • According to the present invention, a voice dialogue device is provided that enables a user to make inquiries to a plurality of servers at one time.
  • FIG. 1 is a block diagram showing the configuration of the voice dialogue device in Embodiment 1.
  • FIG. 2 is a diagram showing an example of the configuration of the processing circuit included in the voice dialogue device.
  • FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device.
  • FIG. 4 is a flowchart showing the voice dialogue method in Embodiment 1.
  • FIG. 5 is a block diagram showing the configuration of the voice dialogue device in Embodiment 2.
  • FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device in Embodiment 2.
  • FIG. 7 is a flowchart showing the voice dialogue method in Embodiment 2.
  • FIG. 8 is a flowchart showing the response signal reproduction processing in Embodiment 2.
  • FIG. 9 is a block diagram showing the configuration of the voice dialogue device in Embodiment 3.
  • FIG. 10 is a diagram showing an example of the response signal including the effectiveness signal in Embodiment 3.
  • FIG. 11 is a flowchart showing the voice dialogue method in Embodiment 3.
  • FIG. 12 is a flowchart showing the response signal reproduction processing in Embodiment 3.
  • FIG. 13 is a block diagram showing the configuration of the voice dialogue device and the devices that operate in connection with it in Embodiment 4.
  • FIG. 1 is a block diagram showing the configuration of the voice dialogue device 100 according to the first embodiment.
  • the voice dialogue device 100 is connected to a plurality of servers 200 via a network. Each of the plurality of servers 200 has a function of performing voice recognition processing on the input voice.
  • The voice dialogue device 100 in the first embodiment is connected, as the plurality of servers 200, to the first server 210 through the third server 230.
  • The first server 210 to the third server 230 each have an individual voice recognition processing function.
  • The first server 210 to the third server 230 are operated by business operators that provide different voice recognition processing services.
  • However, the number of servers connected to the voice dialogue device 100 is not limited to three.
  • Each of the plurality of servers 200 has a function of starting the voice recognition process based on the wakeup word included in the voice signal input to itself.
  • the wake-up word is a word that triggers each of the plurality of servers 200 to start the voice recognition process.
  • the voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word division unit 20.
  • the voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice uttered by the user.
  • the voice is acquired by, for example, the microphone 110.
  • the wakeup word dividing unit 20 detects whether or not the universal wakeup word is included in the input voice signal.
  • the universal wakeup word is a word that collectively indicates a plurality of servers 200.
  • For example, the universal wakeup word is "OK, anybody", "OK, everybody", or the like.
  • The universal wakeup word may be a phrase not in general use, a newly created phrase, a coined word, or the like.
  • When the universal wakeup word is detected, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • FIG. 2 is a diagram showing an example of the configuration of the processing circuit 90 included in the voice dialogue device 100.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 is realized by the processing circuit 90. That is, the processing circuit 90 includes the voice signal acquisition unit 10 and the wakeup word division unit 20.
  • The processing circuit 90 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a circuit combining these.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 may be individually realized by a plurality of processing circuits, or may be collectively realized by one processing circuit.
  • FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device 100.
  • the processing circuit includes a processor 91 and a memory 92.
  • When the processing circuit includes the processor 91, each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 is realized by software or firmware.
  • Each function is realized by the processor 91 executing software or firmware written as a voice dialogue program.
  • the voice dialogue device 100 has a memory 92 for storing the voice dialogue program and a processor 91 for executing the voice dialogue program.
  • the memory 92 is a program recording medium.
  • The voice dialogue program describes a function in which the voice dialogue device 100 acquires an input voice signal corresponding to the voice spoken by the user and, when the input voice signal includes a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200. Further, the voice dialogue program causes a computer to execute the procedures or methods of the voice signal acquisition unit 10 and the wakeup word division unit 20.
  • the processor 91 is, for example, a CPU (Central Processing Unit), an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor), or the like.
  • The memory 92 is, for example, a non-volatile or volatile semiconductor memory such as RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable Read Only Memory), or EEPROM (Electrically Erasable Programmable Read Only Memory).
  • The memory 92 may be a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, a DVD, or any storage medium to be used in the future.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 described above may be partially realized by dedicated hardware and the other part may be realized by software or firmware.
  • the processing circuit realizes each of the above-mentioned functions by hardware, software, firmware, or a combination thereof.
  • FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
  • In step S1, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user.
  • For example, the user utters "OK, anybody. Where can I buy the product of company X?"
  • The microphone 110 acquires the voice.
  • The voice signal acquisition unit 10 acquires the input voice signal from the microphone 110.
  • In step S2, the wakeup word division unit 20 analyzes whether the input voice signal includes a universal wakeup word.
  • The universal wakeup word to be analyzed is registered in the voice dialogue device 100 in advance.
  • For example, "OK, anybody" and "OK, everybody" are registered in advance in the voice dialogue device 100 as universal wakeup words.
  • In step S3, the wakeup word division unit 20 determines whether or not a universal wakeup word has been detected. If a universal wakeup word is detected, step S4 is executed. If no universal wakeup word is detected, the voice dialogue method ends.
  • In step S4, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Each of the plurality of servers 200 starts the voice recognition process based on the universal wakeup word included in the voice signal received from the voice dialogue device 100. Then, each of the plurality of servers 200 transmits a response signal based on the result of the voice recognition process to the voice dialogue device 100.
  • the voice dialogue device 100 receives response signals from a plurality of servers 200. When the response signal is reproduced by a voice output device (not shown), a dialogue with the user is established.
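The flow of steps S1 to S4 above can be sketched as follows. This is a hypothetical illustration, not the patented implementation: the wakeup word detection is reduced to a substring match on already-recognized text, and `server.send` stands in for the network transmission to the plurality of servers 200.

```python
# Hypothetical sketch of steps S1-S4: detect a pre-registered universal
# wakeup word and, if one is found, forward the voice signal to every server.
# Names (handle_utterance, server.send) are illustrative, not from the patent.

UNIVERSAL_WAKEUP_WORDS = ("OK, anybody", "OK, everybody")  # registered in advance

def handle_utterance(recognized_text, voice_signal, servers):
    """Return the servers the voice signal was sent to (empty if no match)."""
    # Steps S2-S3: analyze the input for a universal wakeup word
    if not any(w.lower() in recognized_text.lower() for w in UNIVERSAL_WAKEUP_WORDS):
        return []  # no universal wakeup word detected: the method ends
    # Step S4: transmit a voice signal based on the input to all servers
    for server in servers:
        server.send(voice_signal)
    return list(servers)
```

In this sketch a single utterance containing "OK, anybody" reaches every connected server at once, which is the effect the embodiment describes.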
  • the voice dialogue device 100 transmits a voice signal to a server that performs voice recognition processing on the voice uttered by the user.
  • the voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word division unit 20.
  • the voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice.
  • When the input voice signal includes a universal wakeup word indicating the plurality of servers 200 that perform voice recognition processing, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Such a voice dialogue device 100 completes a user's inquiry to a plurality of servers 200 at once.
  • the user can make inquiries to a plurality of servers 200 at once, and even if one server cannot answer, it is not necessary to make a second utterance to another server.
  • the voice dialogue device 100 can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
  • In the voice dialogue method according to the first embodiment, a voice signal is transmitted to a server that performs voice recognition processing on the voice uttered by the user.
  • The voice dialogue method acquires an input voice signal corresponding to the voice and, when the input voice signal includes a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Such a voice dialogue method completes a user's inquiry to a plurality of servers 200 at once.
  • the voice dialogue method can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of voice dialogue.
  • the voice dialogue device and the voice dialogue method according to the second embodiment will be described.
  • the second embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the second embodiment includes each configuration of the voice dialogue device 100 in the first embodiment. The same configuration and operation as in the first embodiment will not be described.
  • FIG. 5 is a block diagram showing the configuration of the voice dialogue device 101 according to the second embodiment.
  • Each of the plurality of servers 200 in the second embodiment can recognize an individual wakeup word indicating its own server, but cannot recognize a universal wakeup word. For example, when the user says "OK, anybody. Where can I buy the product of company X?", each of the plurality of servers 200 cannot recognize the "OK, anybody" portion as a wakeup word.
  • the voice dialogue device 101 includes a communication processing unit 30 and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word division unit 20 of the first embodiment. Further, the wakeup word division unit 20 is different from the first embodiment in the functions shown below.
  • The wakeup word division unit 20 deletes the universal wakeup word from the input voice signal to generate a main voice signal. Then, the wakeup word division unit 20 transmits the main voice signal to the plurality of servers 200. The wakeup word division unit 20 in the second embodiment transmits the main voice signal via the communication processing unit 30.
  • The communication processing unit 30 is connected to the network 130 and transmits the main voice signal output from the wakeup word division unit 20 to each of the plurality of servers 200. Further, the communication processing unit 30 receives the response signals transmitted from each of the plurality of servers 200 and outputs them to the response signal output unit 40.
  • the response signal output unit 40 receives the response signal.
  • the response signal output unit 40 according to the second embodiment outputs the response signals in the order in which the response signals are received from the plurality of servers 200.
  • the response signal received from the server is a voice signal, a text signal, etc.
  • A voice signal as the response signal is, for example, a PCM (pulse code modulation) signal or a signal compressed in the mp3 file format, and the response signal output unit 40 outputs the voice signal to the speaker 120.
  • When the response signal is a text signal, the response signal output unit 40 generates, by voice synthesis processing, a voice signal that the speaker 120 can output as voice based on the text signal, and outputs the voice signal to the speaker 120.
  • the speaker 120 outputs voice based on the response signal.
  • FIG. 6 is a diagram showing a hardware configuration of the voice dialogue device 101 according to the second embodiment.
  • the voice dialogue device 101 includes a main processing unit 93 and a program recording medium 94.
  • the main processing unit 93 corresponds to the processing circuits shown in FIGS. 2 and 3.
  • the program recording medium 94 corresponds to the memory 92 shown in FIG.
  • The functions of the voice signal acquisition unit 10, the wakeup word division unit 20, the communication processing unit 30, and the response signal output unit 40 in the second embodiment are realized by the main processing unit 93. Further, the program recording medium 94 stores a voice dialogue program in which the functions of the voice signal acquisition unit 10, the wakeup word division unit 20, the communication processing unit 30, and the response signal output unit 40 are described. Each of the above functions is realized by the main processing unit 93 executing the voice dialogue program.
  • FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
  • In step S10, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As in the first embodiment, the user utters "OK, anybody. Where can I buy the product of company X?", and the voice signal acquisition unit 10 acquires the input voice signal corresponding to the voice.
  • In step S20, the wakeup word division unit 20 analyzes whether the input voice signal includes a wakeup word.
  • the wake-up word to be analyzed is registered in the voice dialogue device 101 in advance.
  • individual wakeup words indicating a specific server and universal wakeup words are registered in advance as wakeup words to be analyzed.
  • In step S30, the wakeup word division unit 20 determines whether or not a wakeup word has been detected. If a wakeup word is detected, step S40 is executed. If no wakeup word is detected, the voice dialogue method ends.
  • In step S40, the wakeup word division unit 20 determines whether or not the detected wakeup word is a universal wakeup word. If it is not a universal wakeup word, that is, if the detected wakeup word is an individual wakeup word indicating a specific server, step S50 is executed. If it is a universal wakeup word, step S60 is executed.
  • In step S50, the wakeup word division unit 20 selects the specific server as the transmission destination.
  • In step S60, the wakeup word division unit 20 selects the plurality of servers 200 as the transmission destinations.
  • In step S70, the wakeup word division unit 20 deletes the universal wakeup word from the input voice signal to generate the main voice signal.
  • In this example, the wakeup word division unit 20 deletes the voice signal corresponding to the universal wakeup word "OK, anybody" from the input voice signal and generates a main voice signal corresponding to "Where can I buy the product of company X?".
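Step S70 might look like the following sketch. It is a hypothetical illustration in which the signal is modeled as recognized text rather than audio samples, so deleting the universal wakeup word becomes stripping a known prefix; a real device would remove the corresponding section of the audio signal itself.

```python
# Hypothetical sketch of step S70: delete the universal wakeup word from
# the input to generate the main signal. The signal is modeled as text;
# the function and variable names are illustrative, not from the patent.

UNIVERSAL_WAKEUP_WORDS = ("OK, anybody", "OK, everybody")

def make_main_signal(input_text):
    for word in UNIVERSAL_WAKEUP_WORDS:
        if input_text.startswith(word):
            # drop the wakeup word and any separating punctuation or space
            return input_text[len(word):].lstrip(" .,")
    return input_text  # no universal wakeup word: signal unchanged
```

For example, under these assumptions `make_main_signal("OK, anybody. Where can I buy the product of company X?")` yields `"Where can I buy the product of company X?"`.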
  • In step S80, the communication processing unit 30 transmits a voice signal to the server selected in step S50 or S60. That is, when the voice dialogue process goes through step S50, the communication processing unit 30 transmits the input voice signal to the specific server. When the voice dialogue process goes through steps S60 and S70, the communication processing unit 30 transmits the main voice signal to the plurality of servers 200.
  • In step S90, the response signal reproduction process is executed.
  • FIG. 8 is a flowchart showing the response signal reproduction processing according to the second embodiment.
  • In step S91, the communication processing unit 30 receives response signals from the plurality of servers 200.
  • In step S92, the response signal output unit 40 outputs the response signal to the speaker 120.
  • When the response signal received from any of the servers is a text signal, the response signal output unit 40 generates a voice signal based on the text signal by voice synthesis processing and outputs it to the speaker 120. By such processing, the speaker 120 can reproduce the response voices in the order in which the response signals are received from the plurality of servers 200.
  • In step S93, the communication processing unit 30 determines whether or not response signals have been received from all the target servers.
  • The target servers are the servers to which the voice signal was transmitted in step S80, that is, the specific server or the plurality of servers 200. If response signals have not yet been received from all the target servers, step S91 is executed again. When the response signals have been received from all the target servers, the response signal reproduction process ends. Then, the voice dialogue method shown in FIG. 7 ends.
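The receive-and-play loop of steps S91 to S93 can be sketched as follows. This is an illustration under stated assumptions: responses arrive on a queue, `synthesize` stands in for the voice synthesis processing, and `play` stands in for output to the speaker 120; none of these names come from the patent.

```python
import queue

# Hypothetical sketch of steps S91-S93: play each response in the order
# it arrives, synthesizing text responses first, until every target
# server has answered. All names are illustrative assumptions.

def play_responses(response_queue, expected_count, synthesize, play):
    received = 0
    while received < expected_count:          # step S93: have all servers answered?
        kind, payload = response_queue.get()  # step S91: wait for the next response
        if kind == "text":
            payload = synthesize(payload)     # text responses need voice synthesis
        play(payload)                         # step S92: output to the speaker
        received += 1
```

Because the loop plays whatever arrives first, the response of the fastest server is reproduced first, matching the ordering behavior described above.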
  • the wakeup word dividing unit 20 in the second embodiment deletes the universal wakeup word from the input voice signal to generate the main voice signal. Then, the wakeup word dividing unit 20 transmits the main audio signal to the plurality of servers 200.
  • the voice dialogue device 101 transmits only the main voice signal corresponding to the specific inquiry content to the server. Therefore, the accuracy of the voice dialogue is improved.
  • the voice dialogue device 101 has an effect that the user can complete inquiries to the plurality of servers 200 at once simply by connecting to the plurality of servers 200 already in operation.
  • The response signal output unit 40 outputs the response signals to the speaker 120 in the order in which they are received from the plurality of servers 200. Therefore, the response voices can be reproduced starting from the server whose response signal is returned first.
  • the voice dialogue device and the voice dialogue method according to the third embodiment will be described.
  • the third embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the third embodiment includes each configuration of the voice dialogue device 100 in the first embodiment.
  • the description of the configuration and operation similar to those of the first or second embodiment will be omitted.
  • FIG. 9 is a block diagram showing the configuration of the voice dialogue device 102 according to the third embodiment.
  • Each of the plurality of servers 200 in the third embodiment recognizes an individual wakeup word indicating its own server, but does not recognize a universal wakeup word.
  • For example, the first server 210 recognizes "AAA" as an individual wakeup word.
  • The second server 220 recognizes "BBB" as an individual wakeup word.
  • The third server 230 recognizes "OK, CCC" as an individual wakeup word.
  • "AAA", "BBB", and "CCC" are, for example, names or abbreviations of voice recognition processing services. For example, when the user calls "Hey, BBB", the second server 220 recognizes the wakeup word "BBB" and starts the voice recognition process. Alternatively, when the user calls "OK, CCC", the third server 230 recognizes the wakeup word "OK, CCC" and starts the voice recognition process.
  • an individual wakeup word indicating each of the plurality of servers 200 is registered in advance.
  • the voice dialogue device 102 includes a wakeup word adding unit 50, a communication processing unit 30, and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word dividing unit 20 of the first embodiment. Further, the wakeup word division unit 20 is different from the first embodiment in the functions shown below.
  • The wakeup word division unit 20 generates a main voice signal in which the universal wakeup word is deleted from the input voice signal, as in the second embodiment. Further, the wakeup word division unit 20 of the third embodiment transmits the voice signal, to which an individual wakeup word has been added by the wakeup word adding unit 50 described later, to each specific server indicated by that individual wakeup word. In the third embodiment, the wakeup word division unit 20 transmits the voice signal via the communication processing unit 30.
  • The wakeup word adding unit 50 adds, to the main voice signal, an individual voice signal corresponding to each of the individual wakeup words indicating each of the plurality of servers 200.
  • The wakeup word adding unit 50 in the third embodiment concatenates the individual voice signal before the main voice signal to generate the voice signal.
  • the individual voice signal is stored in the memory 92 as a fixed value, for example.
  • The communication processing unit 30 is connected to the network 130 and transmits the voice signal output from the wakeup word division unit 20 to the servers. Further, the communication processing unit 30 receives the response signals transmitted from the servers and outputs them to the response signal output unit 40.
  • the response signal output unit 40 receives response signals from a plurality of servers 200.
  • the response signal output unit 40 receives the response signal via the communication processing unit 30.
  • the response signal in the third embodiment includes an effectiveness signal indicating the effectiveness of the response.
  • the response signal output unit 40 outputs a response signal to the speaker 120 based on the effectiveness signal. For example, when it is determined that the response is valid, the response signal output unit 40 outputs the response signal to the speaker 120.
  • the speaker 120 outputs voice based on the response signal.
  • FIG. 10 is a diagram showing an example of a response signal including an effectiveness signal in the third embodiment.
  • FIG. 10 shows a response signal described in the JSON (JavaScript (registered trademark) Object Notation) format. "effective" indicates the validity signal, and "payload" indicates the content of the response to be reproduced.
  • When the value of "effective" is "yes", the response signal output unit 40 outputs the response signal to the speaker 120, and the voice is reproduced from the speaker 120.
  • When the value of "effective" is "no", the response signal output unit 40 does not output the response signal to the speaker 120. That is, no voice is reproduced from the speaker 120.
  • the "payload” may be data in which a binary audio signal such as PCM (pulse code modulation) or mp3 is converted into a text format by a BASE64 format or the like.
  • "payload” may be a character string such as "Products of company X can be purchased at an online store”.
  • the response signal output unit 40 generates a voice signal corresponding to the text by voice synthesis processing.
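Handling a JSON response like the one in FIG. 10 might look like the following sketch. The "effective" and "payload" keys follow the description above; the "type" key distinguishing BASE64 audio from plain text is an assumption added for illustration, as are the function names `play` and `synthesize`.

```python
import base64
import json

# Hypothetical sketch of validity-based playback (step S122): reproduce a
# response only when its validity signal is "yes". The "type" field is an
# assumed addition; "effective" and "payload" follow the FIG. 10 example.

def handle_response(raw_json, play, synthesize):
    response = json.loads(raw_json)
    if response.get("effective") != "yes":
        return False                      # invalid response: reproduce nothing
    payload = response["payload"]
    if response.get("type") == "audio_base64":
        play(base64.b64decode(payload))   # binary PCM/mp3 carried as BASE64 text
    else:
        play(synthesize(payload))         # plain text: synthesize a voice signal
    return True
```

Under these assumptions, a response whose "effective" value is "no" is silently discarded, while a valid text response is synthesized before being output.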
  • the functions of the wakeup word dividing unit 20, the wakeup word imparting unit 50, the communication processing unit 30, and the response signal output unit 40 are realized by the processing circuit shown in FIG. 2 or FIG.
  • FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
  • Steps S10 to S70 are the same as in the second embodiment. Step S100 is executed following step S70.
  • In step S100, the wakeup word adding unit 50 adds, to the main voice signal, the individual voice signal corresponding to each individual wakeup word.
  • For example, the wakeup word adding unit 50 concatenates the individual voice signal corresponding to "Hey, BBB" indicating the second server 220 before the main voice signal of "Where can I buy the product of company X?", and generates a voice signal corresponding to "Hey, BBB, where can I buy the product of company X?".
  • Similarly, the wakeup word adding unit 50 concatenates the individual voice signal corresponding to "OK, CCC" indicating the third server 230 before the main voice signal of "Where can I buy the product of company X?", and generates a voice signal corresponding to "OK, CCC, where can I buy the product of company X?".
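Step S100 can be sketched as follows, with signals modeled as byte strings so that concatenating the individual voice signal before the main voice signal becomes byte concatenation. The table of individual wakeup-word signals and all names are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of step S100: prepend each server's individual
# wakeup-word signal to the main signal. Signals are byte strings here;
# a real device would concatenate audio samples (stored as fixed values,
# e.g. in the memory 92).

INDIVIDUAL_WAKEUP_SIGNALS = {
    "second_server": b"Hey, BBB. ",
    "third_server": b"OK, CCC. ",
}

def build_per_server_signals(main_signal, destinations):
    """Return one wakeup-word-prefixed signal per destination server."""
    return {dest: INDIVIDUAL_WAKEUP_SIGNALS[dest] + main_signal
            for dest in destinations}
```

Each server then receives a signal that begins with the individual wakeup word it can recognize, which is why the existing servers need no modification.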
  • In step S110, the communication processing unit 30 transmits the voice signal to the servers selected in step S50 or S60.
  • In step S120, the response signal reproduction process is executed.
  • FIG. 12 is a flowchart showing the response signal reproduction processing according to the third embodiment.
  • In step S121, the communication processing unit 30 receives response signals from the plurality of servers 200.
  • In step S122, the response signal output unit 40 determines whether or not the response signal is valid based on the validity signal. If it is valid, step S123 is executed. If it is not valid, step S124 is executed.
  • In step S123, the response signal output unit 40 outputs the response signal to the speaker 120.
  • In step S124, the communication processing unit 30 determines whether or not response signals have been received from all the target servers. If response signals have not yet been received from all the target servers, step S121 is executed again. When the response signals have been received from all the target servers, the response signal reproduction process ends. Then, the voice dialogue method shown in FIG. 11 ends.
  • The voice dialogue device 102 in the third embodiment includes the wakeup word adding unit 50.
  • The wakeup word adding unit 50 adds, to the voice signal, an individual voice signal corresponding to each of the individual wakeup words indicating each of the plurality of servers 200.
  • The wakeup word division unit 20 transmits the voice signal to each specific server indicated by the individual wakeup word, based on the individual voice signal added to the voice signal.
  • In this way, the voice dialogue device 102 adds an individual wakeup word for each server and transmits the voice signal to each server. Therefore, the accuracy of the voice dialogue with each server is improved.
  • the voice dialogue device 102 in the third embodiment includes a response signal output unit 40.
  • The response signal output unit 40 receives a plurality of response signals for the voice signal from the plurality of servers 200 and outputs the plurality of response signals to the voice output device based on the validity signal, indicating the validity of the response, included in each of the plurality of response signals.
  • Such a voice dialogue device 102 can cause the voice output device to reproduce only valid answers among the responses received from the servers. For example, suppose the content of the responses of the first server 210 and the second server 220 is "I do not know" with the value of the validity signal being "invalid", while the content of the response of the third server 230 is "The products of company X can be purchased at company X's online store" with the value of the validity signal being "valid". In that case, the voice dialogue device 102 causes the voice output device to reproduce only the response of the third server 230.
  • the voice dialogue device 102 can give priority to a good answer, that is, an information-rich answer, and cause the voice output device to reproduce the answer.
  • The universal wakeup word in the modified example of the third embodiment indicates the plurality of servers 200 other than a specific server.
  • For example, the universal wakeup word "OK, other than AAA" indicates the second server 220 and the third server 230, that is, the servers other than the first server 210.
  • In step S60 of FIG. 11, the wakeup word dividing unit 20 selects the second server 220 and the third server 230, that is, the plurality of servers 200 other than the specific server, as the transmission destinations. The subsequent steps are the same as the corresponding steps of FIG. 11, and the wakeup word dividing unit 20 transmits the voice signal to the second server 220 and the third server 230.
  • The voice dialogue device shown in each of the above embodiments can also be applied to a system constructed by appropriately combining a navigation device, a communication terminal, a server, and the functions of applications installed in the navigation device.
  • Navigation devices include, for example, PNDs (Portable Navigation Devices).
  • Communication terminals include, for example, mobile terminals such as mobile phones, smartphones, and tablets.
  • FIG. 13 is a block diagram showing the configuration of the voice dialogue device 100 according to the fourth embodiment and the devices that operate in connection with it.
  • In the fourth embodiment, the voice dialogue device 100 and the communication device 150 are provided in the wakeup word recognition server 300.
  • The voice dialogue device 100 acquires an input voice signal from the microphone 110 provided in the vehicle 1 via the communication device 140 and the communication device 150.
  • When a universal wakeup word is included in the input voice signal, the voice dialogue device 100 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • The voice dialogue device 100 receives response signals from the plurality of servers 200 and outputs them to the speaker 120 provided in the vehicle 1 via each communication device.
  • By providing the voice dialogue device 100 in the wakeup word recognition server 300 in this way, the configuration of the in-vehicle device can be simplified.
  • Alternatively, part of the voice dialogue device 100 may be provided in the wakeup word recognition server 300, and the remaining part may be distributed to the vehicle 1.
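The validity-based filtering described for the third embodiment above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the response dictionaries and field names are assumptions.

```python
# Hedged sketch of the Embodiment 3 behavior: keep only the responses whose
# validity signal is "valid" and reproduce those. Field names are illustrative.

def select_valid_responses(responses):
    """Return the responses marked valid by their validity signal."""
    return [r for r in responses if r.get("validity") == "valid"]

responses = [
    {"server": "first server 210",  "validity": "invalid", "content": "I do not know"},
    {"server": "second server 220", "validity": "invalid", "content": "I do not know"},
    {"server": "third server 230",  "validity": "valid",
     "content": "Company X's products can be purchased from Company X's online store"},
]

for r in select_valid_responses(responses):
    # only the third server's answer would be reproduced by the voice output device
    print(r["server"], "->", r["content"])
```

Prioritizing an information-rich answer, as the bullets describe, would then be a ranking over this surviving set rather than over all received responses.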

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The objective of the present invention is to provide a voice interaction device with which it is possible to handle an inquiry from a user to a plurality of servers at one time. The voice interaction device includes a voice signal acquisition unit and a wakeup word division unit. The voice signal acquisition unit acquires an input voice signal corresponding to a voice. When a universal wakeup word indicating a plurality of servers that perform voice recognition processing is included in the input voice signal, the wakeup word division unit transmits, to the plurality of servers, a voice signal based on the input voice signal.

Description

Voice dialogue device, voice dialogue method, and program recording medium
 The present invention relates to a voice dialogue device, a voice dialogue method, and a program recording medium.
 As the accuracy of voice recognition technology has increased, voice dialogue systems capable of spoken dialogue with humans have come into the limelight. A voice dialogue system transmits voice data to a server via a network, and the server performs voice recognition processing and voice synthesis processing. Such systems enable the provision of services called personal assistants, AI (Artificial Intelligence) assistants, or virtual assistants; known examples include Echo (registered trademark) of Amazon (registered trademark) and Google Home (registered trademark) of Google (registered trademark). As systems and services installed in smartphones, Siri (registered trademark) of Apple (registered trademark), Google Assistant of Google, Alexa (registered trademark) of Amazon, and the like are known.
 The servers of these voice dialogue systems start voice recognition processing based on a wakeup word included in the input voice. A wakeup word is a phrase that is registered in advance and serves as a trigger for starting voice recognition processing. The wakeup word usually differs from system to system. For example, "Alexa" for Amazon's Echo, "Siri" for Apple's Siri, and "OK, Google" for Google's Google Home are known wakeup words.
Japanese Unexamined Patent Publication No. 2018-181330
 As described above, services are provided by many voice dialogue systems, so a user is often in an environment where multiple services are available, that is, an environment with access to multiple servers capable of voice recognition processing. In such an environment, if one server cannot answer a user's inquiry appropriately, the user must speak again with a different wakeup word in order to ask another server.
 The present invention has been made to solve the above problem, and an object of the present invention is to provide a voice dialogue device that allows a user's inquiry to a plurality of servers to be completed in a single utterance.
 The voice dialogue device according to the present invention transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue device includes a voice signal acquisition unit and a wakeup word dividing unit. The voice signal acquisition unit acquires an input voice signal corresponding to the voice. When the input voice signal contains a universal wakeup word indicating a plurality of servers that perform voice recognition processing, the wakeup word dividing unit transmits a voice signal based on the input voice signal to the plurality of servers.
 According to the present invention, it is possible to provide a voice dialogue device that allows a user's inquiry to a plurality of servers to be completed at one time.
 The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the voice dialogue device in the first embodiment.
FIG. 2 is a diagram showing an example of the configuration of a processing circuit included in the voice dialogue device.
FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device.
FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
FIG. 5 is a block diagram showing the configuration of the voice dialogue device in the second embodiment.
FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device in the second embodiment.
FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
FIG. 8 is a flowchart showing the response signal reproduction processing in the second embodiment.
FIG. 9 is a block diagram showing the configuration of the voice dialogue device in the third embodiment.
FIG. 10 is a diagram showing an example of a response signal including a validity signal in the third embodiment.
FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
FIG. 12 is a flowchart showing the response signal reproduction processing in the third embodiment.
FIG. 13 is a block diagram showing the configuration of the voice dialogue device in the fourth embodiment and the devices that operate in connection with it.
<Embodiment 1>
FIG. 1 is a block diagram showing the configuration of the voice dialogue device 100 according to the first embodiment.
 The voice dialogue device 100 is connected to a plurality of servers 200 via a network. Each of the plurality of servers 200 has a function of performing voice recognition processing on input voice. In the first embodiment, the voice dialogue device 100 is connected to the first server 210 through the third server 230 as the plurality of servers 200. The first server 210 through the third server 230 each have their own voice recognition processing function; for example, they are operated by business operators that provide different voice recognition processing services. The number of servers connected to the voice dialogue device 100 is not limited to three.
 Each of the plurality of servers 200 has a function of starting voice recognition processing based on a wakeup word included in the voice signal input to it. A wakeup word is a word that triggers each of the plurality of servers 200 to start voice recognition processing.
 The voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word dividing unit 20.
 The voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice uttered by the user. The voice is captured by, for example, the microphone 110.
 The wakeup word dividing unit 20 detects whether a universal wakeup word is included in the input voice signal. A universal wakeup word is a word that collectively indicates the plurality of servers 200, for example, "OK, anybody" or "OK, everybody". Alternatively, a universal wakeup word may be a phrase with little worldwide usage history, a new phrase with no usage history, a coined word, or the like. These universal wakeup words are registered in the voice dialogue device 100 in advance.
 When a universal wakeup word is included in the input voice signal, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
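The dividing unit's core decision can be sketched as follows. This is a minimal illustration operating on recognized text, whereas the actual device works on the voice signal; `send` and the server list are stand-ins, not part of the patent.

```python
# Illustrative sketch (not the patent's implementation): if the input begins
# with a registered universal wakeup word, broadcast it to every server.

UNIVERSAL_WAKEUP_WORDS = ["OK, anybody", "OK, everybody"]  # registered in advance

def contains_universal_wakeup_word(input_text):
    return any(input_text.startswith(w) for w in UNIVERSAL_WAKEUP_WORDS)

def dispatch(input_text, servers, send):
    """send(server, text) stands in for the network transmission."""
    if contains_universal_wakeup_word(input_text):
        for server in servers:  # transmit to the plurality of servers 200
            send(server, input_text)
        return True
    return False
```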
 FIG. 2 is a diagram showing an example of the configuration of the processing circuit 90 included in the voice dialogue device 100. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 are realized by the processing circuit 90. That is, the processing circuit 90 includes the voice signal acquisition unit 10 and the wakeup word dividing unit 20.
 When the processing circuit 90 is dedicated hardware, the processing circuit 90 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 may be realized individually by a plurality of processing circuits, or collectively by one processing circuit.
 FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device 100. The processing circuit includes a processor 91 and a memory 92. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 are realized by the processor 91 executing a voice dialogue program stored in the memory 92; for example, each function is realized by the processor 91 executing software or firmware written as the voice dialogue program. Thus, the voice dialogue device 100 includes the memory 92 that stores the voice dialogue program and the processor 91 that executes it. In other words, the memory 92 is a program recording medium.
 The voice dialogue program describes a function by which the voice dialogue device 100 acquires an input voice signal corresponding to voice uttered by a user and, when a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing is contained in the input voice signal, transmits a voice signal based on the input voice signal to the plurality of servers 200. The voice dialogue program also causes a computer to execute the procedures or methods of the voice signal acquisition unit 10 and the wakeup word dividing unit 20.
 The processor 91 is, for example, a CPU (Central Processing Unit), an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor), or the like. The memory 92 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory). Alternatively, the memory 92 may be any storage medium to be used in the future, such as a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD.
 Some of the functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 described above may be realized by dedicated hardware, and the rest by software or firmware. In this way, the processing circuit realizes each of the above functions by hardware, software, firmware, or a combination thereof.
 FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
 In step S1, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As an example, the user utters "OK, anybody. Where can I buy Company X's products?", and the microphone 110 captures the voice. The voice signal acquisition unit 10 acquires the input voice signal from the microphone 110.
 In step S2, the wakeup word dividing unit 20 analyzes whether the input voice signal contains a universal wakeup word. The universal wakeup words to be detected are registered in the voice dialogue device 100 in advance. Here, "OK, anybody" and "OK, everybody" are registered in advance as universal wakeup words.
 In step S3, the wakeup word dividing unit 20 determines whether a universal wakeup word has been detected. If a universal wakeup word has been detected, step S4 is executed. If not, the voice dialogue method ends.
 In step S4, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200. In the first embodiment, the input voice signal corresponding to "OK, anybody. Where can I buy Company X's products?", that is, the input voice signal acquired by the voice signal acquisition unit 10, is transmitted to the plurality of servers 200.
 Each of the plurality of servers 200 starts voice recognition processing based on the universal wakeup word included in the voice signal received from the voice dialogue device 100. Then, each of the plurality of servers 200 transmits a response signal based on the result of its voice recognition processing to the voice dialogue device 100. The voice dialogue device 100 receives the response signals from the plurality of servers 200. When the response signals are reproduced by a voice output device (not shown), a dialogue with the user is established.
 To summarize, the voice dialogue device 100 in the first embodiment transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue device 100 includes the voice signal acquisition unit 10 and the wakeup word dividing unit 20. The voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice. When the input voice signal contains a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
 Such a voice dialogue device 100 completes a user's inquiry to the plurality of servers 200 in a single utterance. The user can query the plurality of servers 200 at once, and even when one server cannot answer, a second utterance directed at another server is unnecessary. The voice dialogue device 100 can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
 Likewise, the voice dialogue method in the first embodiment transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue method acquires an input voice signal corresponding to the voice and, when the input voice signal contains a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200.
 Such a voice dialogue method completes a user's inquiry to the plurality of servers 200 at once. The voice dialogue method can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
<Embodiment 2>
The voice dialogue device and the voice dialogue method according to the second embodiment will be described. The second embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the second embodiment includes each configuration of the voice dialogue device 100 in the first embodiment. The same configuration and operation as in the first embodiment will not be described.
 FIG. 5 is a block diagram showing the configuration of the voice dialogue device 101 according to the second embodiment.
 Each of the plurality of servers 200 in the second embodiment can recognize the individual wakeup word indicating itself, but cannot recognize a universal wakeup word. For example, when the user says "OK, anybody. Where can I buy Company X's products?", none of the plurality of servers 200 can recognize the "OK, anybody" part as a wakeup word.
 In the voice dialogue device 101, in addition to the universal wakeup words, individual wakeup words indicating each of the plurality of servers 200 connected to the voice dialogue device 101 are registered in advance.
 The voice dialogue device 101 includes a communication processing unit 30 and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word dividing unit 20 of the first embodiment. The wakeup word dividing unit 20 also differs from the first embodiment in the functions described below.
 When a universal wakeup word is included in the input voice signal, the wakeup word dividing unit 20 deletes the universal wakeup word from the input voice signal to generate a main voice signal. The wakeup word dividing unit 20 then transmits the main voice signal to the plurality of servers 200. In the second embodiment, the wakeup word dividing unit 20 transmits the main voice signal via the communication processing unit 30.
 The communication processing unit 30 is connected to the network 130 and transmits the main voice signal output from the wakeup word dividing unit 20 to each of the plurality of servers 200. The communication processing unit 30 also receives the response signals transmitted from each of the plurality of servers 200 and outputs them to the response signal output unit 40.
 The response signal output unit 40 receives the response signals. In the second embodiment, the response signal output unit 40 outputs the response signals in the order in which they are received from the plurality of servers 200.
 A response signal received from a server is a voice signal, a text signal, or the like. A voice signal serving as a response signal is, for example, a PCM (pulse code modulation) signal or a signal compressed in the mp3 file format, and the response signal output unit 40 outputs that voice signal to the speaker 120. When the response signal is a text signal, the response signal output unit 40 generates, by voice synthesis processing, a voice signal that the speaker 120 can output based on the text signal, and outputs it to the speaker 120.
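The type-dependent handling by the response signal output unit 40 might look like the following sketch. The response representation and `synthesize_speech` are assumptions, the latter standing in for a real text-to-speech engine.

```python
# Hedged sketch: audio responses (PCM / mp3) are passed to the speaker as-is,
# while text responses are first converted to audio by voice synthesis.

def synthesize_speech(text):
    # placeholder for a real text-to-speech engine
    return {"kind": "audio", "format": "pcm", "data": f"<speech:{text}>"}

def output_response(response, speaker):
    if response["kind"] == "audio":      # PCM signal or mp3-compressed signal
        speaker.append(response)
    elif response["kind"] == "text":     # text signal: synthesize first
        speaker.append(synthesize_speech(response["data"]))
    else:
        raise ValueError(f"unknown response kind: {response['kind']}")
```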
 The speaker 120 outputs voice based on the response signal.
 FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device 101 according to the second embodiment.
 The voice dialogue device 101 includes a main processing unit 93 and a program recording medium 94. The main processing unit 93 corresponds to the processing circuits shown in FIGS. 2 and 3. The program recording medium 94 corresponds to the memory 92 shown in FIG. 3.
 The functions of the voice signal acquisition unit 10, the wakeup word dividing unit 20, the communication processing unit 30, and the response signal output unit 40 in the second embodiment are realized by the main processing unit 93. The program recording medium 94 stores a voice dialogue program in which these functions are described. Each of the above functions is realized by the main processing unit 93 executing the voice dialogue program.
 FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
 In step S10, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As in the first embodiment, the user utters "OK, anybody. Where can I buy Company X's products?", and the voice signal acquisition unit 10 acquires the input voice signal corresponding to that voice.
 In step S20, the wakeup word dividing unit 20 analyzes whether the input voice signal contains a wakeup word. The wakeup words to be detected are registered in the voice dialogue device 101 in advance. Here, the individual wakeup words indicating specific servers and the universal wakeup words are registered in advance as the wakeup words to be detected.
 In step S30, the wakeup word dividing unit 20 determines whether a wakeup word has been detected. If a wakeup word has been detected, step S40 is executed. If not, the voice dialogue method ends.
 In step S40, the wakeup word dividing unit 20 determines whether the detected wakeup word is a universal wakeup word. If it is not a universal wakeup word, that is, if the detected wakeup word is an individual wakeup word indicating a specific server, step S50 is executed. If it is a universal wakeup word, step S60 is executed.
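The branch in steps S30 through S60 amounts to classifying the detected wakeup word and choosing the destination set. A hedged sketch follows; the individual wakeup words ("OK, AAA" and so on) are illustrative and not taken from the patent.

```python
# Sketch of the destination selection: a universal wakeup word selects all
# servers (step S60); an individual wakeup word selects one server (step S50).

UNIVERSAL_WAKEUP_WORDS = {"OK, anybody", "OK, everybody"}
INDIVIDUAL_WAKEUP_WORDS = {          # illustrative mapping, word -> server
    "OK, AAA": "first server 210",
    "OK, BBB": "second server 220",
    "OK, CCC": "third server 230",
}
ALL_SERVERS = list(INDIVIDUAL_WAKEUP_WORDS.values())

def select_destinations(wakeup_word):
    if wakeup_word in UNIVERSAL_WAKEUP_WORDS:
        return ALL_SERVERS                               # step S60
    if wakeup_word in INDIVIDUAL_WAKEUP_WORDS:
        return [INDIVIDUAL_WAKEUP_WORDS[wakeup_word]]    # step S50
    return []                                            # no wakeup word detected
```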
 In step S50, the wakeup word dividing unit 20 selects the specific server as the transmission destination.
 In step S60, the wakeup word dividing unit 20 selects the plurality of servers 200 as the transmission destinations.
 In step S70, the wakeup word dividing unit 20 deletes the universal wakeup word from the input voice signal to generate the main voice signal. Here, the wakeup word dividing unit 20 deletes the voice signal corresponding to the universal wakeup word "OK, anybody" from the input voice signal, and generates the main voice signal corresponding to "Where can I buy Company X's products?".
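Step S70 can be illustrated on recognized text as below; the actual device deletes the corresponding portion of the voice signal itself, so this string-based version is only an analogy.

```python
# Minimal sketch of step S70: strip the leading universal wakeup word to
# obtain the main signal that is sent to the servers.

UNIVERSAL_WAKEUP_WORDS = ["OK, anybody", "OK, everybody"]

def to_main_signal(input_text):
    for w in UNIVERSAL_WAKEUP_WORDS:
        if input_text.startswith(w):
            return input_text[len(w):].lstrip(" .,")  # drop the wakeup word
    return input_text

print(to_main_signal("OK, anybody. Where can I buy Company X's products?"))
# → Where can I buy Company X's products?
```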
 ステップS80にて、通信処理部30は、ステップS50もしくはS60で選択されたサーバに音声信号を送信する。すなわち、音声対話処理がステップS50を経た場合には、通信処理部30は入力音声信号を特定のサーバに送信する。音声対話処理がステップS60およびS70を経た場合には、通信処理部30は主音声信号を複数のサーバ200に送信する。 In step S80, the communication processing unit 30 transmits an audio signal to the server selected in step S50 or S60. That is, when the voice dialogue process goes through step S50, the communication processing unit 30 transmits the input voice signal to a specific server. When the voice dialogue processing goes through steps S60 and S70, the communication processing unit 30 transmits the main voice signal to the plurality of servers 200.
 In step S90, response signal reproduction processing is executed. FIG. 8 is a flowchart showing the response signal reproduction processing in the second embodiment.
 In step S91, the communication processing unit 30 receives response signals from the plurality of servers 200.
 In step S92, the response signal output unit 40 outputs a response signal to the speaker 120. When the response signal received from any of the servers is a text signal, the response signal output unit 40 generates a voice signal based on that text signal by speech synthesis processing and outputs it to the speaker 120. Through this processing, the speaker 120 can reproduce the response voices in the order in which the response signals are received from the plurality of servers 200.
 In step S93, the communication processing unit 30 determines whether response signals have been received from all the target servers. A target server is a server to which a voice signal was transmitted in step S80, that is, the specific server or the plurality of servers 200. If response signals have not yet been received from all the target servers, step S91 is executed again. If response signals have been received from all the target servers, the response signal reproduction processing ends, and the voice dialogue method shown in FIG. 7 ends.
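 Steps S91 to S93 amount to a receive-and-play loop that terminates once every target server has answered. A sketch follows, with a thread-safe queue standing in for the communication processing unit; the queue-based interface is an assumption.

```python
import queue

def play_responses_as_received(target_servers, responses, play):
    """Steps S91-S93: receive response signals in arrival order,
    output each one immediately, and stop once every target server
    has responded."""
    remaining = set(target_servers)
    while remaining:
        server, response = responses.get()   # step S91: receive
        play(response)                       # step S92: output to speaker
        remaining.discard(server)            # step S93: all received?
```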
 To summarize, the wakeup word division unit 20 in the second embodiment deletes the universal wakeup word from the input voice signal to generate the main voice signal, and transmits that main voice signal to the plurality of servers 200.
 Even when the plurality of servers 200 cannot recognize the universal wakeup word, the voice dialogue device 101 in the second embodiment transmits to the servers only the main voice signal corresponding to the actual content of the inquiry. The accuracy of the voice dialogue therefore improves.
 Further, simply by connecting to the plurality of servers 200 already in operation, the voice dialogue device 101 in the second embodiment allows the user to complete inquiries to the plurality of servers 200 in a single utterance.
 Further, the response signal output unit 40 in the second embodiment outputs the response signals to the speaker 120 in the order in which they are received from the plurality of servers 200. The response voices can therefore be reproduced in the order in which the response signals were returned.
 <Embodiment 3>
 The voice dialogue device and voice dialogue method according to the third embodiment will now be described. The third embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the third embodiment includes each component of the voice dialogue device 100 in the first embodiment. Descriptions of configurations and operations identical to those of the first or second embodiment are omitted.
 FIG. 9 is a block diagram showing the configuration of the voice dialogue device 102 in the third embodiment.
 Each of the plurality of servers 200 in the third embodiment recognizes its own individual wakeup word but does not recognize the universal wakeup word. The first server 210 recognizes "AAA" as its individual wakeup word, the second server 220 recognizes "BBB", and the third server 230 recognizes "OK, CCC". "AAA", "BBB" and "CCC" are, for example, names or abbreviations of voice recognition processing services. For example, when the user calls out "Hey, BBB", the second server 220 recognizes the wakeup word "BBB" and starts voice recognition processing. Likewise, when the user calls out "OK, CCC", the third server 230 recognizes the wakeup word "OK, CCC" and starts voice recognition processing.
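 A registry like the following could model this mapping. The data structure and the substring matching rule are illustrative assumptions; the patent only states that the words are registered in advance.

```python
# Individual wakeup words registered in advance (the words are the
# patent's placeholders; the server identifiers are hypothetical).
INDIVIDUAL_WAKEUP_WORDS = {
    "AAA": "server_210",
    "BBB": "server_220",
    "OK, CCC": "server_230",
}

def resolve_server(utterance):
    """Return the server whose individual wakeup word appears in the
    utterance, or None if no registered word matches."""
    for word, server in INDIVIDUAL_WAKEUP_WORDS.items():
        if word in utterance:
            return server
    return None
```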
 In the voice dialogue device 102, in addition to the universal wakeup word, the individual wakeup word of each of the plurality of servers 200 is registered in advance.
 The voice dialogue device 102 includes, in addition to the voice signal acquisition unit 10 and the wakeup word division unit 20 of the first embodiment, a wakeup word addition unit 50, the communication processing unit 30 and the response signal output unit 40. The wakeup word division unit 20 differs from that of the first embodiment in the functions described below.
 As in the second embodiment, the wakeup word division unit 20 generates a main voice signal by deleting the universal wakeup word from the input voice signal. In addition, the wakeup word division unit 20 of the third embodiment transmits a voice signal to each specific server indicated by the individual wakeup words that the wakeup word addition unit 50, described below, attaches to the main voice signal. In the third embodiment, the wakeup word division unit 20 transmits the voice signals via the communication processing unit 30.
 When the universal wakeup word is included in the input voice signal, the wakeup word addition unit 50 attaches to the main voice signal an individual voice signal corresponding to the individual wakeup word of each of the plurality of servers 200. The wakeup word addition unit 50 in the third embodiment concatenates the individual voice signal in front of the main voice signal to generate the voice signal. The individual voice signals are stored in the memory 92, for example as fixed values.
 The communication processing unit 30 is connected to the network 130 and transmits the voice signals output from the wakeup word division unit 20 to the servers. The communication processing unit 30 also receives the response signals transmitted from the servers and outputs them to the response signal output unit 40.
 The response signal output unit 40 receives response signals from the plurality of servers 200. In the third embodiment, the response signal output unit 40 receives the response signals via the communication processing unit 30. Each response signal in the third embodiment also includes an effectiveness signal indicating the validity of the response. The response signal output unit 40 outputs a response signal to the speaker 120 based on the effectiveness signal. For example, when a response is judged to be valid, the response signal output unit 40 outputs the response signal to the speaker 120, and the speaker 120 outputs voice based on that response signal.
 FIG. 10 is a diagram showing an example of a response signal including an effectiveness signal in the third embodiment. FIG. 10 shows a response signal described in JSON (JavaScript (registered trademark) Object Notation) format. "effective" is the effectiveness signal, and "payload" is the content of the response to be reproduced. When the value of "effective" is "yes", the response signal output unit 40 outputs the response signal to the speaker 120, and voice is reproduced from the speaker 120. When the value of "effective" is "no", the response signal output unit 40 does not output the response signal to the speaker 120, so no voice is reproduced from the speaker 120. "payload" may be data in which a binary voice signal such as PCM (pulse code modulation) or MP3 has been converted into text form, for example in BASE64 format. Alternatively, "payload" may be a character string such as "Products of company X can be purchased at the online store". In that case, as described above, the response signal output unit 40 generates the voice signal corresponding to the text by speech synthesis processing.
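 Handling a response of the FIG. 10 form could look like the following sketch. The field names follow the figure; the BASE64-encoded-PCM case is just one of the payload encodings the text mentions, and the function name is an assumption.

```python
import base64
import json

def handle_response(raw):
    """Return the audio bytes to play for a FIG.-10-style response,
    or None when the effectiveness signal is "no"."""
    response = json.loads(raw)
    if response.get("effective") != "yes":
        return None                      # suppress the invalid answer
    return base64.b64decode(response["payload"])

valid = json.dumps({"effective": "yes",
                    "payload": base64.b64encode(b"\x00\x01pcm").decode()})
invalid = json.dumps({"effective": "no", "payload": ""})
```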
 The functions of the wakeup word division unit 20, the wakeup word addition unit 50, the communication processing unit 30 and the response signal output unit 40 described above are realized by the processing circuit shown in FIG. 2 or FIG. 3.
 FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
 Steps S10 to S70 are the same as in the second embodiment. Step S100 is executed after step S70.
 In step S100, the wakeup word addition unit 50 attaches the individual voice signal corresponding to each individual wakeup word to the main voice signal. For example, the wakeup word addition unit 50 concatenates the individual voice signal corresponding to "Hey, BBB", which indicates the second server 220, in front of the main voice signal "Where can I buy the products of company X?" to generate a voice signal corresponding to "Hey, BBB, where can I buy the products of company X?". Similarly, the wakeup word addition unit 50 concatenates the individual voice signal corresponding to "OK, CCC", which indicates the third server 230, in front of the main voice signal to generate a voice signal corresponding to "OK, CCC, where can I buy the products of company X?".
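 The concatenation in step S100 can be sketched as follows, with byte strings standing in for the stored PCM signals. The stored-signal table and its contents are illustrative assumptions.

```python
# Individual wakeup-word signals held as fixed values, e.g. in memory 92
# (placeholder bytes stand in for real audio data).
STORED_WAKEUP_SIGNALS = {
    "server_220": b"HEY_BBB|",
    "server_230": b"OK_CCC|",
}

def build_per_server_signals(main_signal):
    """Step S100: prepend each server's individual wakeup-word signal
    to the main voice signal, producing one voice signal per server."""
    return {server: wakeup + main_signal
            for server, wakeup in STORED_WAKEUP_SIGNALS.items()}
```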
 In step S110, the communication processing unit 30 transmits the voice signals to the server or servers selected in step S50 or S60.
 In step S120, response signal reproduction processing is executed. FIG. 12 is a flowchart showing the response signal reproduction processing in the third embodiment.
 In step S121, the communication processing unit 30 receives response signals from the plurality of servers 200.
 In step S122, the response signal output unit 40 determines whether a response signal is valid based on its effectiveness signal. If it is valid, step S123 is executed. If it is not valid, step S124 is executed.
 In step S123, the response signal output unit 40 outputs the response signal to the speaker 120.
 In step S124, the communication processing unit 30 determines whether response signals have been received from all the target servers. If response signals have not yet been received from all the target servers, step S121 is executed again. If response signals have been received from all the target servers, the response signal reproduction processing ends, and the voice dialogue method shown in FIG. 11 ends.
 To summarize, the voice dialogue device 102 in the third embodiment includes the wakeup word addition unit 50. When the universal wakeup word is included in the input voice signal, the wakeup word addition unit 50 attaches to the voice signal (in the third embodiment, the main voice signal) an individual voice signal corresponding to the individual wakeup word of each of the plurality of servers 200. Based on the individual voice signal attached to the voice signal, the wakeup word division unit 20 transmits a voice signal to each specific server indicated by the individual wakeup word.
 When each of the plurality of servers 200 cannot recognize the universal wakeup word and requires the individual wakeup word indicating itself, the voice dialogue device 102 transmits to each server a voice signal to which that server's individual wakeup word has been attached. The accuracy of the voice dialogue with each server therefore improves.
 The voice dialogue device 102 in the third embodiment also includes the response signal output unit 40. The response signal output unit 40 receives a plurality of response signals to the voice signals from the plurality of servers 200 and outputs the response signals to the voice output device based on the effectiveness signal, included in each response signal, that indicates the validity of the response.
 Such a voice dialogue device 102 can have the voice output device reproduce only the valid answers among the responses received from the servers. For example, suppose the content of the responses from the first server 210 and the second server 220 is "I do not know" with the effectiveness signal set to "invalid", while the content of the response from the third server 230 is "Products of company X can be purchased from company X's online store" with the effectiveness signal set to "valid". In that case, the voice dialogue device 102 has the voice output device reproduce only the response from the third server 230.
 When an inquiry is made with the universal wakeup word, the user is not necessarily asking for responses from all the servers. The voice dialogue device 102 can have the voice output device preferentially reproduce good answers, that is, information-rich answers.
 (Modification of Embodiment 3)
 The voice dialogue device 102 and voice dialogue method in a modification of the third embodiment will be described. Descriptions of configurations and operations identical to those of the third embodiment are omitted.
 The universal wakeup word in the modification of the third embodiment indicates the plurality of servers 200 other than a specific server. For example, the universal wakeup word is "OK, other than AAA", which indicates the second server 220 and the third server 230, that is, the servers other than the first server 210.
 In step S60 of FIG. 11, the wakeup word division unit 20 selects the second server 220 and the third server 230 as the transmission destinations, that is, the plurality of servers 200 other than the specific server. The subsequent steps are the same as in FIG. 11, and the wakeup word division unit 20 transmits voice signals to the second server 220 and the third server 230.
 <Embodiment 4>
 The voice dialogue device shown in each of the above embodiments can also be applied to a system constructed by appropriately combining a navigation device, a communication terminal, a server and the functions of the applications installed on them. Here, the navigation device includes, for example, a PND (Portable Navigation Device). The communication terminal includes, for example, mobile terminals such as mobile phones, smartphones and tablets.
 FIG. 13 is a block diagram showing the configuration of the voice dialogue device 100 in the fourth embodiment and the devices that operate in connection with it.
 The voice dialogue device 100 and a communication device 150 are provided in a wakeup word recognition server 300. The voice dialogue device 100 acquires the input voice signal from the microphone 110 provided in the vehicle 1 via the communication device 140 and the communication device 150. When the universal wakeup word is included in the input voice signal, the voice dialogue device 100 transmits a voice signal based on the input voice signal to the plurality of servers 200. The voice dialogue device 100 receives the response signals from the plurality of servers 200 and outputs them, via the communication devices, to the speaker 120 provided in the vehicle 1.
 By arranging the voice dialogue device 100 in the wakeup word recognition server 300 in this way, the configuration of the in-vehicle device can be simplified.
 Alternatively, the functions or components of the voice dialogue device 100 may be arranged in a distributed manner, with some provided in the wakeup word recognition server 300 and others provided in the vehicle 1.
 Within the scope of the invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate.
 Although the present invention has been described in detail, the above description is illustrative in all aspects, and the present invention is not limited to it. It is understood that innumerable variations not illustrated here can be envisioned without departing from the scope of the invention.
 10 voice signal acquisition unit, 20 wakeup word division unit, 30 communication processing unit, 40 response signal output unit, 50 wakeup word addition unit, 94 program recording medium, 100 voice dialogue device, 110 microphone, 120 speaker, 200 plurality of servers.

Claims (7)

  1.  A voice dialogue device that transmits a voice signal to a server that performs voice recognition processing on a voice uttered by a user, the voice dialogue device comprising:
     a voice signal acquisition unit that acquires an input voice signal corresponding to the voice; and
     a wakeup word division unit that, when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmits the voice signal based on the input voice signal to the plurality of servers.
  2.  The voice dialogue device according to claim 1, wherein the wakeup word division unit generates a main voice signal by deleting the universal wakeup word from the input voice signal, and transmits the main voice signal as the voice signal to the plurality of servers.
  3.  The voice dialogue device according to claim 2, further comprising a wakeup word addition unit that, when the universal wakeup word is included in the input voice signal, attaches to the voice signal an individual voice signal corresponding to an individual wakeup word indicating each of the plurality of servers,
     wherein the wakeup word division unit transmits the voice signal, based on the individual voice signal attached to the voice signal, to each specific server indicated by the individual wakeup word.
  4.  The voice dialogue device according to claim 1, wherein the universal wakeup word indicates the plurality of servers other than a specific server, and the wakeup word division unit transmits the voice signal based on the input voice signal to the plurality of servers other than the specific server.
  5.  The voice dialogue device according to claim 1, further comprising a response signal output unit that receives a plurality of response signals to the voice signal from the plurality of servers and outputs the plurality of response signals to a voice output device based on an effectiveness signal that is included in each of the plurality of response signals and indicates the validity of the response.
  6.  A voice dialogue method for transmitting a voice signal to a server that performs voice recognition processing on a voice uttered by a user, the voice dialogue method comprising:
     acquiring an input voice signal corresponding to the voice; and
     when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmitting the voice signal based on the input voice signal to the plurality of servers.
  7.  A computer-readable program recording medium on which is recorded a voice dialogue program for causing a computer to function as a voice dialogue device that transmits a voice signal to a server that performs voice recognition processing on a voice uttered by a user,
     wherein the voice dialogue program causes the computer to function as:
     a voice signal acquisition unit that acquires an input voice signal corresponding to the voice; and
     a wakeup word division unit that, when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmits the voice signal based on the input voice signal to the plurality of servers.