WO2021024466A1 - Voice interaction device, voice interaction method, and program recording medium - Google Patents


Info

Publication number
WO2021024466A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voice signal
servers
signal
wakeup word
Prior art date
Application number
PCT/JP2019/031423
Other languages
French (fr)
Japanese (ja)
Inventor
小谷 亮
Original Assignee
Mitsubishi Electric Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2019/031423 priority Critical patent/WO2021024466A1/en
Priority to JP2021537527A priority patent/JP7224470B2/en
Publication of WO2021024466A1 publication Critical patent/WO2021024466A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/10Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems

Definitions

  • the present invention relates to a voice dialogue device, a voice dialogue method, and a program recording medium.
  • In recent years, voice dialogue systems that enable voice dialogue with humans have been attracting attention.
  • the voice dialogue system transmits voice data to a server via a network, and the server performs voice recognition processing and voice synthesis processing.
  • Such systems enable the provision of services called personal assistants, AI (Artificial Intelligence) assistants, or virtual assistants. Examples of such systems or services include Amazon Echo (registered trademark), Google Home (registered trademark), Siri (registered trademark), and Alexa (registered trademark).
  • the server of these voice dialogue systems starts voice recognition processing based on the wakeup word included in the input voice.
  • The wakeup word is a phrase that is registered in advance and that triggers the start of the voice recognition process.
  • The wakeup word usually varies from system to system. For example, "Alexa" in Amazon's Echo, "Siri" in Apple's Siri, and "OK, Google" in Google's Google Home are known as wakeup words.
  • The present invention has been made to solve the above problem, and an object of the present invention is to provide a voice dialogue device that enables a user to make inquiries to a plurality of servers at one time.
  • the voice dialogue device transmits a voice signal to a server that performs voice recognition processing on the voice spoken by the user.
  • the voice dialogue device includes a voice signal acquisition unit and a wakeup word division unit.
  • the voice signal acquisition unit acquires an input voice signal corresponding to the voice.
  • The wakeup word division unit transmits a voice signal based on the input voice signal to the plurality of servers when the input voice signal contains a universal wakeup word indicating a plurality of servers that perform voice recognition processing.
  • According to the present invention, a voice dialogue device is provided that enables a user to make inquiries to a plurality of servers at one time.
  • FIG. 1 is a block diagram showing the configuration of the voice dialogue device in Embodiment 1.
  • FIG. 2 is a diagram showing an example of the configuration of the processing circuit included in the voice dialogue device.
  • FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device.
  • FIG. 4 is a flowchart showing the voice dialogue method in Embodiment 1.
  • FIG. 5 is a block diagram showing the configuration of the voice dialogue device in Embodiment 2.
  • FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device in Embodiment 2.
  • FIG. 7 is a flowchart showing the voice dialogue method in Embodiment 2.
  • FIG. 8 is a flowchart showing the response signal reproduction processing in Embodiment 2.
  • FIG. 9 is a block diagram showing the configuration of the voice dialogue device in Embodiment 3.
  • FIG. 10 is a diagram showing an example of the response signal including the effectiveness signal in Embodiment 3.
  • FIG. 11 is a flowchart showing the voice dialogue method in Embodiment 3.
  • FIG. 12 is a flowchart showing the response signal reproduction processing in Embodiment 3.
  • FIG. 13 is a block diagram showing the configuration of the voice dialogue device and the devices that operate in connection with it in Embodiment 4.
  • FIG. 1 is a block diagram showing the configuration of the voice dialogue device 100 according to the first embodiment.
  • the voice dialogue device 100 is connected to a plurality of servers 200 via a network. Each of the plurality of servers 200 has a function of performing voice recognition processing on the input voice.
  • The voice dialogue device 100 in the first embodiment is connected, as the plurality of servers 200, to the first server 210 through the third server 230.
  • The first server 210 to the third server 230 each have an individual voice recognition processing function.
  • The first server 210 to the third server 230 are operated by business operators that provide different voice recognition processing services.
  • However, the number of servers connected to the voice dialogue device 100 is not limited to three.
  • Each of the plurality of servers 200 has a function of starting the voice recognition process based on the wakeup word included in the voice signal input to itself.
  • the wake-up word is a word that triggers each of the plurality of servers 200 to start the voice recognition process.
  • the voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word division unit 20.
  • the voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice uttered by the user.
  • the voice is acquired by, for example, the microphone 110.
  • the wakeup word dividing unit 20 detects whether or not the universal wakeup word is included in the input voice signal.
  • the universal wakeup word is a word that collectively indicates a plurality of servers 200.
  • For example, the universal wakeup word is "OK, anybody", "OK, everybody", or the like.
  • The universal wakeup word may be a phrase not in general use, a newly created phrase, a coined word, or the like.
  • When the universal wakeup word is detected, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • FIG. 2 is a diagram showing an example of the configuration of the processing circuit 90 included in the voice dialogue device 100.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 is realized by the processing circuit 90. That is, the processing circuit 90 includes the voice signal acquisition unit 10 and the wakeup word division unit 20.
  • The processing circuit 90 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a circuit combining these.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 may be individually realized by a plurality of processing circuits, or may be collectively realized by one processing circuit.
  • FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device 100.
  • the processing circuit includes a processor 91 and a memory 92.
  • When the processing circuit includes the processor 91, each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 is realized by software or firmware.
  • Each function is realized by the processor 91 executing software or firmware written as a voice dialogue program.
  • the voice dialogue device 100 has a memory 92 for storing the voice dialogue program and a processor 91 for executing the voice dialogue program.
  • the memory 92 is a program recording medium.
  • The voice dialogue program describes a function in which the voice dialogue device 100 acquires an input voice signal corresponding to the voice spoken by the user and, when the input voice signal includes a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200. Further, the voice dialogue program causes a computer to execute the procedures or methods of the voice signal acquisition unit 10 and the wakeup word division unit 20.
  • the processor 91 is, for example, a CPU (Central Processing Unit), an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor), or the like.
  • The memory 92 is, for example, a non-volatile or volatile semiconductor memory such as RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable Read Only Memory), or EEPROM (Electrically Erasable Programmable Read Only Memory).
  • The memory 92 may be a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, a DVD, or any storage medium to be used in the future.
  • Each function of the voice signal acquisition unit 10 and the wakeup word division unit 20 described above may be partially realized by dedicated hardware and the other part may be realized by software or firmware.
  • the processing circuit realizes each of the above-mentioned functions by hardware, software, firmware, or a combination thereof.
  • FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
  • In step S1, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user.
  • For example, the user utters "OK, anybody. Where can I buy the product of company X?"
  • The microphone 110 acquires the voice.
  • The voice signal acquisition unit 10 acquires the input voice signal from the microphone 110.
  • In step S2, the wakeup word division unit 20 analyzes whether the input voice signal includes a universal wakeup word.
  • The universal wakeup word to be analyzed is registered in the voice dialogue device 100 in advance.
  • For example, "OK, anybody" and "OK, everybody" are registered in advance in the voice dialogue device 100 as universal wakeup words.
  • In step S3, the wakeup word division unit 20 determines whether or not a universal wakeup word has been detected. If a universal wakeup word is detected, step S4 is executed. If no universal wakeup word is detected, the voice dialogue method ends.
  • In step S4, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Each of the plurality of servers 200 starts the voice recognition process based on the universal wakeup word included in the voice signal received from the voice dialogue device 100. Then, each of the plurality of servers 200 transmits a response signal based on the result of the voice recognition process to the voice dialogue device 100.
  • the voice dialogue device 100 receives response signals from a plurality of servers 200. When the response signal is reproduced by a voice output device (not shown), a dialogue with the user is established.
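The flow of steps S1 to S4 above can be sketched as follows. This is a hypothetical illustration, not the patented implementation: the wakeup word detection is reduced to a substring match on already-recognized text, and `server.send` stands in for the network transmission to the plurality of servers 200.

```python
# Hypothetical sketch of steps S1-S4: detect a pre-registered universal
# wakeup word and, if one is found, forward the voice signal to every server.
# Names (handle_utterance, server.send) are illustrative, not from the patent.

UNIVERSAL_WAKEUP_WORDS = ("OK, anybody", "OK, everybody")  # registered in advance

def handle_utterance(recognized_text, voice_signal, servers):
    """Return the servers the voice signal was sent to (empty if no match)."""
    # Steps S2-S3: analyze the input for a universal wakeup word
    if not any(w.lower() in recognized_text.lower() for w in UNIVERSAL_WAKEUP_WORDS):
        return []  # no universal wakeup word detected: the method ends
    # Step S4: transmit a voice signal based on the input to all servers
    for server in servers:
        server.send(voice_signal)
    return list(servers)
```

In this sketch a single utterance containing "OK, anybody" reaches every connected server at once, which is the effect the embodiment describes.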
  • the voice dialogue device 100 transmits a voice signal to a server that performs voice recognition processing on the voice uttered by the user.
  • the voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word division unit 20.
  • the voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice.
  • When the input voice signal includes a universal wakeup word indicating the plurality of servers 200 that perform voice recognition processing, the wakeup word division unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Such a voice dialogue device 100 completes a user's inquiry to a plurality of servers 200 at once.
  • the user can make inquiries to a plurality of servers 200 at once, and even if one server cannot answer, it is not necessary to make a second utterance to another server.
  • the voice dialogue device 100 can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
  • In the voice dialogue method according to the first embodiment, a voice signal is transmitted to a server that performs voice recognition processing on the voice uttered by the user.
  • The voice dialogue method acquires an input voice signal corresponding to the voice and, when the input voice signal includes a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • Such a voice dialogue method completes a user's inquiry to a plurality of servers 200 at once.
  • the voice dialogue method can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of voice dialogue.
  • the voice dialogue device and the voice dialogue method according to the second embodiment will be described.
  • the second embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the second embodiment includes each configuration of the voice dialogue device 100 in the first embodiment. The same configuration and operation as in the first embodiment will not be described.
  • FIG. 5 is a block diagram showing the configuration of the voice dialogue device 101 according to the second embodiment.
  • Each of the plurality of servers 200 in the second embodiment can recognize an individual wakeup word indicating its own server, but cannot recognize a universal wakeup word. For example, when the user says "OK, anybody. Where can I buy the product of company X?", each of the plurality of servers 200 cannot recognize the "OK, anybody" portion as a wakeup word.
  • the voice dialogue device 101 includes a communication processing unit 30 and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word division unit 20 of the first embodiment. Further, the wakeup word division unit 20 is different from the first embodiment in the functions shown below.
  • The wakeup word division unit 20 deletes the universal wakeup word from the input voice signal to generate a main voice signal. Then, the wakeup word division unit 20 transmits the main voice signal to the plurality of servers 200. The wakeup word division unit 20 in the second embodiment transmits the main voice signal via the communication processing unit 30.
  • The communication processing unit 30 is connected to the network 130 and transmits the main voice signal output from the wakeup word division unit 20 to each of the plurality of servers 200. Further, the communication processing unit 30 receives the response signals transmitted from each of the plurality of servers 200 and outputs them to the response signal output unit 40.
  • the response signal output unit 40 receives the response signal.
  • the response signal output unit 40 according to the second embodiment outputs the response signals in the order in which the response signals are received from the plurality of servers 200.
  • the response signal received from the server is a voice signal, a text signal, etc.
  • A voice signal as the response signal is, for example, a PCM (pulse code modulation) signal or a signal compressed in the mp3 file format, and the response signal output unit 40 outputs the voice signal to the speaker 120.
  • When the response signal is a text signal, the response signal output unit 40 generates, by voice synthesis processing, a voice signal that the speaker 120 can output as voice based on the text signal, and outputs the voice signal to the speaker 120.
  • the speaker 120 outputs voice based on the response signal.
  • FIG. 6 is a diagram showing a hardware configuration of the voice dialogue device 101 according to the second embodiment.
  • the voice dialogue device 101 includes a main processing unit 93 and a program recording medium 94.
  • the main processing unit 93 corresponds to the processing circuits shown in FIGS. 2 and 3.
  • the program recording medium 94 corresponds to the memory 92 shown in FIG.
  • The functions of the voice signal acquisition unit 10, the wakeup word division unit 20, the communication processing unit 30, and the response signal output unit 40 in the second embodiment are realized by the main processing unit 93. Further, the program recording medium 94 stores a voice dialogue program in which the functions of the voice signal acquisition unit 10, the wakeup word division unit 20, the communication processing unit 30, and the response signal output unit 40 are described. Each of the above functions is realized by the main processing unit 93 executing the voice dialogue program.
  • FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
  • In step S10, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As in the first embodiment, the user utters "OK, anybody. Where can I buy the product of company X?", and the voice signal acquisition unit 10 acquires the input voice signal corresponding to the voice.
  • In step S20, the wakeup word division unit 20 analyzes whether the input voice signal includes a wakeup word.
  • the wake-up word to be analyzed is registered in the voice dialogue device 101 in advance.
  • individual wakeup words indicating a specific server and universal wakeup words are registered in advance as wakeup words to be analyzed.
  • In step S30, the wakeup word division unit 20 determines whether or not a wakeup word has been detected. If a wakeup word is detected, step S40 is executed. If no wakeup word is detected, the voice dialogue method ends.
  • In step S40, the wakeup word division unit 20 determines whether or not the detected wakeup word is a universal wakeup word. If it is not a universal wakeup word, that is, if the detected wakeup word is an individual wakeup word indicating a specific server, step S50 is executed. If it is a universal wakeup word, step S60 is executed.
  • In step S50, the wakeup word division unit 20 selects the specific server as the transmission destination.
  • In step S60, the wakeup word division unit 20 selects the plurality of servers 200 as the transmission destinations.
  • In step S70, the wakeup word division unit 20 deletes the universal wakeup word from the input voice signal to generate the main voice signal.
  • In this example, the wakeup word division unit 20 deletes the voice signal corresponding to the universal wakeup word "OK, anybody" from the input voice signal and generates a main voice signal corresponding to "Where can I buy the product of company X?".
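Step S70 might look like the following sketch. It is a hypothetical illustration in which the signal is modeled as recognized text rather than audio samples, so deleting the universal wakeup word becomes stripping a known prefix; a real device would remove the corresponding section of the audio signal itself.

```python
# Hypothetical sketch of step S70: delete the universal wakeup word from
# the input to generate the main signal. The signal is modeled as text;
# the function and variable names are illustrative, not from the patent.

UNIVERSAL_WAKEUP_WORDS = ("OK, anybody", "OK, everybody")

def make_main_signal(input_text):
    for word in UNIVERSAL_WAKEUP_WORDS:
        if input_text.startswith(word):
            # drop the wakeup word and any separating punctuation or space
            return input_text[len(word):].lstrip(" .,")
    return input_text  # no universal wakeup word: signal unchanged
```

For example, under these assumptions `make_main_signal("OK, anybody. Where can I buy the product of company X?")` yields `"Where can I buy the product of company X?"`.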
  • In step S80, the communication processing unit 30 transmits a voice signal to the server selected in step S50 or S60. That is, when the voice dialogue process goes through step S50, the communication processing unit 30 transmits the input voice signal to the specific server. When the voice dialogue process goes through steps S60 and S70, the communication processing unit 30 transmits the main voice signal to the plurality of servers 200.
  • In step S90, the response signal reproduction process is executed.
  • FIG. 8 is a flowchart showing the response signal reproduction processing according to the second embodiment.
  • In step S91, the communication processing unit 30 receives response signals from the plurality of servers 200.
  • In step S92, the response signal output unit 40 outputs the response signal to the speaker 120.
  • When the response signal received from any of the servers is a text signal, the response signal output unit 40 generates a voice signal based on the text signal by voice synthesis processing and outputs it to the speaker 120. By such processing, the speaker 120 can reproduce the response voices in the order in which the response signals are received from the plurality of servers 200.
  • In step S93, the communication processing unit 30 determines whether or not response signals have been received from all the target servers.
  • The target servers are the servers to which the voice signal was transmitted in step S80, that is, the specific server or the plurality of servers 200. If response signals have not yet been received from all the target servers, step S91 is executed again. When the response signals have been received from all the target servers, the response signal reproduction process ends. Then, the voice dialogue method shown in FIG. 7 ends.
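The receive-and-play loop of steps S91 to S93 can be sketched as follows. This is an illustration under stated assumptions: responses arrive on a queue, `synthesize` stands in for the voice synthesis processing, and `play` stands in for output to the speaker 120; none of these names come from the patent.

```python
import queue

# Hypothetical sketch of steps S91-S93: play each response in the order
# it arrives, synthesizing text responses first, until every target
# server has answered. All names are illustrative assumptions.

def play_responses(response_queue, expected_count, synthesize, play):
    received = 0
    while received < expected_count:          # step S93: have all servers answered?
        kind, payload = response_queue.get()  # step S91: wait for the next response
        if kind == "text":
            payload = synthesize(payload)     # text responses need voice synthesis
        play(payload)                         # step S92: output to the speaker
        received += 1
```

Because the loop plays whatever arrives first, the response of the fastest server is reproduced first, matching the ordering behavior described above.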
  • the wakeup word dividing unit 20 in the second embodiment deletes the universal wakeup word from the input voice signal to generate the main voice signal. Then, the wakeup word dividing unit 20 transmits the main audio signal to the plurality of servers 200.
  • the voice dialogue device 101 transmits only the main voice signal corresponding to the specific inquiry content to the server. Therefore, the accuracy of the voice dialogue is improved.
  • the voice dialogue device 101 has an effect that the user can complete inquiries to the plurality of servers 200 at once simply by connecting to the plurality of servers 200 already in operation.
  • The response signal output unit 40 outputs the response signals to the speaker 120 in the order in which they are received from the plurality of servers 200. Therefore, the response voices can be reproduced starting from the server whose response signal is returned first.
  • the voice dialogue device and the voice dialogue method according to the third embodiment will be described.
  • the third embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the third embodiment includes each configuration of the voice dialogue device 100 in the first embodiment.
  • the description of the configuration and operation similar to those of the first or second embodiment will be omitted.
  • FIG. 9 is a block diagram showing the configuration of the voice dialogue device 102 according to the third embodiment.
  • Each of the plurality of servers 200 in the third embodiment recognizes an individual wakeup word indicating its own server, but does not recognize a universal wakeup word.
  • For example, the first server 210 recognizes "AAA" as an individual wakeup word.
  • The second server 220 recognizes "BBB" as an individual wakeup word.
  • The third server 230 recognizes "OK, CCC" as an individual wakeup word.
  • "AAA", "BBB", and "CCC" are, for example, names or abbreviations of voice recognition processing services. For example, when the user calls "Hey, BBB", the second server 220 recognizes the wakeup word "BBB" and starts the voice recognition process. Alternatively, when the user calls "OK, CCC", the third server 230 recognizes the wakeup word "OK, CCC" and starts the voice recognition process.
  • an individual wakeup word indicating each of the plurality of servers 200 is registered in advance.
  • the voice dialogue device 102 includes a wakeup word adding unit 50, a communication processing unit 30, and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word dividing unit 20 of the first embodiment. Further, the wakeup word division unit 20 is different from the first embodiment in the functions shown below.
  • The wakeup word division unit 20 generates a main voice signal in which the universal wakeup word is deleted from the input voice signal, as in the second embodiment. Further, the wakeup word division unit 20 of the third embodiment transmits the voice signal, to which an individual wakeup word has been added by the wakeup word adding unit 50 described later, to each specific server indicated by that individual wakeup word. In the third embodiment, the wakeup word division unit 20 transmits the voice signal via the communication processing unit 30.
  • The wakeup word adding unit 50 adds, to the main voice signal, an individual voice signal corresponding to each of the individual wakeup words indicating each of the plurality of servers 200.
  • The wakeup word adding unit 50 in the third embodiment concatenates the individual voice signal before the main voice signal to generate the voice signal.
  • the individual voice signal is stored in the memory 92 as a fixed value, for example.
  • The communication processing unit 30 is connected to the network 130 and transmits the voice signal output from the wakeup word division unit 20 to the servers. Further, the communication processing unit 30 receives the response signals transmitted from the servers and outputs them to the response signal output unit 40.
  • the response signal output unit 40 receives response signals from a plurality of servers 200.
  • the response signal output unit 40 receives the response signal via the communication processing unit 30.
  • the response signal in the third embodiment includes an effectiveness signal indicating the effectiveness of the response.
  • the response signal output unit 40 outputs a response signal to the speaker 120 based on the effectiveness signal. For example, when it is determined that the response is valid, the response signal output unit 40 outputs the response signal to the speaker 120.
  • the speaker 120 outputs voice based on the response signal.
  • FIG. 10 is a diagram showing an example of a response signal including an effectiveness signal in the third embodiment.
  • FIG. 10 shows a response signal described in the JSON (JavaScript (registered trademark) Object Notation) format. "effective" indicates the validity signal, and "payload" indicates the content of the response to be reproduced.
  • When the value of "effective" is "yes", the response signal output unit 40 outputs the response signal to the speaker 120, and the voice is reproduced from the speaker 120.
  • When the value of "effective" is "no", the response signal output unit 40 does not output the response signal to the speaker 120. That is, no voice is reproduced from the speaker 120.
  • the "payload” may be data in which a binary audio signal such as PCM (pulse code modulation) or mp3 is converted into a text format by a BASE64 format or the like.
  • "payload” may be a character string such as "Products of company X can be purchased at an online store”.
  • the response signal output unit 40 generates a voice signal corresponding to the text by voice synthesis processing.
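Handling a JSON response like the one in FIG. 10 might look like the following sketch. The "effective" and "payload" keys follow the description above; the "type" key distinguishing BASE64 audio from plain text is an assumption added for illustration, as are the function names `play` and `synthesize`.

```python
import base64
import json

# Hypothetical sketch of validity-based playback (step S122): reproduce a
# response only when its validity signal is "yes". The "type" field is an
# assumed addition; "effective" and "payload" follow the FIG. 10 example.

def handle_response(raw_json, play, synthesize):
    response = json.loads(raw_json)
    if response.get("effective") != "yes":
        return False                      # invalid response: reproduce nothing
    payload = response["payload"]
    if response.get("type") == "audio_base64":
        play(base64.b64decode(payload))   # binary PCM/mp3 carried as BASE64 text
    else:
        play(synthesize(payload))         # plain text: synthesize a voice signal
    return True
```

Under these assumptions, a response whose "effective" value is "no" is silently discarded, while a valid text response is synthesized before being output.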
  • the functions of the wakeup word dividing unit 20, the wakeup word imparting unit 50, the communication processing unit 30, and the response signal output unit 40 are realized by the processing circuit shown in FIG. 2 or FIG.
  • FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
  • Steps S10 to S70 are the same as in the second embodiment. Step S100 is executed following step S70.
  • In step S100, the wakeup word adding unit 50 adds, to the main voice signal, the individual voice signal corresponding to each individual wakeup word.
  • For example, the wakeup word adding unit 50 concatenates the individual voice signal corresponding to "Hey, BBB" indicating the second server 220 before the main voice signal of "Where can I buy the product of company X?", and generates a voice signal corresponding to "Hey, BBB, where can I buy the product of company X?".
  • Similarly, the wakeup word adding unit 50 concatenates the individual voice signal corresponding to "OK, CCC" indicating the third server 230 before the main voice signal of "Where can I buy the product of company X?", and generates a voice signal corresponding to "OK, CCC, where can I buy the product of company X?".
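Step S100 can be sketched as follows, with signals modeled as byte strings so that concatenating the individual voice signal before the main voice signal becomes byte concatenation. The table of individual wakeup-word signals and all names are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch of step S100: prepend each server's individual
# wakeup-word signal to the main signal. Signals are byte strings here;
# a real device would concatenate audio samples (stored as fixed values,
# e.g. in the memory 92).

INDIVIDUAL_WAKEUP_SIGNALS = {
    "second_server": b"Hey, BBB. ",
    "third_server": b"OK, CCC. ",
}

def build_per_server_signals(main_signal, destinations):
    """Return one wakeup-word-prefixed signal per destination server."""
    return {dest: INDIVIDUAL_WAKEUP_SIGNALS[dest] + main_signal
            for dest in destinations}
```

Each server then receives a signal that begins with the individual wakeup word it can recognize, which is why the existing servers need no modification.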
  • In step S110, the communication processing unit 30 transmits the voice signal to the servers selected in step S50 or S60.
  • In step S120, the response signal reproduction process is executed.
  • FIG. 12 is a flowchart showing the response signal reproduction processing according to the third embodiment.
  • In step S121, the communication processing unit 30 receives response signals from the plurality of servers 200.
  • In step S122, the response signal output unit 40 determines whether or not the response signal is valid based on the validity signal. If it is valid, step S123 is executed. If it is not valid, step S124 is executed.
  • In step S123, the response signal output unit 40 outputs the response signal to the speaker 120.
  • In step S124, the communication processing unit 30 determines whether or not response signals have been received from all the target servers. If response signals have not yet been received from all the target servers, step S121 is executed again. When the response signals have been received from all the target servers, the response signal reproduction process ends. Then, the voice dialogue method shown in FIG. 11 ends.
  • The voice dialogue device 102 in the third embodiment includes the wakeup word adding unit 50.
  • The wakeup word adding unit 50 adds, to the voice signal, an individual voice signal corresponding to each of the individual wakeup words indicating each of the plurality of servers 200.
  • The wakeup word division unit 20 transmits the voice signal to each specific server indicated by the individual wakeup word, based on the individual voice signal added to the voice signal.
  • In this way, the voice dialogue device 102 adds an individual wakeup word for each server and transmits the voice signal to each server. Therefore, the accuracy of the voice dialogue with each server is improved.
  • the voice dialogue device 102 in the third embodiment includes a response signal output unit 40.
  • The response signal output unit 40 receives a plurality of response signals for the voice signal from the plurality of servers 200 and outputs the plurality of response signals to the voice output device based on the validity signal, indicating the validity of the response, included in each of the plurality of response signals.
  • Such a voice dialogue device 102 can cause the voice output device to reproduce only valid answers among the responses received from the servers. For example, suppose the content of the responses of the first server 210 and the second server 220 is "I do not know" with the value of the validity signal being "invalid", while the content of the response of the third server 230 is "The products of company X can be purchased at company X's online store" with the value of the validity signal being "valid". In that case, the voice dialogue device 102 causes the voice output device to reproduce only the response of the third server 230.
  • the voice dialogue device 102 can give priority to a good answer, that is, an information-rich answer, and cause the voice output device to reproduce the answer.
  • The universal wakeup word in the modified example of the third embodiment indicates the plurality of servers 200 other than a specific server.
  • For example, the universal wakeup word "OK, other than AAA" indicates the second server 220 and the third server 230, that is, the servers other than the first server 210.
  • In step S60 of FIG. 11, the wakeup word dividing unit 20 selects the second server 220 and the third server 230, that is, the plurality of servers 200 other than the specific server, as the transmission destinations. The subsequent steps are the same as the corresponding steps of FIG. 11, and the wakeup word dividing unit 20 transmits the voice signal to the second server 220 and the third server 230.
  • The voice dialogue device shown in each of the above embodiments can also be applied to a system constructed by appropriately combining a navigation device, a communication terminal, a server, and the functions of applications installed in the navigation device.
  • Navigation devices include, for example, PNDs (Portable Navigation Devices).
  • Communication terminals include, for example, mobile terminals such as mobile phones, smartphones, and tablets.
  • FIG. 13 is a block diagram showing the configuration of the voice dialogue device 100 according to the fourth embodiment and the devices that operate in connection with it.
  • In the fourth embodiment, the voice dialogue device 100 and the communication device 150 are provided in the wakeup word recognition server 300.
  • The voice dialogue device 100 acquires an input voice signal from the microphone 110 provided in the vehicle 1 via the communication device 140 and the communication device 150.
  • When a universal wakeup word is included in the input voice signal, the voice dialogue device 100 transmits a voice signal based on the input voice signal to the plurality of servers 200.
  • The voice dialogue device 100 receives response signals from the plurality of servers 200 and outputs them to the speaker 120 provided in the vehicle 1 via each communication device.
  • By providing the voice dialogue device 100 in the wakeup word recognition server 300 in this way, the configuration of the in-vehicle device can be simplified.
  • Alternatively, part of the voice dialogue device 100 may be provided in the wakeup word recognition server 300, and the remaining part may be distributed to the vehicle 1.
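The validity-based filtering described for the third embodiment above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the response dictionaries and field names are assumptions.

```python
# Hedged sketch of the Embodiment 3 behavior: keep only the responses whose
# validity signal is "valid" and reproduce those. Field names are illustrative.

def select_valid_responses(responses):
    """Return the responses marked valid by their validity signal."""
    return [r for r in responses if r.get("validity") == "valid"]

responses = [
    {"server": "first server 210",  "validity": "invalid", "content": "I do not know"},
    {"server": "second server 220", "validity": "invalid", "content": "I do not know"},
    {"server": "third server 230",  "validity": "valid",
     "content": "Company X's products can be purchased from Company X's online store"},
]

for r in select_valid_responses(responses):
    # only the third server's answer would be reproduced by the voice output device
    print(r["server"], "->", r["content"])
```

Prioritizing an information-rich answer, as the bullets describe, would then be a ranking over this surviving set rather than over all received responses.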

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The objective of the present invention is to provide a voice interaction device with which it is possible to handle an inquiry from a user to a plurality of servers at one time. The voice interaction device includes a voice signal acquisition unit and a wakeup word division unit. The voice signal acquisition unit acquires an input voice signal corresponding to a voice. When a universal wakeup word indicating a plurality of servers that perform voice recognition processing is included in the input voice signal, the wakeup word division unit transmits, to the plurality of servers, a voice signal based on the input voice signal.

Description

Voice dialogue device, voice dialogue method, and program recording medium
 The present invention relates to a voice dialogue device, a voice dialogue method, and a program recording medium.
 As the accuracy of voice recognition technology has increased, voice dialogue systems capable of spoken dialogue with humans have come into the limelight. A voice dialogue system transmits voice data to a server via a network, and the server performs voice recognition processing and voice synthesis processing. Such systems enable the provision of services called personal assistants, AI (Artificial Intelligence) assistants, or virtual assistants; known examples include Echo (registered trademark) of Amazon (registered trademark) and Google Home (registered trademark) of Google (registered trademark). As systems and services installed in smartphones, Siri (registered trademark) of Apple (registered trademark), Google Assistant of Google, Alexa (registered trademark) of Amazon, and the like are known.
 The servers of these voice dialogue systems start voice recognition processing based on a wakeup word included in the input voice. A wakeup word is a phrase that is registered in advance and serves as a trigger for starting voice recognition processing. The wakeup word usually differs from system to system. For example, "Alexa" for Amazon's Echo, "Siri" for Apple's Siri, and "OK, Google" for Google's Google Home are known wakeup words.
Japanese Unexamined Patent Publication No. 2018-181330
 As described above, services are provided by many voice dialogue systems, so a user is often in an environment where multiple services are available, that is, an environment with access to multiple servers capable of voice recognition processing. In such an environment, if one server cannot answer a user's inquiry appropriately, the user must speak again with a different wakeup word in order to ask another server.
 The present invention has been made to solve the above problem, and an object of the present invention is to provide a voice dialogue device that allows a user's inquiry to a plurality of servers to be completed in a single utterance.
 The voice dialogue device according to the present invention transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue device includes a voice signal acquisition unit and a wakeup word dividing unit. The voice signal acquisition unit acquires an input voice signal corresponding to the voice. When the input voice signal contains a universal wakeup word indicating a plurality of servers that perform voice recognition processing, the wakeup word dividing unit transmits a voice signal based on the input voice signal to the plurality of servers.
 According to the present invention, it is possible to provide a voice dialogue device that allows a user's inquiry to a plurality of servers to be completed at one time.
 The objects, features, aspects, and advantages of the present invention will become more apparent from the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the voice dialogue device in the first embodiment.
FIG. 2 is a diagram showing an example of the configuration of a processing circuit included in the voice dialogue device.
FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device.
FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
FIG. 5 is a block diagram showing the configuration of the voice dialogue device in the second embodiment.
FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device in the second embodiment.
FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
FIG. 8 is a flowchart showing the response signal reproduction processing in the second embodiment.
FIG. 9 is a block diagram showing the configuration of the voice dialogue device in the third embodiment.
FIG. 10 is a diagram showing an example of a response signal including a validity signal in the third embodiment.
FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
FIG. 12 is a flowchart showing the response signal reproduction processing in the third embodiment.
FIG. 13 is a block diagram showing the configuration of the voice dialogue device in the fourth embodiment and the devices that operate in connection with it.
<Embodiment 1>
FIG. 1 is a block diagram showing the configuration of the voice dialogue device 100 according to the first embodiment.
 The voice dialogue device 100 is connected to a plurality of servers 200 via a network. Each of the plurality of servers 200 has a function of performing voice recognition processing on input voice. In the first embodiment, the voice dialogue device 100 is connected to the first server 210 through the third server 230 as the plurality of servers 200. The first server 210 through the third server 230 each have their own voice recognition processing function; for example, they are operated by business operators that provide different voice recognition processing services. The number of servers connected to the voice dialogue device 100 is not limited to three.
 Each of the plurality of servers 200 has a function of starting voice recognition processing based on a wakeup word included in the voice signal input to it. A wakeup word is a word that triggers each of the plurality of servers 200 to start voice recognition processing.
 The voice dialogue device 100 includes a voice signal acquisition unit 10 and a wakeup word dividing unit 20.
 The voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice uttered by the user. The voice is captured by, for example, the microphone 110.
 The wakeup word dividing unit 20 detects whether a universal wakeup word is included in the input voice signal. A universal wakeup word is a word that collectively indicates the plurality of servers 200, for example, "OK, anybody" or "OK, everybody". Alternatively, a universal wakeup word may be a phrase with little worldwide usage history, a new phrase with no usage history, a coined word, or the like. These universal wakeup words are registered in the voice dialogue device 100 in advance.
 When a universal wakeup word is included in the input voice signal, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
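The dividing unit's core decision can be sketched as follows. This is a minimal illustration operating on recognized text, whereas the actual device works on the voice signal; `send` and the server list are stand-ins, not part of the patent.

```python
# Illustrative sketch (not the patent's implementation): if the input begins
# with a registered universal wakeup word, broadcast it to every server.

UNIVERSAL_WAKEUP_WORDS = ["OK, anybody", "OK, everybody"]  # registered in advance

def contains_universal_wakeup_word(input_text):
    return any(input_text.startswith(w) for w in UNIVERSAL_WAKEUP_WORDS)

def dispatch(input_text, servers, send):
    """send(server, text) stands in for the network transmission."""
    if contains_universal_wakeup_word(input_text):
        for server in servers:  # transmit to the plurality of servers 200
            send(server, input_text)
        return True
    return False
```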
 FIG. 2 is a diagram showing an example of the configuration of the processing circuit 90 included in the voice dialogue device 100. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 are realized by the processing circuit 90. That is, the processing circuit 90 includes the voice signal acquisition unit 10 and the wakeup word dividing unit 20.
 When the processing circuit 90 is dedicated hardware, the processing circuit 90 is, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 may be realized individually by a plurality of processing circuits, or collectively by one processing circuit.
 FIG. 3 is a diagram showing another example of the configuration of the processing circuit included in the voice dialogue device 100. The processing circuit includes a processor 91 and a memory 92. The functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 are realized by the processor 91 executing a voice dialogue program stored in the memory 92; for example, each function is realized by the processor 91 executing software or firmware written as the voice dialogue program. Thus, the voice dialogue device 100 includes the memory 92 that stores the voice dialogue program and the processor 91 that executes it. In other words, the memory 92 is a program recording medium.
 The voice dialogue program describes a function by which the voice dialogue device 100 acquires an input voice signal corresponding to voice uttered by a user and, when a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing is contained in the input voice signal, transmits a voice signal based on the input voice signal to the plurality of servers 200. The voice dialogue program also causes a computer to execute the procedures or methods of the voice signal acquisition unit 10 and the wakeup word dividing unit 20.
 The processor 91 is, for example, a CPU (Central Processing Unit), an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor), or the like. The memory 92 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory). Alternatively, the memory 92 may be any storage medium to be used in the future, such as a magnetic disk, a flexible disk, an optical disc, a compact disc, a mini disc, or a DVD.
 Some of the functions of the voice signal acquisition unit 10 and the wakeup word dividing unit 20 described above may be realized by dedicated hardware, and the rest by software or firmware. In this way, the processing circuit realizes each of the above functions by hardware, software, firmware, or a combination thereof.
 FIG. 4 is a flowchart showing the voice dialogue method in the first embodiment.
 In step S1, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As an example, the user utters "OK, anybody. Where can I buy Company X's products?", and the microphone 110 captures the voice. The voice signal acquisition unit 10 acquires the input voice signal from the microphone 110.
 In step S2, the wakeup word dividing unit 20 analyzes whether the input voice signal contains a universal wakeup word. The universal wakeup words to be detected are registered in the voice dialogue device 100 in advance. Here, "OK, anybody" and "OK, everybody" are registered in advance as universal wakeup words.
 In step S3, the wakeup word dividing unit 20 determines whether a universal wakeup word has been detected. If a universal wakeup word has been detected, step S4 is executed. If not, the voice dialogue method ends.
 In step S4, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200. In the first embodiment, the input voice signal corresponding to "OK, anybody. Where can I buy Company X's products?", that is, the input voice signal acquired by the voice signal acquisition unit 10, is transmitted to the plurality of servers 200.
 Each of the plurality of servers 200 starts voice recognition processing based on the universal wakeup word included in the voice signal received from the voice dialogue device 100. Then, each of the plurality of servers 200 transmits a response signal based on the result of its voice recognition processing to the voice dialogue device 100. The voice dialogue device 100 receives the response signals from the plurality of servers 200. When the response signals are reproduced by a voice output device (not shown), a dialogue with the user is established.
 To summarize, the voice dialogue device 100 in the first embodiment transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue device 100 includes the voice signal acquisition unit 10 and the wakeup word dividing unit 20. The voice signal acquisition unit 10 acquires an input voice signal corresponding to the voice. When the input voice signal contains a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, the wakeup word dividing unit 20 transmits a voice signal based on the input voice signal to the plurality of servers 200.
 Such a voice dialogue device 100 completes a user's inquiry to the plurality of servers 200 in a single utterance. The user can query the plurality of servers 200 at once, and even when one server cannot answer, a second utterance directed at another server is unnecessary. The voice dialogue device 100 can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
 Likewise, the voice dialogue method in the first embodiment transmits a voice signal to servers that perform voice recognition processing on voice uttered by a user. The voice dialogue method acquires an input voice signal corresponding to the voice and, when the input voice signal contains a universal wakeup word indicating a plurality of servers 200 that perform voice recognition processing, transmits a voice signal based on the input voice signal to the plurality of servers 200.
 Such a voice dialogue method completes a user's inquiry to the plurality of servers 200 at once. The voice dialogue method can be applied to a voice recognition processing system having a voice recognition processing function, and improves the efficiency of the voice dialogue.
<Embodiment 2>
The voice dialogue device and the voice dialogue method according to the second embodiment will be described. The second embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the second embodiment includes each configuration of the voice dialogue device 100 in the first embodiment. The same configuration and operation as in the first embodiment will not be described.
 FIG. 5 is a block diagram showing the configuration of the voice dialogue device 101 according to the second embodiment.
 Each of the plurality of servers 200 in the second embodiment can recognize the individual wakeup word indicating itself, but cannot recognize a universal wakeup word. For example, when the user says "OK, anybody. Where can I buy Company X's products?", none of the plurality of servers 200 can recognize the "OK, anybody" part as a wakeup word.
 In the voice dialogue device 101, in addition to the universal wakeup words, individual wakeup words indicating each of the plurality of servers 200 connected to the voice dialogue device 101 are registered in advance.
 The voice dialogue device 101 includes a communication processing unit 30 and a response signal output unit 40 in addition to the voice signal acquisition unit 10 and the wakeup word dividing unit 20 of the first embodiment. The wakeup word dividing unit 20 also differs from the first embodiment in the functions described below.
 When a universal wakeup word is included in the input voice signal, the wakeup word dividing unit 20 deletes the universal wakeup word from the input voice signal to generate a main voice signal. The wakeup word dividing unit 20 then transmits the main voice signal to the plurality of servers 200. In the second embodiment, the wakeup word dividing unit 20 transmits the main voice signal via the communication processing unit 30.
 The communication processing unit 30 is connected to the network 130 and transmits the main voice signal output from the wakeup word dividing unit 20 to each of the plurality of servers 200. The communication processing unit 30 also receives the response signals transmitted from each of the plurality of servers 200 and outputs them to the response signal output unit 40.
 The response signal output unit 40 receives the response signals. In the second embodiment, the response signal output unit 40 outputs the response signals in the order in which they are received from the plurality of servers 200.
 A response signal received from a server is a voice signal, a text signal, or the like. A voice signal serving as a response signal is, for example, a PCM (pulse code modulation) signal or a signal compressed in the mp3 file format, and the response signal output unit 40 outputs that voice signal to the speaker 120. When the response signal is a text signal, the response signal output unit 40 generates, by voice synthesis processing, a voice signal that the speaker 120 can output based on the text signal, and outputs it to the speaker 120.
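The type-dependent handling by the response signal output unit 40 might look like the following sketch. The response representation and `synthesize_speech` are assumptions, the latter standing in for a real text-to-speech engine.

```python
# Hedged sketch: audio responses (PCM / mp3) are passed to the speaker as-is,
# while text responses are first converted to audio by voice synthesis.

def synthesize_speech(text):
    # placeholder for a real text-to-speech engine
    return {"kind": "audio", "format": "pcm", "data": f"<speech:{text}>"}

def output_response(response, speaker):
    if response["kind"] == "audio":      # PCM signal or mp3-compressed signal
        speaker.append(response)
    elif response["kind"] == "text":     # text signal: synthesize first
        speaker.append(synthesize_speech(response["data"]))
    else:
        raise ValueError(f"unknown response kind: {response['kind']}")
```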
 The speaker 120 outputs voice based on the response signal.
 FIG. 6 is a diagram showing the hardware configuration of the voice dialogue device 101 according to the second embodiment.
 The voice dialogue device 101 includes a main processing unit 93 and a program recording medium 94. The main processing unit 93 corresponds to the processing circuits shown in FIGS. 2 and 3. The program recording medium 94 corresponds to the memory 92 shown in FIG. 3.
 The functions of the voice signal acquisition unit 10, the wakeup word dividing unit 20, the communication processing unit 30, and the response signal output unit 40 in the second embodiment are realized by the main processing unit 93. The program recording medium 94 stores a voice dialogue program in which these functions are described. Each of the above functions is realized by the main processing unit 93 executing the voice dialogue program.
 FIG. 7 is a flowchart showing the voice dialogue method in the second embodiment.
 In step S10, the voice signal acquisition unit 10 receives the input voice signal corresponding to the voice uttered by the user. As in the first embodiment, the user utters "OK, anybody. Where can I buy Company X's products?", and the voice signal acquisition unit 10 acquires the input voice signal corresponding to that voice.
 In step S20, the wakeup word dividing unit 20 analyzes whether the input voice signal contains a wakeup word. The wakeup words to be detected are registered in the voice dialogue device 101 in advance. Here, the individual wakeup words indicating specific servers and the universal wakeup words are registered in advance as the wakeup words to be detected.
 In step S30, the wakeup word dividing unit 20 determines whether a wakeup word has been detected. If a wakeup word has been detected, step S40 is executed. If not, the voice dialogue method ends.
 In step S40, the wakeup word dividing unit 20 determines whether the detected wakeup word is a universal wakeup word. If it is not a universal wakeup word, that is, if the detected wakeup word is an individual wakeup word indicating a specific server, step S50 is executed. If it is a universal wakeup word, step S60 is executed.
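The branch in steps S30 through S60 amounts to classifying the detected wakeup word and choosing the destination set. A hedged sketch follows; the individual wakeup words ("OK, AAA" and so on) are illustrative and not taken from the patent.

```python
# Sketch of the destination selection: a universal wakeup word selects all
# servers (step S60); an individual wakeup word selects one server (step S50).

UNIVERSAL_WAKEUP_WORDS = {"OK, anybody", "OK, everybody"}
INDIVIDUAL_WAKEUP_WORDS = {          # illustrative mapping, word -> server
    "OK, AAA": "first server 210",
    "OK, BBB": "second server 220",
    "OK, CCC": "third server 230",
}
ALL_SERVERS = list(INDIVIDUAL_WAKEUP_WORDS.values())

def select_destinations(wakeup_word):
    if wakeup_word in UNIVERSAL_WAKEUP_WORDS:
        return ALL_SERVERS                               # step S60
    if wakeup_word in INDIVIDUAL_WAKEUP_WORDS:
        return [INDIVIDUAL_WAKEUP_WORDS[wakeup_word]]    # step S50
    return []                                            # no wakeup word detected
```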
 In step S50, the wakeup word dividing unit 20 selects the specific server as the transmission destination.
 In step S60, the wakeup word dividing unit 20 selects the plurality of servers 200 as the transmission destinations.
 In step S70, the wakeup word dividing unit 20 deletes the universal wakeup word from the input voice signal to generate the main voice signal. Here, the wakeup word dividing unit 20 deletes the voice signal corresponding to the universal wakeup word "OK, anybody" from the input voice signal, and generates the main voice signal corresponding to "Where can I buy Company X's products?".
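Step S70 can be illustrated on recognized text as below; the actual device deletes the corresponding portion of the voice signal itself, so this string-based version is only an analogy.

```python
# Minimal sketch of step S70: strip the leading universal wakeup word to
# obtain the main signal that is sent to the servers.

UNIVERSAL_WAKEUP_WORDS = ["OK, anybody", "OK, everybody"]

def to_main_signal(input_text):
    for w in UNIVERSAL_WAKEUP_WORDS:
        if input_text.startswith(w):
            return input_text[len(w):].lstrip(" .,")  # drop the wakeup word
    return input_text

print(to_main_signal("OK, anybody. Where can I buy Company X's products?"))
# → Where can I buy Company X's products?
```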
 ステップS80にて、通信処理部30は、ステップS50もしくはS60で選択されたサーバに音声信号を送信する。すなわち、音声対話処理がステップS50を経た場合には、通信処理部30は入力音声信号を特定のサーバに送信する。音声対話処理がステップS60およびS70を経た場合には、通信処理部30は主音声信号を複数のサーバ200に送信する。 In step S80, the communication processing unit 30 transmits an audio signal to the server selected in step S50 or S60. That is, when the voice dialogue process goes through step S50, the communication processing unit 30 transmits the input voice signal to a specific server. When the voice dialogue processing goes through steps S60 and S70, the communication processing unit 30 transmits the main voice signal to the plurality of servers 200.
 In step S90, response signal reproduction processing is executed. FIG. 8 is a flowchart showing the response signal reproduction processing in the second embodiment.
 In step S91, the communication processing unit 30 receives response signals from the plurality of servers 200.
 In step S92, the response signal output unit 40 outputs a response signal to the speaker 120. When the response signal received from any of the servers is a text signal, the response signal output unit 40 generates a voice signal based on that text signal by speech synthesis processing and outputs it to the speaker 120. Through this processing, the speaker 120 can reproduce the response voices in the order in which the response signals are received from the plurality of servers 200.
 In step S93, the communication processing unit 30 determines whether response signals have been received from all the target servers. A target server is a server to which a voice signal was transmitted in step S80, that is, the specific server or the plurality of servers 200. If response signals have not yet been received from all the target servers, step S91 is executed again. If response signals have been received from all the target servers, the response signal reproduction processing ends, and the voice dialogue method shown in FIG. 7 ends.
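 Steps S91 to S93 amount to a receive-and-play loop that terminates once every target server has answered. A sketch follows, with a thread-safe queue standing in for the communication processing unit; the queue-based interface is an assumption.

```python
import queue

def play_responses_as_received(target_servers, responses, play):
    """Steps S91-S93: receive response signals in arrival order,
    output each one immediately, and stop once every target server
    has responded."""
    remaining = set(target_servers)
    while remaining:
        server, response = responses.get()   # step S91: receive
        play(response)                       # step S92: output to speaker
        remaining.discard(server)            # step S93: all received?
```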
 To summarize, the wakeup word division unit 20 in the second embodiment deletes the universal wakeup word from the input voice signal to generate the main voice signal, and transmits that main voice signal to the plurality of servers 200.
 Even when the plurality of servers 200 cannot recognize the universal wakeup word, the voice dialogue device 101 in the second embodiment transmits to the servers only the main voice signal corresponding to the actual content of the inquiry. The accuracy of the voice dialogue therefore improves.
 Further, simply by connecting to the plurality of servers 200 already in operation, the voice dialogue device 101 in the second embodiment allows the user to complete inquiries to the plurality of servers 200 in a single utterance.
 Further, the response signal output unit 40 in the second embodiment outputs the response signals to the speaker 120 in the order in which they are received from the plurality of servers 200. The response voices can therefore be reproduced in the order in which the response signals were returned.
 <Embodiment 3>
 The voice dialogue device and voice dialogue method according to the third embodiment will now be described. The third embodiment is a subordinate concept of the first embodiment, and the voice dialogue device in the third embodiment includes each component of the voice dialogue device 100 in the first embodiment. Descriptions of configurations and operations identical to those of the first or second embodiment are omitted.
 FIG. 9 is a block diagram showing the configuration of the voice dialogue device 102 in the third embodiment.
 Each of the plurality of servers 200 in the third embodiment recognizes its own individual wakeup word but does not recognize the universal wakeup word. The first server 210 recognizes "AAA" as its individual wakeup word, the second server 220 recognizes "BBB", and the third server 230 recognizes "OK, CCC". "AAA", "BBB" and "CCC" are, for example, names or abbreviations of voice recognition processing services. For example, when the user calls out "Hey, BBB", the second server 220 recognizes the wakeup word "BBB" and starts voice recognition processing. Likewise, when the user calls out "OK, CCC", the third server 230 recognizes the wakeup word "OK, CCC" and starts voice recognition processing.
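 A registry like the following could model this mapping. The data structure and the substring matching rule are illustrative assumptions; the patent only states that the words are registered in advance.

```python
# Individual wakeup words registered in advance (the words are the
# patent's placeholders; the server identifiers are hypothetical).
INDIVIDUAL_WAKEUP_WORDS = {
    "AAA": "server_210",
    "BBB": "server_220",
    "OK, CCC": "server_230",
}

def resolve_server(utterance):
    """Return the server whose individual wakeup word appears in the
    utterance, or None if no registered word matches."""
    for word, server in INDIVIDUAL_WAKEUP_WORDS.items():
        if word in utterance:
            return server
    return None
```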
 In the voice dialogue device 102, in addition to the universal wakeup word, the individual wakeup word of each of the plurality of servers 200 is registered in advance.
 The voice dialogue device 102 includes, in addition to the voice signal acquisition unit 10 and the wakeup word division unit 20 of the first embodiment, a wakeup word addition unit 50, the communication processing unit 30 and the response signal output unit 40. The wakeup word division unit 20 differs from that of the first embodiment in the functions described below.
 As in the second embodiment, the wakeup word division unit 20 generates a main voice signal by deleting the universal wakeup word from the input voice signal. In addition, the wakeup word division unit 20 of the third embodiment transmits a voice signal to each specific server indicated by the individual wakeup words that the wakeup word addition unit 50, described below, attaches to the main voice signal. In the third embodiment, the wakeup word division unit 20 transmits the voice signals via the communication processing unit 30.
 When the universal wakeup word is included in the input voice signal, the wakeup word addition unit 50 attaches to the main voice signal an individual voice signal corresponding to the individual wakeup word of each of the plurality of servers 200. The wakeup word addition unit 50 in the third embodiment concatenates the individual voice signal in front of the main voice signal to generate the voice signal. The individual voice signals are stored in the memory 92, for example as fixed values.
 The communication processing unit 30 is connected to the network 130 and transmits the voice signals output from the wakeup word division unit 20 to the servers. The communication processing unit 30 also receives the response signals transmitted from the servers and outputs them to the response signal output unit 40.
 The response signal output unit 40 receives response signals from the plurality of servers 200. In the third embodiment, the response signal output unit 40 receives the response signals via the communication processing unit 30. Each response signal in the third embodiment also includes an effectiveness signal indicating the validity of the response. The response signal output unit 40 outputs a response signal to the speaker 120 based on the effectiveness signal. For example, when a response is judged to be valid, the response signal output unit 40 outputs the response signal to the speaker 120, and the speaker 120 outputs voice based on that response signal.
 FIG. 10 is a diagram showing an example of a response signal including an effectiveness signal in the third embodiment. FIG. 10 shows a response signal described in JSON (JavaScript (registered trademark) Object Notation) format. "effective" is the effectiveness signal, and "payload" is the content of the response to be reproduced. When the value of "effective" is "yes", the response signal output unit 40 outputs the response signal to the speaker 120, and voice is reproduced from the speaker 120. When the value of "effective" is "no", the response signal output unit 40 does not output the response signal to the speaker 120, so no voice is reproduced from the speaker 120. "payload" may be data in which a binary voice signal such as PCM (pulse code modulation) or MP3 has been converted into text form, for example in BASE64 format. Alternatively, "payload" may be a character string such as "Products of company X can be purchased at the online store". In that case, as described above, the response signal output unit 40 generates the voice signal corresponding to the text by speech synthesis processing.
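 Handling a response of the FIG. 10 form could look like the following sketch. The field names follow the figure; the BASE64-encoded-PCM case is just one of the payload encodings the text mentions, and the function name is an assumption.

```python
import base64
import json

def handle_response(raw):
    """Return the audio bytes to play for a FIG.-10-style response,
    or None when the effectiveness signal is "no"."""
    response = json.loads(raw)
    if response.get("effective") != "yes":
        return None                      # suppress the invalid answer
    return base64.b64decode(response["payload"])

valid = json.dumps({"effective": "yes",
                    "payload": base64.b64encode(b"\x00\x01pcm").decode()})
invalid = json.dumps({"effective": "no", "payload": ""})
```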
 The functions of the wakeup word division unit 20, the wakeup word addition unit 50, the communication processing unit 30 and the response signal output unit 40 described above are realized by the processing circuit shown in FIG. 2 or FIG. 3.
 FIG. 11 is a flowchart showing the voice dialogue method in the third embodiment.
 Steps S10 to S70 are the same as in the second embodiment. Step S100 is executed after step S70.
 In step S100, the wakeup word addition unit 50 attaches the individual voice signal corresponding to each individual wakeup word to the main voice signal. For example, the wakeup word addition unit 50 concatenates the individual voice signal corresponding to "Hey, BBB", which indicates the second server 220, in front of the main voice signal "Where can I buy the products of company X?" to generate a voice signal corresponding to "Hey, BBB, where can I buy the products of company X?". Similarly, the wakeup word addition unit 50 concatenates the individual voice signal corresponding to "OK, CCC", which indicates the third server 230, in front of the main voice signal to generate a voice signal corresponding to "OK, CCC, where can I buy the products of company X?".
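 The concatenation in step S100 can be sketched as follows, with byte strings standing in for the stored PCM signals. The stored-signal table and its contents are illustrative assumptions.

```python
# Individual wakeup-word signals held as fixed values, e.g. in memory 92
# (placeholder bytes stand in for real audio data).
STORED_WAKEUP_SIGNALS = {
    "server_220": b"HEY_BBB|",
    "server_230": b"OK_CCC|",
}

def build_per_server_signals(main_signal):
    """Step S100: prepend each server's individual wakeup-word signal
    to the main voice signal, producing one voice signal per server."""
    return {server: wakeup + main_signal
            for server, wakeup in STORED_WAKEUP_SIGNALS.items()}
```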
 In step S110, the communication processing unit 30 transmits the voice signals to the server or servers selected in step S50 or S60.
 In step S120, response signal reproduction processing is executed. FIG. 12 is a flowchart showing the response signal reproduction processing in the third embodiment.
 In step S121, the communication processing unit 30 receives response signals from the plurality of servers 200.
 In step S122, the response signal output unit 40 determines whether a response signal is valid based on its effectiveness signal. If it is valid, step S123 is executed. If it is not valid, step S124 is executed.
 In step S123, the response signal output unit 40 outputs the response signal to the speaker 120.
 In step S124, the communication processing unit 30 determines whether response signals have been received from all the target servers. If response signals have not yet been received from all the target servers, step S121 is executed again. If response signals have been received from all the target servers, the response signal reproduction processing ends, and the voice dialogue method shown in FIG. 11 ends.
 To summarize, the voice dialogue device 102 in the third embodiment includes the wakeup word addition unit 50. When the universal wakeup word is included in the input voice signal, the wakeup word addition unit 50 attaches to the voice signal (in the third embodiment, the main voice signal) an individual voice signal corresponding to the individual wakeup word of each of the plurality of servers 200. Based on the individual voice signal attached to the voice signal, the wakeup word division unit 20 transmits a voice signal to each specific server indicated by the individual wakeup word.
 When each of the plurality of servers 200 cannot recognize the universal wakeup word and requires the individual wakeup word indicating itself, the voice dialogue device 102 transmits to each server a voice signal to which that server's individual wakeup word has been attached. The accuracy of the voice dialogue with each server therefore improves.
 The voice dialogue device 102 in the third embodiment also includes the response signal output unit 40. The response signal output unit 40 receives a plurality of response signals to the voice signals from the plurality of servers 200 and outputs the response signals to the voice output device based on the effectiveness signal, included in each response signal, that indicates the validity of the response.
 Such a voice dialogue device 102 can have the voice output device reproduce only the valid answers among the responses received from the servers. For example, suppose the content of the responses from the first server 210 and the second server 220 is "I do not know" with the effectiveness signal set to "invalid", while the content of the response from the third server 230 is "Products of company X can be purchased from company X's online store" with the effectiveness signal set to "valid". In that case, the voice dialogue device 102 has the voice output device reproduce only the response from the third server 230.
 When an inquiry is made with the universal wakeup word, the user is not necessarily asking for responses from all the servers. The voice dialogue device 102 can have the voice output device preferentially reproduce good answers, that is, information-rich answers.
 (Modification of Embodiment 3)
 The voice dialogue device 102 and voice dialogue method in a modification of the third embodiment will be described. Descriptions of configurations and operations identical to those of the third embodiment are omitted.
 The universal wakeup word in the modification of the third embodiment indicates the plurality of servers 200 other than a specific server. For example, the universal wakeup word is "OK, other than AAA", which indicates the second server 220 and the third server 230, that is, the servers other than the first server 210.
 In step S60 of FIG. 11, the wakeup word division unit 20 selects the second server 220 and the third server 230 as the transmission destinations, that is, the plurality of servers 200 other than the specific server. The subsequent steps are the same as in FIG. 11, and the wakeup word division unit 20 transmits voice signals to the second server 220 and the third server 230.
 <Embodiment 4>
 The voice dialogue device shown in each of the above embodiments can also be applied to a system constructed by appropriately combining a navigation device, a communication terminal, a server and the functions of the applications installed on them. Here, the navigation device includes, for example, a PND (Portable Navigation Device). The communication terminal includes, for example, mobile terminals such as mobile phones, smartphones and tablets.
 FIG. 13 is a block diagram showing the configuration of the voice dialogue device 100 in the fourth embodiment and the devices that operate in connection with it.
 The voice dialogue device 100 and a communication device 150 are provided in a wakeup word recognition server 300. The voice dialogue device 100 acquires the input voice signal from the microphone 110 provided in the vehicle 1 via the communication device 140 and the communication device 150. When the universal wakeup word is included in the input voice signal, the voice dialogue device 100 transmits a voice signal based on the input voice signal to the plurality of servers 200. The voice dialogue device 100 receives the response signals from the plurality of servers 200 and outputs them, via the communication devices, to the speaker 120 provided in the vehicle 1.
 By arranging the voice dialogue device 100 in the wakeup word recognition server 300 in this way, the configuration of the in-vehicle device can be simplified.
 Alternatively, the functions or components of the voice dialogue device 100 may be arranged in a distributed manner, with some provided in the wakeup word recognition server 300 and others provided in the vehicle 1.
 Within the scope of the invention, the embodiments may be freely combined, and each embodiment may be modified or omitted as appropriate.
 Although the present invention has been described in detail, the above description is illustrative in all aspects, and the present invention is not limited to it. It is understood that innumerable variations not illustrated here can be envisioned without departing from the scope of the invention.
 10 voice signal acquisition unit, 20 wakeup word division unit, 30 communication processing unit, 40 response signal output unit, 50 wakeup word addition unit, 94 program recording medium, 100 voice dialogue device, 110 microphone, 120 speaker, 200 plurality of servers.

Claims (7)

  1.  A voice dialogue device that transmits a voice signal to a server that performs voice recognition processing on a voice uttered by a user, the voice dialogue device comprising:
     a voice signal acquisition unit that acquires an input voice signal corresponding to the voice; and
     a wakeup word division unit that, when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmits the voice signal based on the input voice signal to the plurality of servers.
  2.  The voice dialogue device according to claim 1, wherein the wakeup word division unit generates a main voice signal by deleting the universal wakeup word from the input voice signal, and transmits the main voice signal as the voice signal to the plurality of servers.
  3.  The voice dialogue device according to claim 2, further comprising a wakeup word addition unit that, when the universal wakeup word is included in the input voice signal, attaches to the voice signal an individual voice signal corresponding to an individual wakeup word indicating each of the plurality of servers,
     wherein the wakeup word division unit transmits the voice signal, based on the individual voice signal attached to the voice signal, to each specific server indicated by the individual wakeup word.
  4.  The voice dialogue device according to claim 1, wherein the universal wakeup word indicates the plurality of servers other than a specific server, and the wakeup word division unit transmits the voice signal based on the input voice signal to the plurality of servers other than the specific server.
  5.  The voice dialogue device according to claim 1, further comprising a response signal output unit that receives a plurality of response signals to the voice signal from the plurality of servers and outputs the plurality of response signals to a voice output device based on an effectiveness signal that is included in each of the plurality of response signals and indicates the validity of the response.
  6.  A voice dialogue method for transmitting a voice signal to a server that performs voice recognition processing on a voice uttered by a user, the voice dialogue method comprising:
     acquiring an input voice signal corresponding to the voice; and
     when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmitting the voice signal based on the input voice signal to the plurality of servers.
  7.  A computer-readable program recording medium on which is recorded a voice dialogue program for causing a computer to function as a voice dialogue device that transmits a voice signal to a server that performs voice recognition processing on a voice uttered by a user,
     wherein the voice dialogue program causes the computer to function as:
     a voice signal acquisition unit that acquires an input voice signal corresponding to the voice; and
     a wakeup word division unit that, when a universal wakeup word indicating a plurality of servers that perform the voice recognition processing is included in the input voice signal, transmits the voice signal based on the input voice signal to the plurality of servers.