WO2015118578A1 - Multimodal input device, and timeout control method in a terminal device and in a multimodal input device - Google Patents

Multimodal input device, and timeout control method in a terminal device and in a multimodal input device

Info

Publication number
WO2015118578A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
information
unit
monitoring
semantic information
Prior art date
Application number
PCT/JP2014/000686
Other languages
English (en)
Japanese (ja)
Inventor
Isamu Ogawa (小川 勇)
Original Assignee
Mitsubishi Electric Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corporation
Priority to PCT/JP2014/000686
Publication of WO2015118578A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals

Definitions

  • This invention relates to a multimodal input device for acquiring information input by a plurality of input methods having different formats.
  • The time required from the start of the user's input operation to the completion of input of the necessary information depends on the input method of each device, and the time required to obtain the semantic information of the input operation from the input information depends on that information. Consequently, the time from the start of a user's input operation until the semantic information of that operation is acquired differs for each device. The waiting time that the information processing apparatus described in Patent Document 1 observes before determining whether all necessary semantic information has been gathered therefore had to be decided with the device taking the longest to yield semantic information in mind.
  • The present invention has been made to solve the above-described problems, and its object is to obtain a multimodal input device that, when a necessary input operation is not performed, shortens the time required to determine that the operation has not been performed.
  • The semantic information indicating the meaning of an input operation, obtained from the information entered in that operation, is expressed in an abstract form that does not depend on the specific input method. In the following, this abstracted semantic information is referred to as abstraction information.
  • The multimodal input device of the present invention comprises an input detection unit that detects that semantic information indicating the meaning of each input operation of a plurality of input methods having different formats has been acquired, and that detects, for an input method other than the one for which acquisition of semantic information was detected, that an input operation has started; and a monitoring processing unit that, based on the detection results of the input detection unit, monitors non-execution of the input operation of the input method for which acquisition of semantic information was not detected.
  • The terminal device of the present invention is a terminal device whose input operations are monitored by a server device that monitors non-execution of the input operations of a plurality of input methods. The server device detects that semantic information indicating the meaning of each input operation of the plurality of input methods has been acquired, detects that an input operation has started for an input method other than the one for which acquisition of semantic information was detected, and performs its monitoring based on these detection results. When the terminal device accepts an input operation by a given input method, it outputs to the server device input start information indicating that the input operation by that method has started.
  • The timeout control method of the present invention is a timeout control method for a multimodal input device in which input is performed by a plurality of input methods having different formats: it detects that semantic information indicating the meaning of each input operation of the plurality of input methods has been acquired, detects that an input operation has started for an input method other than the one for which acquisition was detected, and, based on these detections, monitors non-execution of the input operation of the method whose semantic information has not been acquired.
  • According to the multimodal input device of the present invention, it is detected that semantic information indicating the meaning of each input operation of a plurality of input methods having different formats has been acquired, and that an input operation has started for an input method other than the one for which acquisition was detected; based on these detections, non-execution of the input operation of the method whose semantic information has not been acquired is monitored. As a result, when the user does not perform a necessary input operation, the time required to determine that it has not been performed can be shortened.
  • According to the terminal device of the present invention, when the terminal device accepts an input operation by a given input method, it outputs to the server device input start information indicating that the operation has started. The server device, which monitors non-execution of the input operations of methods other than the one for which acquisition of semantic information was detected, can thereby detect that an input operation has started for such a method, so the time until the server device determines that a necessary input operation has not been performed can be shortened.
  • According to the timeout control method of the multimodal input device of the present invention, the same detections and monitoring are performed, so that, when the user does not perform a necessary input operation, the time required to determine that it has not been performed can likewise be shortened.
  • FIG. 1 is a block diagram showing a functional configuration of a multimodal input apparatus according to Embodiment 1 of the present invention.
  • In the following, voice input and gesture input are described as examples of input methods to the multimodal input device. The present invention does not limit the input methods to these two; various other methods, such as gaze detection, facial-expression detection, handwriting input, and keyboard input, may be used, and three or more of these input methods may be combined.
  • The multimodal input device of this embodiment includes a voice acquisition unit 101, a voice information abstraction unit 102, a gesture acquisition unit 103, a gesture information abstraction unit 104, an operation processing unit 105, a timeout control unit 106, and a guidance unit 107.
  • The voice acquisition unit 101 converts the user's speech, acquired by a sound collection device (not shown) such as a microphone, into a data signal (referred to as voice data).
  • The voice acquisition unit 101 detects the start and completion of the user's utterance based on the acquired voice data.
  • The voice acquisition unit 101 outputs the voice data acquired from the start to the completion of the utterance to the voice information abstraction unit 102 as the input information of the voice input.
  • The voice information abstraction unit 102 is the input information recognition unit for voice input: it recognizes the input voice data to obtain a speech recognition result, then acquires and outputs the abstraction information corresponding to that result.
  • The voice information abstraction unit 102 also outputs utterance start information (input start information for voice input), indicating the start of a voice input operation, to the operation processing unit 105.
  • The gesture acquisition unit 103 converts the user's gesture, acquired by an imaging device (not shown) such as a camera, into a data signal (gesture data).
  • The gesture acquisition unit 103 detects the start and completion of the user's gesture based on the gesture data.
  • The gesture acquisition unit 103 outputs the gesture data from the start to the completion of the gesture to the gesture information abstraction unit 104 as the input information of the gesture input.
  • The gesture information abstraction unit 104 is the input information recognition unit for gesture input: it recognizes the gesture data received from the gesture acquisition unit 103 to obtain a gesture recognition result, then acquires the abstraction information corresponding to that result and outputs it to the operation processing unit 105.
  • The operation processing unit 105 uses the abstraction information received from the voice information abstraction unit 102 and the gesture information abstraction unit 104 to determine and carry out the operation corresponding to that information. The timeout control unit 106 determines, based on information from the operation processing unit 105, both the completion of the user's input operations and a timeout when an input operation is not performed. Details of the information exchanged between the operation processing unit 105 and the timeout control unit 106 are described later.
  • The guidance unit 107 generates and outputs an acoustic signal for producing a guidance voice based on the signal output from the operation processing unit 105.
  • The acoustic signal is a digital or analog signal for generating sound from an audio output device such as a speaker.
  • Although guidance voice is used here, another output method may be used, such as an image signal for displaying the guidance on a screen.
  • FIG. 2 is a block diagram showing the detailed configuration of the timeout control unit 106.
  • The timeout control unit 106 includes an input detection unit 111 that processes information received from the operation processing unit 105, and a monitoring processing unit 112 that performs the monitoring process based on the processing results of the input detection unit 111.
  • The voice acquisition unit 101, voice information abstraction unit 102, gesture acquisition unit 103, gesture information abstraction unit 104, operation processing unit 105, timeout control unit 106, and guidance unit 107 described above, as well as the input detection unit 111 and monitoring processing unit 112 inside the timeout control unit 106, can be realized as a program executed on hardware comprising a general-purpose processor or a processor such as a DSP (Digital Signal Processor), volatile memory such as RAM (Random Access Memory), nonvolatile memory such as flash memory, and other peripheral circuits. They can also be realized by dedicated hardware such as an ASIC (Application Specific Integrated Circuit).
  • Next, the operation of the multimodal input device of this embodiment is described.
  • First, the operations of the voice acquisition unit 101 and voice information abstraction unit 102, and of the gesture acquisition unit 103 and gesture information abstraction unit 104, are described.
  • The processing of these units is performed independently for each input method, in response to the user's input operation to the device corresponding to that method, that is, to the input of an utterance or a gesture.
  • The voice acquisition unit 101 receives the sound signal acquired by a sound collection device such as a microphone, converts the voice uttered by the user into voice data, and detects the start and completion of the utterance. It then outputs the voice data from the start to the completion of the utterance to the voice information abstraction unit 102 as the input information of the voice input.
  • The voice data is, for example, PCM (Pulse Code Modulation) data obtained by digitizing the sound signal acquired by the sound collection device.
  • Various methods are conceivable for detecting the start and completion of an utterance: for example, extracting an acoustic feature quantity of the voice from the voice data and judging from that feature quantity, or extracting the amplitude of the sound signal from the voice data and judging from its magnitude, as sketched below.
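  • As one concrete illustration of the amplitude-based approach, the sketch below shows a minimal energy-threshold detector over 16-bit PCM frames. It is an assumption-laden toy: the frame handling, threshold, and hangover length are illustrative values, not parameters from this publication.

```python
import numpy as np

def detect_utterance(frames, threshold=500.0, hangover=10):
    """Toy amplitude-based utterance start/completion detection.

    frames: list of numpy int16 arrays, one per fixed-size PCM frame.
    Returns (start_frame, end_frame) of the detected utterance, or None.
    """
    start, silent, in_speech = None, 0, False
    for i, frame in enumerate(frames):
        level = np.abs(frame.astype(np.float32)).mean()  # mean amplitude of frame
        if level >= threshold:
            if not in_speech:
                start, in_speech = i, True        # utterance start detected
            silent = 0
        elif in_speech:
            silent += 1
            if silent >= hangover:                # enough trailing silence:
                return start, i - hangover        # utterance completed
    return (start, len(frames) - 1) if in_speech else None
```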
  • When the voice information abstraction unit 102 determines that the user's utterance has started, it outputs utterance start information (input start information) indicating the start of the voice input operation to the operation processing unit 105.
  • The voice information abstraction unit 102 then performs speech recognition processing on the input voice data and acquires the voice input abstraction information corresponding to the recognition result.
  • The voice information abstraction unit 102 outputs the acquired abstraction information to the operation processing unit 105.
  • The voice information abstraction unit 102 can acquire the abstraction information from the speech recognition result (voice information) by holding a table such as the one shown in FIG. 3 in advance.
  • FIG. 3 shows an example in which the speech recognition result is text.
  • For example, when a recognition result corresponding to a switch operation is obtained, the voice information abstraction unit 102 searches the table shown in FIG. 3 and obtains the abstraction information "control: switch".
  • The gesture acquisition unit 103 converts image information such as a video signal acquired from the imaging device into gesture data, and detects the start and completion of the gesture. It then outputs the gesture data from the start to the completion of the gesture to the gesture information abstraction unit 104 as the input information of the gesture input.
  • The gesture data is digitized image signal data, and may be data compressed with a scheme such as JPEG (Joint Photographic Experts Group), Motion JPEG, or MPEG (Moving Picture Experts Group). The start and completion of the gesture can be determined, for example, by detecting the movement of a defined object in the image relative to the background.
  • The gesture information abstraction unit 104 performs gesture recognition on the input gesture data, acquires the abstraction information corresponding to the resulting gesture recognition result (gesture information), and outputs that abstraction information to the operation processing unit 105.
  • The gesture recognition result is one of a set of specific, predetermined gesture patterns, for example a "pointing action" or a "hand-waving action"; whether the input matches one of these patterns is determined by analyzing the images in the gesture data.
  • The gesture information abstraction unit 104 can acquire the abstraction information from the gesture recognition result by holding, for example, the table shown in FIG. 4 in advance. For example, when the user performs a gesture pointing at the power switch of the device to be operated and "pointing action" is obtained as the gesture recognition result, the gesture information abstraction unit 104 searches the table of FIG. 4 and obtains the abstraction information "control: switch". A sketch of such table lookups follows.
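  • FIG. 3 and FIG. 4 are not reproduced here, so the table entries below are assumptions made for illustration; the sketch only shows the shape of the lookup from a recognition result to abstraction information.

```python
# Hypothetical entries in the spirit of FIG. 3 and FIG. 4; the actual tables
# are defined by the system in which the multimodal input device is used.
SPEECH_TO_ABSTRACTION = {
    "turn on the switch": "control: switch",    # assumed recognition text
    "switch on the power": "control: switch",   # assumed recognition text
}
GESTURE_TO_ABSTRACTION = {
    "pointing action": "control: switch",
    "hand-waving action": "control: cancel",    # assumed entry
}

def abstraction_from_speech(recognized_text):
    """Return the abstraction information for a speech recognition result, if any."""
    return SPEECH_TO_ABSTRACTION.get(recognized_text)

def abstraction_from_gesture(gesture_pattern):
    """Return the abstraction information for a recognized gesture pattern, if any."""
    return GESTURE_TO_ABSTRACTION.get(gesture_pattern)
```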
  • This embodiment describes the case where a voice input operation requires a longer time than a gesture input operation, so the gesture information abstraction unit 104 does not output input start information indicating the start of the user's gesture input to the operation processing unit 105. When the gesture input operation is the one requiring the longer time, the start of gesture input may be notified to the operation processing unit 105 instead. Alternatively, both the voice information abstraction unit 102 and the gesture information abstraction unit 104 may output input start information notifying the start of their input operations.
  • FIG. 5 is a flowchart showing the operation flow of the operation processing unit 105.
  • The flowchart of FIG. 5 is one example of the operation flow of the operation processing unit 105 of this embodiment; the processing may follow a different procedure as long as an equivalent result is obtained.
  • The operation processing unit 105 waits for input of information from the voice information abstraction unit 102, the gesture information abstraction unit 104, and the timeout control unit 106, and, on receiving information, performs one of the following processes ST102, ST105, or ST107 according to the type of the received information (ST101).
  • The operation processing unit 105 receives utterance start information from the voice information abstraction unit 102, or abstraction information from the voice information abstraction unit 102 or the gesture information abstraction unit 104 (ST102). Following ST102, the operation processing unit 105 sends to the timeout control unit 106 a reception notification for the received information: an utterance start information reception notification (voice input start information reception notification) or a voice or gesture abstraction information reception notification (ST103). The timeout control unit 106 does not need the utterance start information itself or the abstraction information itself, but in this example the operation processing unit 105 outputs the received utterance start information itself as the utterance start information reception notification and the received abstraction information itself as the abstraction information reception notification.
  • When the received information is abstraction information, the operation processing unit 105 stores it (ST104). After ST104, the operation processing unit 105 again waits for information input from the voice information abstraction unit 102, the gesture information abstraction unit 104, and the timeout control unit 106.
  • The operation processing unit 105 receives an input completion notification from the timeout control unit 106 (ST105). Details of the input completion notification are described later.
  • On receiving the input completion notification, the operation processing unit 105 determines and executes the operation corresponding to the contents of the stored abstraction information received from the voice information abstraction unit 102 and the gesture information abstraction unit 104 (ST106). After ST106, it again waits for information input from the three units.
  • The processing that the operation processing unit 105 performs according to the contents of the abstraction information is defined appropriately by the system to which the multimodal input device is applied.
  • For example, on receiving abstraction information such as "control: switch", the operation processing unit 105 determines whether the power switch of the operation target device can be operated.
  • The operation processing unit 105 then sends an instruction to the guidance unit 107 to output the guidance voice "Please operate the power switch".
  • On receiving this instruction, the guidance unit 107 generates and outputs the acoustic signal of the guidance voice "Please operate the power switch". The guidance voice output from the speaker lets the user know that the power switch can be operated.
  • The operation processing unit 105 receives a first or second timeout detection notification from the timeout control unit 106 (ST107). Details of the timeout detection notifications are described later. On receiving a timeout detection notification, the operation processing unit 105 determines whether it is the first timeout detection notification (ST108).
  • If it is, the operation processing unit 105 determines whether the voice input abstraction information has been acquired (ST109). If the voice input abstraction information has been acquired, the operation processing unit 105 sends an instruction to the guidance unit 107 to output guidance requesting gesture input (ST110); on receiving this instruction, the guidance unit 107 generates and outputs the acoustic signal of the guidance voice "Please input a gesture".
  • If the voice input abstraction information has not been acquired, the operation processing unit 105 sends an instruction to the guidance unit 107 to output guidance requesting voice input (ST111); on receiving this instruction, the guidance unit 107 generates and outputs the acoustic signal of the guidance voice "Please input voice".
  • If the operation processing unit 105 determines in ST108 that the received notification is not the first timeout detection notification, it sends an instruction to the guidance unit 107 to output guidance for interrupting the acceptance of input operations (ST112); on receiving this instruction, the guidance unit 107 generates and outputs the acoustic signal of the guidance voice "The reception of input is interrupted". A sketch of this dispatch flow follows.
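  • The flow of ST101-ST112 amounts to an event dispatch loop. The sketch below is a minimal illustration under assumed message names and helper interfaces (`timeout_ctrl.notify`, `guidance.say`, and the message kinds are inventions for this sketch, not names from the publication).

```python
import queue

def execute_operation(stored):
    """Placeholder for ST106: act on the gathered abstraction information."""
    print("executing operation for:", stored)

def operation_processing_loop(inbox: queue.Queue, timeout_ctrl, guidance):
    """Toy version of the ST101-ST112 flow of the operation processing unit."""
    stored = {}                                  # abstraction info per input method
    while True:
        kind, payload = inbox.get()              # ST101: wait for any information
        if kind in ("utterance_start", "abstraction"):
            timeout_ctrl.notify(kind, payload)   # ST103: forward reception notification
            if kind == "abstraction":            # ST104: store abstraction info
                stored[payload["method"]] = payload["abstraction"]
        elif kind == "input_complete":           # ST105
            execute_operation(stored)            # ST106
            stored.clear()
        elif kind == "timeout1":                 # ST107/ST108: first timeout
            if "voice" in stored:                # ST109/ST110
                guidance.say("Please input a gesture")
            else:                                # ST111
                guidance.say("Please input voice")
        elif kind == "timeout2":                 # ST112: interrupt input acceptance
            guidance.say("The reception of input is interrupted")
```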
  • The timeout control unit 106 performs its processing based on the abstraction information of each input method received from the operation processing unit 105 and on the input start information of the input method whose time from the start of the user's input operation until acquisition of the abstraction information is longer than that of the other methods (in this embodiment, the utterance start information). As noted above, the timeout control unit 106 does not need the input start information itself or the abstraction information itself; for example, the operation processing unit 105 may generate and use dedicated reception notifications for them, or the timeout control unit may operate on any other input from which it can detect that an input operation has started and that abstraction information has been acquired.
  • FIG. 6 is a flowchart showing the operation flow of the timeout control unit 106 of this embodiment.
  • The timeout control unit 106 performs the reception notification process of ST200 and the monitoring process of ST300 shown in the flowchart of FIG. 6.
  • The reception notification process is performed by the input detection unit 111, and the monitoring process by the monitoring processing unit 112. Details of both processes are described below.
  • The input detection unit 111 determines whether the abstraction information of gesture input, the method requiring the shorter time until abstraction information is acquired, has been received from the operation processing unit 105 (ST201). On receiving the gesture input abstraction information, the input detection unit 111 stores that it has been received (ST202). Next, it checks whether the voice input abstraction information, the abstraction information of the other input method, has already been received, that is, whether the abstraction information of both input methods is available (ST203).
  • When the voice input abstraction information has also been received, the input detection unit 111 performs control to stop the count of any timer that is counting (timer A, timer B, or both) and, since the abstraction information of both voice input and gesture input is now available, outputs an input completion notification to the operation processing unit 105 (ST204). After ST204, the input detection unit 111 ends the reception notification process.
  • Timer A is a timer for monitoring non-execution of the user's voice or gesture input: when a prescribed first waiting time elapses without the expected input operation, the monitoring processing unit 112 outputs a first timeout detection notification to the operation processing unit 105.
  • Timer B is a timer for cancelling the input operations performed so far when a prescribed second waiting time elapses without an input operation of the other input method: on its expiration, the monitoring processing unit 112 outputs a second timeout detection notification to the operation processing unit 105.
  • The actual processing related to the timers is performed by the monitoring processing unit 112, as described later.
  • The input detection unit 111 outputs control information instructing that a timer count be stopped, and the monitoring processing unit 112 receives this control information and stops the count.
  • The same applies both to the timer-stop control performed by the input detection unit 111 and to the timer-start control described below: the input detection unit 111 outputs the corresponding control information, and the monitoring processing unit 112 receives and processes it.
  • When the voice input abstraction information has not yet been received, the input detection unit 111 performs control to start timer A and timer B (ST205). Next, it determines whether utterance start information, which indicates that the input operation of voice input (the method requiring the longer time until abstraction information is acquired) has started, has been received from the operation processing unit 105 (ST206). If the utterance start information has been received, the input detection unit 111 performs control to stop the count of timer A, since the voice input operation has already started (ST207), and ends the reception notification process. If the utterance start information has not been received in ST206, the input detection unit 111 likewise ends the reception notification process.
  • When the gesture input abstraction information has not been received in ST201, the input detection unit 111 determines whether the voice input abstraction information has been received (ST208). If so, it stores the reception of the voice input abstraction information (ST209) and then determines whether the gesture input abstraction information has been received (ST210). If the gesture input abstraction information has not been received, control is performed to start the counts of timer A and timer B (ST211), and the input detection unit 111 ends the reception notification process. If the gesture input abstraction information has been received in ST210, the process proceeds to ST204 described above.
  • When neither abstraction information has been received, the input detection unit 111 determines whether utterance start information has been received from the operation processing unit 105 (ST212). If so, it stores the reception of the utterance start information (ST213) and determines whether the gesture input abstraction information has been received (ST214). If the gesture input abstraction information has been received, the count of timer A, started in ST205 when that abstraction information arrived, no longer needs to continue, so control to stop it is performed (ST215), and the input detection unit 111 ends the reception notification process.
  • Otherwise, the input detection unit 111 simply ends the reception notification process. Putting ST201-ST215 together, the logic can be sketched as follows.
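  • The sketch below is a toy consolidation of ST201-ST215; the flag names and the `timers` interface are assumptions, and the real units exchange the control information described above rather than direct method calls. It pairs with the monitoring sketch given after the ST301-ST308 description.

```python
class InputDetectionUnit:
    """Toy version of the ST201-ST215 reception notification process."""

    def __init__(self, timers, operation_unit):
        self.timers = timers               # monitoring unit exposing start()/stop()
        self.op = operation_unit
        self.got_gesture = False           # gesture abstraction info received?
        self.got_voice = False             # voice abstraction info received?
        self.utterance_started = False     # utterance start info received?

    def on_notification(self, kind):
        if kind == "gesture_abstraction":        # ST201/ST202
            self.got_gesture = True
            if self.got_voice:                   # ST203 -> ST204
                self._complete()
            else:
                self.timers.start("A"); self.timers.start("B")   # ST205
                if self.utterance_started:       # ST206/ST207: voice underway
                    self.timers.stop("A")
        elif kind == "voice_abstraction":        # ST208/ST209
            self.got_voice = True
            if self.got_gesture:                 # ST210 -> ST204
                self._complete()
            else:
                self.timers.start("A"); self.timers.start("B")   # ST211
        elif kind == "utterance_start":          # ST212/ST213
            self.utterance_started = True
            if self.got_gesture:                 # ST214/ST215: stop monitoring
                self.timers.stop("A")            # non-execution of voice input

    def _complete(self):
        """ST204: both kinds of abstraction information are available."""
        self.timers.stop("A"); self.timers.stop("B")
        self.op.input_complete()
```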
  • The monitoring processing unit 112 determines whether control information instructing a timer count to stop or start has been output from the input detection unit 111 (ST301).
  • If so, the monitoring processing unit 112 receives the control information and stops or starts the counts of timer A and timer B accordingly (ST302).
  • Timer A and timer B are timers that keep counting, adding 1 each time a predetermined time elapses, until a predetermined count expiration value is reached.
  • The count expiration value of timer B is set larger than that of timer A.
  • Next, the monitoring processing unit 112 determines whether any timer is counting (ST303). If there is a timer that is counting, the monitoring processing unit 112 updates it (ST304); that is, 1 is added again once the predetermined time has passed since the previous addition.
  • The monitoring processing unit 112 then determines whether timer A has reached its count expiration value (ST305).
  • When timer A has expired, the timeout control unit 106 outputs a first timeout detection notification to the operation processing unit 105 (ST306).
  • The processing of the operation processing unit 105 on receiving the first timeout detection notification is as described above.
  • The monitoring processing unit 112 likewise determines whether timer B has reached its count expiration value (ST307); when timer B expires, it outputs a second timeout detection notification to the operation processing unit 105 (ST308). The processing of the operation processing unit 105 on receiving the second timeout detection notification is as described above. When no timer is counting in ST303, when timer B has not expired in ST307, and after the processing of ST308, the monitoring processing unit 112 ends the monitoring process. A sketch of this timer update loop follows.
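  • As an illustration of ST301-ST308, the periodic timer update might be sketched as below; the tick interval and expiration counts are placeholder values, and `first_timeout`/`second_timeout` stand in for the first and second timeout detection notifications.

```python
class MonitoringProcessingUnit:
    """Toy version of the ST301-ST308 monitoring process; tick() is called
    once every predetermined interval."""

    def __init__(self, operation_unit, expire_a=20, expire_b=60):
        self.op = operation_unit
        self.expire = {"A": expire_a, "B": expire_b}   # B must exceed A
        self.count = {}                                # timers currently counting

    def start(self, name):                             # ST302: start a timer count
        self.count[name] = 0

    def stop(self, name):                              # ST302: stop a timer count
        self.count.pop(name, None)

    def tick(self):
        for name in list(self.count):                  # ST303/ST304: update timers
            self.count[name] += 1
            if self.count[name] >= self.expire[name]:  # count expiration reached
                self.stop(name)
                if name == "A":                        # ST305/ST306
                    self.op.first_timeout()
                else:                                  # ST307/ST308
                    self.op.second_timeout()
```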
  • The flows of FIGS. 6, 7, and 8 are examples of the operation flow of the timeout control unit 106 of this embodiment; the processing may follow a different procedure as long as an equivalent result is obtained.
  • For example, when the gesture input abstraction information is received, the monitoring processing unit 112 starts counting timer A and timer B in ST302 as a result of ST205 in the input detection unit 111. If the voice input operation is then not performed and the timeout control unit 106 receives no utterance start information, the monitoring processing unit 112 keeps updating timer A and timer B; when timer A expires, the monitoring processing unit 112 outputs a first timeout detection notification to the operation processing unit 105 in ST306.
  • Because the count of timer A is stopped when the utterance start information is received (ST215), the time monitored by timer A only needs to cover the interval until the user starts the voice input operation. It is therefore unnecessary to make the count expiration value of timer A long enough to also cover the time the multimodal input device needs to acquire the abstraction information, so the first timeout can be detected, and guidance voice given to the user, in a shorter time.
  • As described above, the timeout control unit 106 of this embodiment uses the utterance start information, that is, the input start information of voice input, the method for which the time from the start of the user's input operation until the multimodal input device acquires the abstraction information is long, and stops timer A, which monitors non-execution of the input operation, when the utterance start information is received, thereby shortening the time until the timer count expires.
  • As a result, the multimodal input device can determine that there is no input, and warn the user, in a shorter time, improving the efficiency of the user's input work.
  • Because the utterance start information output from the voice information abstraction unit 102 stops timer A when it is input, the multimodal input device also avoids warning a user who is in the middle of a voice input that input has been forgotten, which improves convenience.
  • Furthermore, since only the voice information abstraction unit 102, the unit that takes a long time from the start of input to the output of abstraction information, outputs utterance start information to the operation processing unit 105 as input start information, the amount of computation related to input start information in the multimodal input device can be kept small.
  • In Embodiment 1 described above, for the input method whose time from the start of the user's input operation until acquisition of the abstraction information is longer than that of the other methods, the count of timer A is stopped when its input start information is received, while for the other input methods it is stopped when their abstraction information is received.
  • However, if input start information can also be obtained for the other input methods, the same effect can be obtained by stopping the count of timer A (that is, ending the monitoring of unexecuted input operations) based on their input start information as well.
  • In that case, the start of an input operation of any of the input methods may be detected from the reception of its input start information instead of its abstraction information, and timer A may then start counting in anticipation of the semantic information being acquired.
  • Nevertheless, by using input start information only for the method whose time from the start of the input operation until the abstraction information is acquired is longer than that of the other methods, the amount of computation related to input start information in the multimodal input device can be suppressed, as described above.
  • This embodiment has described the case where the multimodal input device processes voice input and gesture input, but the present invention is not limited to these; other input methods may be adopted.
  • The number of input methods is also not limited to two: the same effect is obtained when three or more input methods are employed.
  • With three or more input methods, there may be several methods whose time until the abstraction information is acquired is long compared with the others. In such a case, timer A may be stopped once the start of the input operations of all the methods requiring a long time has been detected.
  • In this embodiment, the timeout control unit 106 determines input completion when the two kinds of abstraction information of the input operations are available, but it may also, where appropriate, determine completion on receiving the abstraction information of a single input operation. For example, a low-risk switch may be operated by voice alone, while a high-risk switch may require both voice and gesture input before the input is judged complete; distinguishing input operations according to the required degree of safety in this way improves convenience.
  • With such control, timeouts and input completion can still be detected appropriately.
  • The voice information abstraction unit 102 outputs the utterance start information and, when speech recognition fails or when no abstraction information corresponding to the text of the recognition result is found, it may output re-utterance information.
  • The timeout control unit 106 may then initialize the timers and start counting again when the re-utterance information is input.
  • The timer initialized at this time may be timer A only, or both timer A and timer B. With this control, a timeout and input completion can be detected appropriately even when voice input is not performed normally.
  • The count expiration values of timer A and timer B need not be fixed; different values may be used depending on the input situation.
  • For example, the value set in ST205 of FIG. 7 (when the gesture abstraction information is input first) may differ from the value set in ST211 of FIG. 7 (when the voice abstraction information is input first).
  • In ST205, a value is set based on the maximum time from the start of voice input until its abstraction information is acquired, in order to monitor the absence of voice input; in ST211, a value is set based on the maximum time from the start of gesture input until its abstraction information is output. This makes it possible to wait for input with a time suited to each input device.
  • Embodiment 2. Embodiment 1 described the case where the voice information abstraction unit 102, gesture information abstraction unit 104, operation processing unit 105, and timeout control unit 106 are provided in the same device. Next, a multimodal input device in which these functions are distributed over a plurality of devices is described.
  • FIG. 9 is a block diagram showing a configuration of a multimodal input device according to Embodiment 2 of the present invention.
  • The multimodal input device of this embodiment includes a terminal device 201 and a server device 202.
  • The voice acquisition unit 101, voice information abstraction unit 102b, gesture acquisition unit 103, gesture information abstraction unit 104b, and guidance unit 107b of the terminal device 201 correspond to the voice acquisition unit 101, voice information abstraction unit 102, gesture acquisition unit 103, gesture information abstraction unit 104, and guidance unit 107 of Embodiment 1 shown in FIG. 1.
  • However, the voice information abstraction unit 102b, gesture information abstraction unit 104b, and guidance unit 107b are connected not to the operation processing unit 105 but to the communication unit 203 of the terminal device 201; in the server device 202, the operation processing unit 105b is connected to the communication unit 204.
  • The communication unit 203 of the terminal device 201 and the communication unit 204 of the server device 202 are connected via a communication path such as a communication line.
  • The processing performed by the voice acquisition unit 101, voice information abstraction unit 102b, gesture acquisition unit 103, and gesture information abstraction unit 104b of the terminal device 201 is the same as that of the corresponding parts in Embodiment 1.
  • The voice information abstraction unit 102b outputs the utterance start information and the voice input abstraction information to the communication unit 203, and the gesture information abstraction unit 104b outputs the gesture input abstraction information to the communication unit 203.
  • The communication unit 203 of the terminal device 201 transmits the information input from the voice information abstraction unit 102b and the gesture information abstraction unit 104b to the server device 202.
  • The communication unit 204 of the server device 202 outputs the utterance start information, voice input abstraction information, and gesture input abstraction information received from the terminal device 201 to the operation processing unit 105b.
  • The processing performed by the operation processing unit 105b on receiving the utterance start information, voice input abstraction information, and gesture input abstraction information, and the corresponding processing of the timeout control unit 106, are the same as in Embodiment 1.
  • The operation processing unit 105b outputs the guidance voice output instruction, which in Embodiment 1 was output to the guidance unit 107, to the communication unit 204, which transmits the instruction to the terminal device 201.
  • The communication unit 203 of the terminal device 201 outputs the guidance voice output instruction received from the server device 202 to the guidance unit 107b, and the guidance unit 107b generates and outputs the guidance voice accordingly.
  • With this configuration, the terminal device 201 performs the voice recognition and gesture recognition, while the server device 202 determines and carries out the operation corresponding to the user's input and performs the timeout detection. Since the server device 202 communicates with a plurality of terminal devices 201 and can centrally manage the instructions given to a plurality of users, it can give appropriate instructions to each user even when multiple users work cooperatively through their terminal devices 201, improving work efficiency.
  • Moreover, because the functions are distributed between the terminal device 201 and the server device 202, the computational load of the terminal device 201 can be reduced. A sketch of the terminal-to-server notification flow follows.
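  • As a rough illustration of this split (the publication does not specify a wire format, so the JSON messages, endpoint, and field names below are assumptions), the terminal might forward its notifications to the server like this:

```python
import json
import socket

def send_notification(sock, kind, payload):
    """Send one newline-delimited JSON notification from terminal to server.

    kind: e.g. 'utterance_start', 'voice_abstraction', 'gesture_abstraction'.
    """
    message = json.dumps({"kind": kind, "payload": payload}) + "\n"
    sock.sendall(message.encode("utf-8"))

# Terminal side (hypothetical endpoint): after the voice information abstraction
# unit detects the start of an utterance, forward the utterance start information,
# and later the abstraction information, to the server's operation processing unit.
# sock = socket.create_connection(("server.example", 5000))
# send_notification(sock, "utterance_start", {"method": "voice"})
# send_notification(sock, "voice_abstraction",
#                   {"method": "voice", "abstraction": "control: switch"})
```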
  • As a modification, the recognition itself may also be moved to the server. In that case, when voice data is input from the voice acquisition unit 101c, the communication unit 203c of the terminal device 201c transmits the data to the communication unit 204c of the server device 202c, and when gesture data is input from the gesture acquisition unit 103c, it likewise transmits that data to the communication unit 204c.
  • The communication unit 204c of the server device 202c outputs the voice data received from the terminal device 201c to the voice information abstraction unit 102c, and the gesture data received from the terminal device 201c to the gesture information abstraction unit 104c.
  • The other operations are the same as those of the multimodal input device shown in FIG. 9.
  • In this modification, since the server device 202c performs the voice recognition, the gesture recognition, the determination and execution of the operation to be performed, and the timeout detection, the computational load of the terminal device 201c can be reduced further. In addition, by realizing the server device 202c with server hardware of high processing capability, high-accuracy voice recognition and gesture recognition using abundant computing power become possible, and work can be performed efficiently on the basis of that high recognition accuracy.
  • As another modification, the system may consist of a server device 202d including a voice information abstraction unit 102d, a gesture information abstraction unit 104d, and a communication unit 204d, and a terminal device 201d including a voice acquisition unit 101d, a gesture acquisition unit 103d, a communication unit 203d, an operation processing unit 105d, a timeout control unit 106, and a guidance unit 107.
  • In this case, the processing load of the terminal device 201d can be reduced by having the server device 202d perform the voice recognition and gesture recognition, which require computing power.
  • In this configuration, input start information such as the utterance start information may be output from the voice acquisition unit 101d and the gesture acquisition unit 103d to the operation processing unit 105d.
  • The distribution of functions is not limited to the modifications described above; the functions may be divided among the devices in other ways.
  • Embodiment 3. In Embodiments 1 and 2, the voice information abstraction unit 102 outputs the utterance start information to the operation processing unit 105 unconditionally when an utterance is detected. In this embodiment, the voice information abstraction unit 102 outputs the utterance start information to the operation processing unit 105 only when the following prescribed condition is satisfied.
  • The configuration of the multimodal input device of this embodiment is the same as that of FIG. 1 described in Embodiment 1.
  • The operation of the multimodal input device of this embodiment differs from Embodiment 1 only in the operation of the voice information abstraction unit 102 at the time of voice input.
  • When the voice information abstraction unit 102 starts receiving voice data from the voice acquisition unit 101, it measures the duration of the utterance after detecting the start of the user's utterance, and outputs the utterance start information to the operation processing unit 105 only when the measured duration exceeds a prescribed time (for example, 0.5 seconds). If the utterance ends before its duration reaches the prescribed time, the voice information abstraction unit 102 does not output the utterance start information.
  • In this way, the timeout control unit 106 is prevented from receiving utterance start information for short sounds that are not genuine voice input, and more accurate operation is possible.
  • Voice input has been described here as an example, but the same may be applied to other input methods. A sketch of this gating condition follows.
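  • A minimal sketch of this condition, assuming the 0.5-second threshold from the example above (the callback names are inventions for illustration):

```python
import time

PRESCRIBED_TIME = 0.5   # seconds; example value from the text

class UtteranceStartGate:
    """Toy gate: emit utterance start info only for sufficiently long utterances."""

    def __init__(self, operation_unit):
        self.op = operation_unit
        self.started_at = None
        self.notified = False

    def on_utterance_start(self):
        self.started_at = time.monotonic()
        self.notified = False

    def on_voice_data(self):
        """Called while voice data keeps arriving; notify once past the threshold."""
        if (self.started_at is not None and not self.notified
                and time.monotonic() - self.started_at >= PRESCRIBED_TIME):
            self.op.notify("utterance_start", {"method": "voice"})
            self.notified = True

    def on_utterance_end(self):
        # An utterance shorter than the prescribed time produces no start info.
        self.started_at = None
```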
  • Embodiment 4. Next, an embodiment is described in which the input method that outputs the input start information is determined dynamically.
  • The configuration of the multimodal input device of this embodiment is the same as that of FIG. 1 described in Embodiment 1.
  • The multimodal input device of this embodiment performs the following processing at startup.
  • At device startup, the voice information abstraction unit 102 estimates, for each voice input to be recognized, the time required to detect the utterance start, acquire the recognition result, and acquire the abstraction information, and outputs the longest of the estimated times to the operation processing unit 105, which serves as a required time determination unit.
  • Similarly, the gesture information abstraction unit 104 estimates, for each gesture input to be recognized, the time required to acquire the recognition result and the abstraction information, and outputs the longest estimated time to the operation processing unit 105.
  • The operation processing unit 105 compares the required time input from the voice information abstraction unit 102 with the required time input from the gesture information abstraction unit 104, and instructs the unit with the longer required time to output the input start information.
  • Voice input and gesture input have been described here as examples, but the same may be applied when other input methods are used.
  • As described above, by controlling, at startup, which input device outputs the input start information (only the device whose input operation requires the longest time), the multimodal input device can easily cope with changes in its input methods. A sketch of this startup selection follows.
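  • A minimal sketch of this startup selection, assuming each abstraction unit exposes an estimate of its required time and a switch for emitting input start information (both method names are assumptions):

```python
def configure_input_start_source(abstraction_units):
    """Pick which unit emits input start info (toy version of Embodiment 4).

    abstraction_units: dict mapping method name -> abstraction unit; each unit
    is assumed to expose estimate_required_time() and set_emit_start_info(flag).
    """
    estimates = {name: unit.estimate_required_time()
                 for name, unit in abstraction_units.items()}
    slowest = max(estimates, key=estimates.get)     # longest time to abstraction info
    for name, unit in abstraction_units.items():
        unit.set_emit_start_info(name == slowest)   # only the slowest emits start info
    return slowest
```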
  • As described above, according to the present invention, the time required to determine that an input operation has not been performed can be shortened, which is useful in systems using multimodal input.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

 The present invention is provided with: an input detection unit (111) for detecting that semantic information indicating the meaning of an input operation for each of a plurality of input methods differing in format has been acquired, and for detecting, with respect to an input method other than the input method for which acquisition of the semantic information was detected, that an input operation of that input method has started; and a monitoring processing unit (112) for monitoring, on the basis of the detection result that semantic information was acquired and the detection result that the input operation started, as detected by the input detection unit, non-execution of the input operation of the input method other than the input method for which acquisition of the semantic information was detected.
PCT/JP2014/000686 2014-02-10 2014-02-10 Multimodal input device, and timeout control method in a terminal device and in a multimodal input device WO2015118578A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/000686 WO2015118578A1 (fr) 2014-02-10 2014-02-10 Multimodal input device, and timeout control method in a terminal device and in a multimodal input device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/000686 WO2015118578A1 (fr) 2014-02-10 2014-02-10 Multimodal input device, and timeout control method in a terminal device and in a multimodal input device

Publications (1)

Publication Number Publication Date
WO2015118578A1 (fr) 2015-08-13

Family

ID=53777421

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/000686 WO2015118578A1 (fr) 2014-02-10 2014-02-10 Multimodal input device, and timeout control method in a terminal device and in a multimodal input device

Country Status (1)

Country Link
WO (1) WO2015118578A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0836480A (ja) * 1994-07-22 1996-02-06 Hitachi Ltd Information processing apparatus
WO2008149482A1 (fr) * 2007-06-05 2008-12-11 Mitsubishi Electric Corporation Control device for a vehicle
JP2011081541A (ja) * 2009-10-06 2011-04-21 Canon Inc Input device and control method thereof
JP2013064777A (ja) * 2011-09-15 2013-04-11 Ntt Docomo Inc Terminal device, speech recognition program, speech recognition method, and speech recognition system
JP2013257694A (ja) * 2012-06-12 2013-12-26 Kyocera Corp Device, method, and program

Similar Documents

Publication Publication Date Title
JP6230726B2 (ja) Speech recognition device and speech recognition method
US9824685B2 (en) Handsfree device with continuous keyword recognition
JP2011229159A (ja) Imaging control device and method for controlling an imaging device
RU2015137291A (ru) Method and device for controlling a smart home device
JP2017083713A (ja) Dialogue device, dialogue equipment, control method for a dialogue device, control program, and recording medium
WO2015118578A1 (fr) Multimodal input device, and timeout control method in a terminal device and in a multimodal input device
JP2020160431A (ja) Speech recognition device, speech recognition method, and program therefor
JP7133969B2 (ja) Voice input device and remote dialogue system
JP6673243B2 (ja) Speech recognition device
JP2011039222A (ja) Speech recognition system, speech recognition method, and speech recognition program
JP7215417B2 (ja) Information processing device, information processing method, and program
US10210886B2 (en) Voice segment detection system, voice starting end detection apparatus, and voice terminal end detection apparatus
JP6748565B2 (ja) Voice dialogue system and voice dialogue method
JP7449070B2 (ja) Voice input device, voice input method, and program therefor
JP4451166B2 (ja) Voice dialogue system
US11322145B2 (en) Voice processing device, meeting system, and voice processing method for preventing unintentional execution of command
JP7404568B1 (ja) Program, information processing device, and information processing method
US11308966B2 (en) Speech input device, speech input method, and recording medium
JP7303091B2 (ja) Control device, electronic apparatus, control method for a control device, and control program
KR102208496B1 (ko) Artificial-intelligence voice terminal device and voice service system providing services based on continuous voice commands
JP6633139B2 (ja) Information processing device, program, and information processing method
JPWO2018207483A1 (ja) Information processing device, electronic apparatus, control method, and control program
WO2021044569A1 (fr) Speech recognition support device and speech recognition support method
JP5229014B2 (ja) Communication control device, communication control system, communication control method, and program for a communication control device
JP2020106746A (ja) Control device, control method, control program, and dialogue device

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 14881672
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

NENP Non-entry into the national phase
Ref country code: JP

122 Ep: pct application non-entry in european phase
Ref document number: 14881672
Country of ref document: EP
Kind code of ref document: A1