US11615787B2 - Dialogue system and method of controlling the same - Google Patents
- Publication number
- US11615787B2
- Authority
- US
- United States
- Prior art keywords
- input
- sentence
- input sentence
- output
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
Definitions
- the present disclosure relates to a dialogue system and a controlling method of a dialogue system capable of providing a service corresponding to a user's speech.
- the dialogue system must provide an appropriate response corresponding to OOD speech, such as notifying that the system cannot provide the service requested by a user or providing an alternative service.
- OOD detection, which determines whether a user's speech corresponds to OOD speech, is a very important function.
- OOD detection is mainly based on rules or machine learning.
- in rule-based OOD detection, the accuracy of OOD detection is determined by how the rules are constructed. In order to obtain high accuracy, it takes a great deal of time and cost to establish the rules for OOD detection.
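As an illustration of the rule-based approach, a minimal sketch with hypothetical in-domain patterns (the regexes and domains are invented for illustration, not taken from the patent):

```python
import re

# Hypothetical in-domain patterns; any sentence matching none of them is OOD.
IN_DOMAIN_RULES = [
    re.compile(r"\b(navigate|guide)\b.*\bto\b"),    # navigation requests
    re.compile(r"\b(add|delete)\b.*\bschedule\b"),  # schedule management
]

def is_ood_rule_based(sentence):
    # Rule-based OOD detection: accuracy depends entirely on how well
    # the hand-written rules cover the in-domain sentences.
    s = sentence.lower()
    return not any(rule.search(s) for rule in IN_DOMAIN_RULES)
```

Every new in-domain sentence form requires another hand-written rule, which is where the time and cost mentioned above come from.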
- machine learning-based OOD detection requires time and cost for additional data construction, because sentences corresponding to OOD speech must be separately collected for learning.
- a dialogue system includes a processor configured to: generate a meaning representation corresponding to an input sentence by performing Natural Language Understanding on the input sentence, generate an output sentence corresponding to the input meaning representation based on a Recurrent Neural Network (RNN), determine whether the input sentence cannot be processed using the natural language generator, calculate a parameter representing a probability of outputting the input sentence when the meaning representation corresponding to the input sentence is input to the natural language generator, and determine whether the input sentence cannot be processed based on the calculated parameter.
- the natural language generator may include a plurality of cells outputting words corresponding to the input meaning representation.
- Each of the plurality of cells may generate a probability distribution for a plurality of pre-stored words, and input a word included in the input sentence among the plurality of words into the next cell in response to the input.
- the determiner may calculate the parameter based on an output probability of a word included in the input sentence from the generated probability distribution.
- Each of the plurality of cells may generate a ranking distribution for a plurality of pre-stored words, and input a word included in the input sentence among the plurality of words into the next cell in response to an input.
- the determiner may calculate the parameter based on the output ranking of words included in the input sentence in the generated ranking distribution.
- the determiner may determine that the input sentence is a sentence that cannot be processed when the parameter is less than a reference value.
- the input sentence that cannot be processed may be an Out-of-domain (OOD) sentence.
- Each of the plurality of cells may generate a probability distribution or rank distribution for a plurality of pre-stored words, and input the word having the highest output probability or output priority among the plurality of words into the next cell in response to input when the input sentence is a sentence that can be processed.
- the natural language generator may generate an output sentence consisting of words output from each of the plurality of cells.
- a controlling method of a dialogue system may comprise generating a meaning representation corresponding to an input sentence by performing Natural Language Understanding on the input sentence; and determining whether the input sentence cannot be processed using a natural language generator which generates an output sentence corresponding to the input meaning representation based on a Recurrent Neural Network (RNN), wherein the determining whether the input sentence cannot be processed may include calculating a parameter representing a probability of outputting the input sentence when the meaning representation corresponding to the input sentence is input to the natural language generator, and determining whether the input sentence cannot be processed based on the calculated parameter.
- the natural language generator may include a plurality of cells outputting words corresponding to the input meaning representation.
- Determining whether the input sentence is a sentence that cannot be processed may include generating a probability distribution for a plurality of pre-stored words, and inputting a word included in the input sentence among the plurality of words into the next cell in response to the input.
- Determining whether the input sentence is a sentence that cannot be processed may include calculating the parameter based on an output probability of a word included in the input sentence from the generated probability distribution.
- Determining whether the input sentence is a sentence that cannot be processed may include generating a ranking distribution for a plurality of pre-stored words in response to the input by each of the plurality of cells, and inputting a word included in the input sentence among the plurality of words into the next cell.
- Determining whether the input sentence is a sentence that cannot be processed may include calculating the parameter based on the output ranking of words included in the input sentence in the generated ranking distribution.
- Determining whether the input sentence is a sentence that cannot be processed may include determining that the input sentence is a sentence that cannot be processed when the parameter is less than a reference value.
- the input sentence that cannot be processed may be an Out-of-domain (OOD) sentence.
- the method may further include generating a probability distribution or rank distribution for a plurality of pre-stored words by each of the plurality of cells, and inputting the word having the highest output probability or output priority among the plurality of words into the next cell in response to input when the input sentence is a sentence that can be processed.
- the method may further include generating an output sentence consisting of words output from each of the plurality of cells with the highest output probability or the highest output rank when the input sentence is a sentence that can be processed.
- FIG. 1 is a control block diagram of a dialogue system according to an embodiment of the present disclosure.
- FIG. 2 is a control block diagram of a dialogue system further including a speech recognizer according to an embodiment of the present disclosure.
- FIG. 3 is a control block diagram of a dialogue system further including a communicator according to an embodiment of the present disclosure.
- FIG. 4 is a structural diagram illustrating a natural language generation algorithm performed in a natural language generator of a dialogue system and an example of applying the algorithm according to an embodiment of the present disclosure.
- FIGS. 5 and 6 are diagrams illustrating a process in which the dialogue system according to an embodiment performs OOD detection using a natural language generator of the present disclosure.
- FIG. 7 is a control block diagram of a dialogue system further including a result processor according to an embodiment of the present disclosure.
- FIG. 8 is a flowchart of a method for controlling a dialogue system according to an embodiment of the present disclosure.
- FIG. 9 is a flowchart illustrating a process of determining whether an input sentence is a sentence that cannot be processed in a method of controlling a dialogue system according to an exemplary embodiment of the present disclosure.
- terms such as “~part”, “~group”, “~block”, “~member”, and “~module” may refer to a unit for processing at least one function or operation.
- the terms may refer to at least one process processed by at least one piece of hardware, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), at least one piece of software stored in a memory, or a processor.
- ordinal numbers such as “first” and “second” used before the components described herein are merely used to distinguish the components from each other.
- the ordinal numbers used before the components are not used to specify the order of connection between these components and the order of use thereof.
- the ordinal numbers do not have a different meaning, such as priority.
- the disclosed embodiments may be implemented in the form of a recording medium for storing instructions executable by a computer. Instructions may be stored in the form of program code and, when executed by a processor, may generate a program module to perform the operations of the disclosed embodiments.
- the recording medium may be implemented as a computer-readable recording medium.
- Computer-readable recording media may include all kinds of recording media having stored thereon instructions which can be read by a computer, such as read-only memory (ROM), random-access memory (RAM), a magnetic tape, a magnetic disk, flash memory, an optical data storage device, and the like.
- the dialogue system is a device that analyzes a user's speech to understand the user's intention and provides a service suitable for the user's intention.
- the dialogue system can make the user feel as if they are talking with the dialogue system by outputting the system response to provide a service suitable for the user's intention.
- the system response may include an answer to a user's question, a question to confirm the user's intention, and a guide for a service to be provided.
- FIG. 1 is a control block diagram of a dialogue system according to an embodiment of the present disclosure.
- the dialogue system 100 includes a Natural language interpreter 110 which generates a meaning representation corresponding to the input sentence by performing natural language understanding on the input sentence, a natural language generator 120 that generates an output sentence corresponding to the input meaning representation based on a recurrent neural network (RNN), and a determiner 130 that determines whether the input sentence is a sentence that cannot be processed using the natural language generator 120 .
- the dialogue system 100 may include at least one memory storing a program that performs an operation described later and related data, and at least one processor that executes the stored program.
- the natural language interpreter 110 , the natural language generator 120 , and the determiner 130 may each use separate memories and processors, or some or all of them may share a memory and a processor.
- the dialogue system 100 may be implemented as a server, and in this case, the components 110 , 120 , and 130 of the dialogue system 100 may be provided in the server. However, some of the components 110 , 120 , and 130 of the dialogue system 100 may be provided in a user terminal that connects the user and the dialogue system 100 .
- the user terminal when the user terminal is a vehicle, some of the components of the dialogue system 100 may be provided in the vehicle, and when the user terminal is a mobile device such as a smartphone, an AI speaker, or a PC, some of the components of the dialogue system 100 may be provided in the mobile device, the AI speaker, or the PC.
- the user may download a program that performs some functions among the components of the dialogue system 100 to the user terminal and use it.
- the input sentence inputted to the natural language interpreter 110 is text converted from a user speech inputted into a microphone, and may be provided from a user terminal or from a speech recognizer provided in the dialogue system 100 .
- FIG. 2 is a control block diagram of a dialogue system further including a speech recognizer according to an embodiment of the present disclosure.
- FIG. 3 is a control block diagram of a dialogue system further including a communicator according to an embodiment of the present disclosure.
- the dialogue system 100 may further include a speech recognizer 140 for converting a user speech into text.
- the speech recognizer 140 may convert a user speech transmitted from a user terminal into text by applying a speech recognition algorithm or a Speech To Text (STT) algorithm.
- the speech recognizer 140 may extract a feature vector of a speech signal corresponding to a user speech by applying Feature vector extraction techniques such as Cepstrum, Linear Predictive Coefficient (LPC), Mel Frequency Cepstral Coefficient (MFCC), or Filter Bank Energy.
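As a rough illustration of filter-bank-style feature extraction (a simplified stand-in for the Filter Bank Energy technique named above; the frame sizes, hop length, and filter count below are arbitrary choices, not values from the patent):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_filter_bank_energies(signal, sample_rate=16000, frame_len=400,
                             hop=160, n_filters=8, n_fft=512):
    # Split the signal into overlapping Hamming-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[j - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # One n_filters-dimensional feature vector per frame.
    return np.log(power @ fbank.T + 1e-10)

# One second of a 440 Hz tone as a stand-in for recorded speech.
features = log_filter_bank_energies(
    np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

Each row of `features` is the feature vector for one frame; a real recognizer would compare such vectors against trained reference patterns.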
- a recognition result can be obtained by comparing the extracted feature vector with the trained reference pattern.
- an acoustic model for modeling and comparing signal characteristics of speech and a language model for modeling a linguistic order relationship such as words or syllables corresponding to a recognized vocabulary may be used.
- the speech recognizer 140 can use any of the known speech recognition techniques to convert the user's speech into text.
- the speech recognizer 140 converts the user's speech to text and inputs it into the natural language interpreter 110 .
- a user speech converted to text will be referred to as an input sentence.
- a microphone into which a user's speech is input and a speaker that outputs a system response may be provided in a user terminal such as a vehicle, a mobile device, or a PC, and the user terminal may be connected to the dialogue system 100 through wireless communication.
- the dialogue system 100 may further include a communicator 150 capable of exchanging data with a user terminal through wireless communication.
- the user speech input through the microphone may be transmitted to the communicator 150 of the dialogue system 100 .
- a speech recognizer that recognizes a user's speech and converts it into text may be provided in the user terminal.
- the communicator 150 may receive an input sentence from the user terminal, and the received input sentence may be input to the natural language interpreter 110 .
- the natural language interpreter 110 can analyze the input sentence to understand the user intention included in the user's speech. To this end, the natural language interpreter 110 may apply machine learning or deep learning-based natural language understanding to input sentences.
- the natural language interpreter 110 converts the input string into a morpheme sequence by performing morpheme analysis on the user's speech in text form.
- the natural language interpreter 110 may recognize the entity name from the user's speech.
- the entity name is a proper noun such as a person's name, place name, organization name, time, date, currency, etc.
- entity name recognition is the task of identifying the entity name in a sentence and determining the type of the identified entity name. The meaning of the sentence can be grasped by extracting important keywords from the sentence through entity name recognition.
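In its simplest form, entity name recognition as described above can be sketched as a gazetteer lookup; the gazetteer entries here are hypothetical, and a real system would use a learned model rather than exact string matching:

```python
# Hypothetical gazetteer mapping surface strings to entity-name types.
GAZETTEER = {
    "Seoul Station": "place name",
    "Hong Gildong": "person name",
    "next week": "time",
}

def recognize_entities(sentence):
    # Identify entity names in the sentence and determine their types
    # by dictionary lookup (a minimal stand-in for learned NER).
    return [(surface, etype) for surface, etype in GAZETTEER.items()
            if surface in sentence]
```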
- the natural language interpreter 110 can analyze the speech behavior of the user's speech.
- Speech act analysis is the task of analyzing the intention of the user's speech, and is to grasp the intention of the speech such as whether the user is asking a question, making a request, responding, or expressing a simple emotion.
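A crude keyword-based sketch of such speech act analysis (the keyword lists are invented and far simpler than the machine-learned analysis a real interpreter would use):

```python
def classify_speech_act(sentence):
    # Classify the utterance as a question, request, response,
    # or simple emotion expression (illustrative heuristics only).
    s = sentence.lower().strip()
    if s.endswith("?") or s.startswith(("what", "where", "when", "who", "how")):
        return "question"
    if s.startswith(("please", "add", "delete", "guide")):
        return "request"
    if s in ("yes", "no", "okay"):
        return "response"
    return "emotion"
```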
- the natural language interpreter 110 may generate a meaning representation used to generate a system response corresponding to a user intention or to provide a service corresponding to the user intention, based on the analysis result of the input sentence.
- the meaning representation in conversation processing may be a result of understanding natural language or may be an input of natural language generation.
- the natural language interpreter 110 may analyze the user's speech to generate a meaning representation that expresses the user's intention, and may generate a meaning representation corresponding to the next system response in consideration of the conversation flow and situation.
- the term dialogue act may be used instead of meaning representation.
- the meaning representation may include information, such as a speech act, a data type, and a data value corresponding thereto for generating a system response corresponding to the user's intention.
- the meaning representation may be a set of various meaning representation tags.
- the meaning representation of the natural language sentence “Please guide me to Seoul Station” includes a speech act tag called “request”, a data type tag called “navigation”, and a data value tag of “Seoul Station” corresponding to the data type tag.
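That tag set can be pictured as a simple mapping; the key names below are illustrative, not the patent's actual tag syntax:

```python
# Hypothetical tag structure for "Please guide me to Seoul Station".
meaning_representation = {
    "speech_act": "request",        # what the user is doing (asking for action)
    "data_type": "navigation",      # which service domain is addressed
    "data_value": "Seoul Station",  # the argument attached to the data type
}
```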
- the natural language generator 120 may generate a sentence (hereinafter, referred to as an output sentence) to be output as a system response based on the meaning representation output from the natural language interpreter 110 , and the generated sentence may be synthesized as a voice signal by a text to speech (TTS) engine provided in the result processor 160 (see FIG. 7 ) and output through a speaker provided in the user terminal.
- FIG. 4 is a structural diagram illustrating a natural language generation algorithm performed in a natural language generator of a dialogue system and an example of applying the algorithm according to an embodiment of the present disclosure.
- the natural language generator 120 may generate an output sentence corresponding to a meaning representation input from the natural language interpreter 110 .
- the natural language generator 120 may generate an output sentence based on a deep learning technique using a deep neural network.
- Deep neural networks used for natural language generation may include at least one of Recurrent Neural Network (RNN), Bi-directional RNN (BRNN), Long Short Term Memory (LSTM), Bi-directional LSTM (BLSTM), Gated Recurrent Unit (GRU), and Bi-directional GRU (BGRU).
- RNN is characterized by using the hidden state of the past time step and the system input value of the current time step to calculate the new hidden state. For example, in the generation of natural language using an RNN, an output value outputted in response to a specific input value can be used as an input value in the next time step, and the input value and output value at this time can include words constituting a sentence. By repeating this process, words constituting the output sentence can be generated in a certain order.
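The recurrence described above can be sketched in NumPy; the dimensions and random weights below are placeholders for a trained model:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    # New hidden state from the current input and the past hidden state.
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

# Toy sizes (hypothetical): 4-dim word vectors, 3-dim hidden state.
rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3)) * 0.1
Wh = rng.normal(size=(3, 3)) * 0.1
b = np.zeros(3)

# The meaning representation would seed this initial hidden state.
h = np.zeros(3)
for _ in range(5):                # output words are fed back in as inputs
    x = rng.normal(size=4)        # stand-in for the current word's embedding
    h = rnn_step(x, h, Wx, Wh, b)
```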
- the meaning representation which is an analysis result of the natural language interpreter 110 , is input as an initial hidden layer of the RNN-based natural language generator 120 , and the natural language generator 120 may include a plurality of cells 121 that output words corresponding to the input meaning representation.
- the plurality of cells 121 may include at least one of RNN cells such as Long-Short Term Memory (LSTM) and Gated Recurrent Unit (GRU).
- Each of the plurality of cells 121 may generate a probability distribution for a plurality of pre-stored words in response to an input.
- the plurality of pre-stored words may be words registered in the dictionary of the dialogue system 100 , and the probability distribution for the plurality of words represents the probability that each word is output from the corresponding cell, that is, the probability that each word matches the input of that cell.
- the probability for each word in the probability distribution for a plurality of words will be referred to as an output probability.
- the word having the highest output probability becomes the output of the cell 121 of the current time step, and the output of the cell 121 of the current time step becomes the input of the cell 121 of the next time step.
- each of the plurality of cells 121 may input a word output from a cell in a previous time step, and a word outputted by itself may be an input to a cell in the next time step.
- a Begin of Sentence (BOS) token indicating the start of a sentence may be an input of the first cell 121-1, and an output of the first cell 121-1 may be “there”. That is, among a plurality of words stored in the dictionary of the dialogue system 100 , the word with the highest output probability in the first cell 121-1 is “there”.
- the second cell 121 - 2 may output “is”. That is, among a plurality of words stored in the dictionary of the dialogue system 100 , the word with the highest output probability in the second cell 121 - 2 is “is”.
- the third cell 121 - 3 may output “a”. That is, among a plurality of words stored in the dictionary of the dialogue system 100 , the word with the highest output probability in the third cell 121 - 3 is “a”.
- “a”, which is an output of the third cell 121-3, becomes an input of the fourth cell 121-4, and the fourth cell 121-4 may output “nice”. That is, among a plurality of words stored in the dictionary of the dialogue system 100 , the word with the highest output probability in the fourth cell 121-4 is “nice”.
- “TGI”, which is an output of the seventh cell 121-7, becomes an input of the eighth cell 121-8, and the eighth cell 121-8 may output an End of Sentence (EOS) token. That is, EOS has the highest output probability in the eighth cell 121-8.
- the output sentence of the natural language generator 120 corresponding to the meaning representation generated by the natural language interpreter 110 becomes “there is a nice restaurant named TGI”.
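The greedy walk through cells 121-1 to 121-8 can be mimicked with a toy next-word table; the probabilities are invented, since a real generator computes them from the RNN hidden state:

```python
# Toy next-word distributions: previous word -> {candidate word: probability}.
NEXT_WORD_PROBS = {
    "<BOS>":      {"there": 0.7, "a": 0.3},
    "there":      {"is": 0.8, "are": 0.2},
    "is":         {"a": 0.9, "the": 0.1},
    "a":          {"nice": 0.6, "good": 0.4},
    "nice":       {"restaurant": 0.9, "place": 0.1},
    "restaurant": {"named": 0.7, "called": 0.3},
    "named":      {"TGI": 0.8, "Joe's": 0.2},
    "TGI":        {"<EOS>": 0.95, ".": 0.05},
}

def greedy_decode(probs, max_len=10):
    # Each cell emits its highest-probability word, which becomes the
    # input of the next cell, until <EOS> is produced.
    word, sentence = "<BOS>", []
    for _ in range(max_len):
        word = max(probs[word], key=probs[word].get)  # argmax over distribution
        if word == "<EOS>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(greedy_decode(NEXT_WORD_PROBS))  # there is a nice restaurant named TGI
```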
- the dialogue system 100 may not only use the natural language generator 120 to generate an output sentence, but also use it to determine whether an input sentence is a sentence that cannot be processed by the dialogue system 100 . That is, the dialogue system 100 according to an embodiment of the present disclosure may use the natural language generator 120 to detect Out of Domain (OOD).
- when the input sentence corresponding to the user's speech is a request for a service that is not supported by the dialogue system 100 , or the input sentence itself is meaningless, the input sentence may be considered to correspond to a sentence that cannot be processed by the dialogue system 100 .
- the determiner 130 of the dialogue system 100 calculates a parameter indicating a probability of outputting an input sentence when a meaning representation corresponding to an input sentence is input to the natural language generator 120 , and determines whether an input sentence is a sentence that cannot be processed by the dialogue system 100 , that is, whether the input sentence is an OOD sentence based on the calculated parameter.
- FIGS. 5 and 6 are diagrams illustrating a process in which the dialogue system according to an embodiment of the present disclosure performs OOD detection using a natural language generator.
- the natural language interpreter 110 may analyze the input sentence and generate a meaning representation [act: request/type: add_schedule/date: next week] as illustrated in FIG. 5 .
- Each of the plurality of cells 121 constituting the RNN-based natural language generator 120 generates a probability distribution for a plurality of pre-stored words in response to an input, and inputs a word included in the input sentence among the plurality of words into the next cell.
- the determiner 130 may search for a word included in an input sentence from the generated probability distribution. Specifically, the determiner 130 searches for a word corresponding to the current time step among words included in the input sentence, and calculates a parameter indicating a probability of outputting an input sentence based on the output probability of the searched word.
- the meaning representation generated by the natural language interpreter 110 is input into the initial hidden layer of the natural language generator 120 , and when a BOS for starting a sentence is input to the first cell 121 - 1 , the first cell 121 - 1 may generate a probability distribution for a plurality of pre-stored words in response to the input.
- the determiner 130 may search the generated probability distribution for the word corresponding to the cell of the current time step, that is, the first word “add” of the input sentence corresponding to the first cell 121-1, and may store the output probability of the searched “add”.
- the first word of the input sentence “add” can be input into the second cell 121 - 2 , and the second cell 121 - 2 generates a probability distribution for a plurality of words in response to the input.
- the determiner 130 may search for the second word “schedule” among words included in the input sentence from the generated probability distribution, and may store the output probability of the searched “schedule”.
- the second word of the input sentence “schedule” may be input to the third cell 121 - 3 , and the third cell 121 - 3 may generate a probability distribution for a plurality of words in response to the input.
- the determiner 130 may search for “next week”, which is a third word among words included in the input sentence, from the generated probability distribution, and store the output probability of the searched “next week”.
- the third word “next week” of the input sentence may be input into the fourth cell 121-4, and the fourth cell 121-4 may generate a probability distribution for a plurality of words in response to the input.
- the determiner 130 may search for EOS for ending a sentence from the generated probability distribution, and may store the output probability of the searched EOS.
- the determiner 130 may determine whether the input sentence is a sentence that cannot be processed based on the stored output probability. For example, an average value of the output probability of each word constituting the input sentence may be calculated as a parameter indicating the probability of outputting the input sentence.
- the determiner 130 may determine that the input sentence is an in-domain sentence, that is, a sentence that can be processed by the dialogue system 100 , if the average value of the output probability of each word constituting the input sentence is equal to or greater than a predetermined reference value. Conversely, if the average value is less than the predetermined reference value, the determiner 130 may determine that the input sentence is an OOD sentence, that is, a sentence that cannot be processed by the dialogue system 100 .
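The average-and-threshold decision above can be sketched as follows; the reference value 0.5 and the probability values are illustrative assumptions, since the patent only speaks of a "predetermined reference value":

```python
import numpy as np

def is_in_domain(word_probs, reference=0.5):
    # word_probs: for each word of the input sentence (plus EOS), the output
    # probability the corresponding cell assigned to it, as stored by the
    # determiner 130. Average them and compare against the reference value.
    return float(np.mean(word_probs)) >= reference

# Hypothetical stored probabilities for "add", "schedule", "next week", EOS:
print(is_in_domain([0.81, 0.74, 0.69, 0.77]))  # True  -> in-domain
print(is_in_domain([0.10, 0.05, 0.02, 0.08]))  # False -> OOD
```

The intuition: a generator trained only on in-domain sentences assigns high probability to in-domain word sequences and low probability to everything else.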
- the determiner 130 may calculate a parameter indicating a probability of outputting an input sentence based on an output ranking of words included in the input sentence in the ranking distribution for a plurality of words. The determiner 130 may determine that the input sentence is a sentence that can be processed by the dialogue system 100 if the calculated parameter is equal to or greater than a predetermined reference value, and may determine that the input sentence is a sentence that cannot be processed by the dialogue system 100 if the calculated parameter is less than the predetermined reference value.
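The patent does not give a formula for the ranking-based parameter, so the sketch below assumes a mean reciprocal rank, which grows as the input sentence's words rank higher in each time step's ranking distribution; the reference value 0.5 is likewise illustrative:

```python
def rank_parameter(ranks):
    # ranks: for each word of the input sentence, its output ranking in the
    # corresponding time step's ranking distribution (rank 1 = most probable).
    # Mean reciprocal rank is one plausible parameter, not the patent's formula.
    return sum(1.0 / r for r in ranks) / len(ranks)

def is_processable_by_rank(ranks, reference=0.5):
    return rank_parameter(ranks) >= reference

print(is_processable_by_rank([1, 2, 1, 1]))        # True  -> in-domain
print(is_processable_by_rank([50, 120, 300, 80]))  # False -> OOD
```

Using rankings rather than raw probabilities makes the decision less sensitive to how sharply the softmax distributes its mass.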
- the natural language generator 120 is a model generated by learning in-domain sentences. Accordingly, as described above, when the natural language generator 120 is used, OOD detection can be performed using the learning result for in-domain sentences, without building a separate learning database of collected OOD sentences.
- FIG. 7 is a control block diagram of a dialogue system further including a result processor according to an embodiment of the present disclosure.
- the dialogue system 100 may further include a result processor 160 that processes the natural language understanding result of the natural language interpreter 110 , based further on the natural language generation result of the natural language generator 120 and the OOD detection result of the determiner 130 .
- the natural language generator 120 may generate an output sentence as a system response to a user speech.
- the meaning representation corresponding to the input sentence is input to the natural language generator 120 as in the case of OOD detection; however, the word passed from the cell of the current time step to the cell of the next time step is not a word included in the input sentence, but the word having the highest output probability or output ranking in the probability distribution or ranking distribution.
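The difference between the two feeding modes can be sketched as below; `next_step_input` and the small vocabulary are illustrative names, not from the patent:

```python
import numpy as np

VOCAB = ["add", "schedule", "next week", "EOS"]

def next_step_input(dist, gold_word=None):
    # For response generation, the word with the highest output probability in
    # the distribution is fed to the next cell; for OOD detection, the input
    # sentence's own word for the current time step (gold_word) is fed instead.
    if gold_word is None:                     # generation mode
        return VOCAB[int(np.argmax(dist))]
    return gold_word                          # OOD-detection (scoring) mode

dist = np.array([0.1, 0.6, 0.2, 0.1])
print(next_step_input(dist))                   # schedule
print(next_step_input(dist, gold_word="add"))  # add
```

In other words, generation decodes freely from the model, while OOD detection uses the same model only to score the fixed input sentence.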
- the text-to-speech (TTS) engine of the result processor 160 converts the output sentence in text form into a voice signal and transmits it to the user terminal through the communicator 150 .
- the transmitted voice signal may be output through a speaker provided in the user terminal.
- the result processor 160 may generate a control signal or a request signal for performing a function corresponding to the user's intention analyzed by the natural language interpreter 110 .
- a control signal for performing a control corresponding to the user's intention may be generated.
- the generated control signal may be transmitted to a home appliance or vehicle through the communicator 150 .
- when information is required, a request signal requesting the information from an external server that provides it may be generated.
- the generated request signal may be transmitted to an external server through the communicator 150 .
- the result processor 160 may output a guide indicating that the user's speech cannot be processed.
- an output sentence for guidance may be generated by the natural language generator 120 or imported from a sentence database stored in the result processor 160 .
- the output sentence for guidance may be converted into a voice signal in the TTS engine, and the converted voice signal may be transmitted to the user terminal through the communicator 150 .
- FIG. 8 is a flowchart of a method for controlling a dialogue system according to an embodiment of the present disclosure.
- a meaning representation corresponding to an input sentence is generated ( 310 ).
- the input sentence is a result of converting the user's speech into text by performing speech recognition. When the speech recognizer 140 is provided in the dialogue system 100 , the input sentence output from the speech recognizer 140 may be input to the natural language interpreter 110 . When the speech recognizer 140 is not provided in the dialogue system 100 but is provided in the user terminal, the communicator 150 of the dialogue system 100 may receive the input sentence transmitted from the user terminal.
- the meaning representation is a result of the natural language interpreter 110 analyzing the input sentence by performing natural language understanding, and may be a set of various meaning representation tags.
- the description of the meaning representation is as described above.
- the natural language generator 120 of the dialogue system 100 may generate an output sentence corresponding to the meaning representation using an RNN that has learned in-domain sentences.
- OOD detection, which is a determination of whether the input sentence can be processed, may be performed. This is described in detail below.
- a guidance sentence indicating that the user's speech cannot be processed may be generated ( 340 ).
- Guidance sentences may be generated by the natural language generator 120 or may be imported from the sentence database stored in the result processor 160 .
- the guidance sentence is converted into a voice signal in the TTS engine ( 350 ), and the converted voice signal may be transmitted to the user terminal through the communicator 150 .
- an output sentence corresponding to the meaning representation is generated ( 360 ) using the RNN-based natural language generator, converted into a voice signal ( 350 ), and then transmitted to the user terminal through the communicator 150 .
- when the meaning representation corresponding to the input sentence is input to the natural language generator 120 to generate an output sentence, the word passed from the cell of the current time step to the cell of the next time step is not a word included in the input sentence, but the word having the highest output probability or output ranking in the probability distribution or ranking distribution.
- when the input sentence is a processable sentence (No in 330 ), a control signal for performing a control corresponding to the user's intention may be generated.
- the generated control signal may be transmitted to a home appliance or vehicle through the communicator 150 .
- when information is required, a request signal requesting the information from an external server that provides it may be generated.
- the generated request signal may be transmitted to an external server through the communicator 150 .
- FIG. 9 is a flowchart illustrating a process of determining whether an input sentence is a sentence that cannot be processed in a method of controlling a dialogue system according to an exemplary embodiment of the present disclosure.
- the RNN-based natural language generator 120 includes a plurality of cells 121 according to the structures of FIGS. 4 to 6 described above.
- the meaning representation generated through natural language understanding is input to the initial hidden layer of the natural language generator 120 ( 321 ).
- the cell of the current time step generates a probability distribution for a plurality of pre-stored words in response to an input ( 322 ), and stores an output probability of a word included in the input sentence among the plurality of words ( 323 ).
- the word in which the output probability is stored is a word corresponding to the current time step among words included in the input sentence.
- a word included in the input sentence is input into the next cell ( 324 ).
- when an output sentence is generated, the word with the highest output probability is input to the cell of the next time step; when OOD detection is performed, the word of the input sentence corresponding to the current time step, among the words included in the input sentence, may be input to the cell of the next time step.
- a parameter is calculated based on the stored output probability ( 326 ). For example, an average value of the output probability of each word constituting the input sentence may be calculated as a parameter indicating the probability of outputting the input sentence.
- whether the input sentence is a sentence that cannot be processed is determined based on the calculated parameter ( 327 ). For example, if the average value of the output probability of each word constituting the input sentence is equal to or greater than a predetermined reference value, it may be determined that the input sentence is an in-domain sentence, that is, a sentence that can be processed by the dialogue system 100 . Conversely, if the average value is less than the predetermined reference value, it may be determined that the input sentence is an OOD sentence, that is, a sentence that cannot be processed by the dialogue system 100 .
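Steps 321 to 327 can be sketched end to end as follows; `step_fn`, `confident_step`, and the tiny vocabulary are assumed interfaces standing in for the RNN cells, and the 0.5 reference value is illustrative:

```python
import numpy as np

VOCAB = ["add", "schedule", "next week", "EOS"]

def ood_detect(step_fn, hidden, input_words, reference=0.5):
    # step_fn(word, hidden) stands in for one RNN cell and returns a
    # (distribution over VOCAB, next hidden state) pair; the meaning
    # representation is assumed encoded in the initial hidden state (321).
    probs, word = [], "BOS"
    for gold in input_words + ["EOS"]:
        dist, hidden = step_fn(word, hidden)   # (322) generate distribution
        probs.append(dist[VOCAB.index(gold)])  # (323) store gold word's probability
        word = gold                            # (324) feed the input word forward
    parameter = float(np.mean(probs))          # (326) average as the parameter
    return parameter < reference               # (327) True -> cannot be processed

# A cell that confidently predicts the scripted sentence (hidden = time index):
def confident_step(word, t):
    dist = np.full(len(VOCAB), 0.02)
    dist[t] = 0.94
    return dist, t + 1

print(ood_detect(confident_step, 0, ["add", "schedule", "next week"]))  # False
```

A model that tracks the sentence closely yields a high average probability and a `False` (processable) verdict; a model that never expected those words yields `True` (OOD).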
- each of the plurality of cells 121 may, in response to its input, generate a ranking distribution for a plurality of pre-stored words, and a word included in the input sentence among the plurality of words, particularly the word corresponding to the current time step, may be input into the next cell.
- the determiner 130 may calculate a parameter indicating a probability of outputting an input sentence based on an output ranking of words included in the input sentence in the ranking distribution for a plurality of words.
- the determiner 130 may determine that the input sentence is a sentence that can be processed by the dialogue system 100 if the calculated parameter is more than a predetermined reference value, and may determine that the input sentence is a sentence that cannot be processed by the dialogue system 100 if it is less than the predetermined reference value.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
- Databases & Information Systems (AREA)
Abstract
Description
Claims (14)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR10-2020-0048281 | 2020-04-21 | ||
| KR1020200048281A KR20210130024A (en) | 2020-04-21 | 2020-04-21 | Dialogue system and method of controlling the same |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20210327415A1 US20210327415A1 (en) | 2021-10-21 |
| US11615787B2 true US11615787B2 (en) | 2023-03-28 |
Family
ID=77919598
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/116,082 Active 2041-04-23 US11615787B2 (en) | 2020-04-21 | 2020-12-09 | Dialogue system and method of controlling the same |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US11615787B2 (en) |
| KR (1) | KR20210130024A (en) |
| DE (1) | DE102020215954A1 (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP6799297B1 (en) * | 2019-10-23 | 2020-12-16 | ソプラ株式会社 | Information output device, information output method, and information output program |
| DE212021000356U1 (en) | 2020-04-07 | 2023-01-03 | Cascade Reading, Inc. | Generating graded text formatting for electronic documents and advertisements |
| US11170154B1 (en) | 2021-04-09 | 2021-11-09 | Cascade Reading, Inc. | Linguistically-driven automated text formatting |
| KR102688562B1 (en) * | 2021-06-22 | 2024-07-25 | 국립공주대학교 산학협력단 | Method, Computing Device and Computer-readable Medium for Classification of Encrypted Data Using Neural Network |
| US12254870B2 (en) * | 2021-10-06 | 2025-03-18 | Cascade Reading, Inc. | Acoustic-based linguistically-driven automated text formatting |
| US12315495B2 (en) | 2021-12-17 | 2025-05-27 | Snap Inc. | Speech to entity |
| US12361934B2 (en) * | 2022-07-14 | 2025-07-15 | Snap Inc. | Boosting words in automated speech recognition |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170140761A1 (en) * | 2013-08-01 | 2017-05-18 | Amazon Technologies, Inc. | Automatic speaker identification using speech recognition features |
| US9875081B2 (en) * | 2015-09-21 | 2018-01-23 | Amazon Technologies, Inc. | Device selection for providing a response |
| US20200125603A1 (en) * | 2018-10-23 | 2020-04-23 | Samsung Electronics Co., Ltd. | Electronic device and system which provides service based on voice recognition |
| US20200279279A1 (en) * | 2017-11-13 | 2020-09-03 | Aloke Chaudhuri | System and method for human emotion and identity detection |
2020
- 2020-04-21 KR KR1020200048281A patent/KR20210130024A/en active Pending
- 2020-12-09 US US17/116,082 patent/US11615787B2/en active Active
- 2020-12-15 DE DE102020215954.8A patent/DE102020215954A1/en active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170140761A1 (en) * | 2013-08-01 | 2017-05-18 | Amazon Technologies, Inc. | Automatic speaker identification using speech recognition features |
| US9875081B2 (en) * | 2015-09-21 | 2018-01-23 | Amazon Technologies, Inc. | Device selection for providing a response |
| US20180210703A1 (en) * | 2015-09-21 | 2018-07-26 | Amazon Technologies, Inc. | Device Selection for Providing a Response |
| US20200279279A1 (en) * | 2017-11-13 | 2020-09-03 | Aloke Chaudhuri | System and method for human emotion and identity detection |
| US20200125603A1 (en) * | 2018-10-23 | 2020-04-23 | Samsung Electronics Co., Ltd. | Electronic device and system which provides service based on voice recognition |
Also Published As
| Publication number | Publication date |
|---|---|
| US20210327415A1 (en) | 2021-10-21 |
| DE102020215954A1 (en) | 2021-10-21 |
| KR20210130024A (en) | 2021-10-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11615787B2 (en) | Dialogue system and method of controlling the same | |
| CN111933129B (en) | Audio processing method, language model training method and device and computer equipment | |
| US12230268B2 (en) | Contextual voice user interface | |
| US20240153489A1 (en) | Data driven dialog management | |
| US11514886B2 (en) | Emotion classification information-based text-to-speech (TTS) method and apparatus | |
| US10878807B2 (en) | System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system | |
| US10453117B1 (en) | Determining domains for natural language understanding | |
| CN108989341B (en) | Voice autonomous registration method and device, computer equipment and storage medium | |
| CN111159364B (en) | Dialogue system, dialogue device, dialogue method and storage medium | |
| US11574637B1 (en) | Spoken language understanding models | |
| CN114220461A (en) | Customer service call guiding method, device, equipment and storage medium | |
| CN112309406B (en) | Voiceprint registration method, device and computer-readable storage medium | |
| US11450320B2 (en) | Dialogue system, dialogue processing method and electronic apparatus | |
| CN115497465B (en) | Voice interaction method, device, electronic equipment and storage medium | |
| CN107967916A (en) | Determine voice relation | |
| US11978438B1 (en) | Machine learning model updating | |
| US11295733B2 (en) | Dialogue system, dialogue processing method, translating apparatus, and method of translation | |
| CN113763992B (en) | Voice evaluation method, device, computer equipment and storage medium | |
| US11804225B1 (en) | Dialog management system | |
| US11756550B1 (en) | Integration of speech processing functionality with organization systems | |
| US20040006469A1 (en) | Apparatus and method for updating lexicon | |
| US11551666B1 (en) | Natural language processing | |
| US12462798B1 (en) | Evaluation of speech processing components | |
| KR102915192B1 (en) | Dialogue system, dialogue processing method and electronic apparatus | |
| US12488184B1 (en) | Alternative input representations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: KIA MOTORS CORPORATION, KOREA, REPUBLIC OF. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PARK, YOUNGMIN;KIM, SEONA;LEE, JEONG-EOM;REEL/FRAME:054592/0929. Effective date: 20201120. Owner name: HYUNDAI MOTOR COMPANY, KOREA, REPUBLIC OF. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PARK, YOUNGMIN;KIM, SEONA;LEE, JEONG-EOM;REEL/FRAME:054592/0929. Effective date: 20201120 |
| | FEPP | Fee payment procedure | ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STCF | Information on status: patent grant | PATENTED CASE |