WO2021254838A1 - Driving companion comprising a natural language understanding system and method for training the natural language understanding system - Google Patents

Driving companion comprising a natural language understanding system and method for training the natural language understanding system

Info

Publication number
WO2021254838A1
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
historical
context
driving companion
companion
Prior art date
Application number
PCT/EP2021/065372
Other languages
French (fr)
Inventor
Yukun MA
Vinay Vishnumurthy Adiga
Original Assignee
Continental Automotive Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Continental Automotive Gmbh filed Critical Continental Automotive Gmbh
Publication of WO2021254838A1 publication Critical patent/WO2021254838A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Definitions

  • This invention generally relates to a driving companion and specifically relates to natural language understanding systems of a driving companion.
  • Driving a vehicle can be a lonesome and boring task, especially on long-distance trips.
  • Electronic companions conventionally used for domestic purposes have been proposed to assist drivers on such long-distance trips.
  • Electronic companions may be built to help the driver accomplish certain tasks and for social interaction (e.g., chit-chat or exhibiting empathy).
  • During any interaction with the electronic companion, a conversation may become too long and complicated for the companion to understand and act on appropriately. It is therefore necessary to identify the most relevant contextual information in the conversation, as well as the correlation between dialogue turns, for the electronic companion to understand the dialogue effectively and thereby choose appropriate dialogue actions.
  • Natural language understanding (NLU) systems assess a user’s input and parse it into a structured form (e.g., recognizing the intention and the entities within the input). NLU systems are therefore a core component of an electronic companion or in-car assistant for understanding the user’s intention in a dialogue.
  • A challenge for NLU systems is that the user’s intention may be conveyed not only by language but also by less observable states, such as emotion, given the context of a conversation. Further complications arise in multi-party conversations, where users interrupt one another or switch topics, making it difficult for the NLU system to analyse the correlation between dialogue turns or the context of the conversation.
  • Most existing NLU systems or models analyse dialogue based on sequential input.
  • In one aspect, there is provided a driving companion comprising: a microphone to detect utterances from at least one occupant of a vehicle; and a natural language understanding system comprising: a disentanglement module configured to: receive a current utterance detected by the microphone; determine a historical sub-sequence, comprised in historical utterances, that has a highest probability of relevance to the current utterance; and merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance; and a classification layer configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category.
  • In another aspect, there is provided a method of training a natural language understanding system of a driving companion to determine appropriate actions, the method comprising: determining a historical sub-sequence, comprised in a database of historical utterances, that has a highest probability of relevance to a test utterance; merging the determined historical sub-sequence with the test utterance, thereby providing a context-dependent test utterance; and classifying the context-dependent test utterance into a predetermined category to optimize the action that the driving companion has to determine responsive to the classified category.
  • The historical sub-sequence that has the highest probability of relevance, with respect to context, to the current utterance is referred to herein as the “most relevant historical sub-sequence”.
  • In the present disclosure, relevance refers to being contextually relevant to a current utterance or test utterance, unless otherwise specified.
  • The term “most relevant historical sub-sequence” not only covers the one historical sub-sequence that is objectively the most relevant, but may also cover a historical sub-sequence that has the highest probability of relevance relative to the other historical sub-sequences being analysed.
  • The determined historical sub-sequence is not necessarily the sub-sequence immediately preceding the current or test utterance.
  • The context-dependent current or test utterance may comprise the current or test utterance and the determined historical sub-sequence.
  • The context-dependent current or test utterance may then be fed into the classification layer.
  • In known solutions, a current instance is typically classified together with a plurality of historical instances immediately preceding it.
  • However, the choice of historical instances used may not be the best or the most relevant for classifying a current instance. It is therefore an advantage of the present disclosure that the current utterance is fed into the classification layer along with the most relevant historical sub-sequence.
  • The analysis of the dialogue is therefore not constrained by its sequence or flow.
  • The flow of the dialogue may effectively be reorganized, or disentangled, in relation to the current or test utterance.
  • The correlation found between the current utterance and historical utterances advantageously provides the context needed for the current utterance.
  • The determined historical sub-sequence may advantageously be used as a contextual feature to represent the current utterance.
  • The disclosed context-dependent current utterance may additionally be represented, tagged or annotated by other known methods, such as neural net-based embeddings (e.g., semantic tagging) or human-designed feature templates.
  • The natural language understanding system may further comprise an utterance encoder configured to encode a current or test utterance into a feature vector.
  • The current or test utterance may be an audible utterance detected by a microphone.
  • The natural language understanding system may obtain other inputs from the occupant to supplement its understanding of the dialogue. The additional inputs may be converted into a feature vector that is received by the natural language understanding system, the disentanglement module or the utterance encoder.
  • The utterance encoder may be configured to combine more than one feature vector into a fused feature vector. The fused feature vector may be used in the determination of the most relevant historical sub-sequence.
  • Advantageously, inputs reflecting less observable states, such as gestures and facial expressions, that are useful for determining the most relevant historical sub-sequence may also be considered.
  • A more complete context for the current utterance may result, leading to a more appropriate action being determined and/or performed by the driving companion.
  • The determination step may comprise determining the conditional probability of relevance, to the current or test utterance, of each historical sub-sequence comprised in the historical utterances.
  • In the disclosed training method, the determination step may further comprise determining a distribution of the conditional probabilities of the historical sub-sequences and determining the most relevant historical sub-sequence based on the distribution.
  • The most relevant historical sub-sequence may be determined by selecting the historical sub-sequence with the highest conditional probability, for example based on the distribution.
  • Alternatively, the most relevant historical sub-sequence may be determined as a weighted combination of all historical sub-sequences (a soft selection).
  • The reorganization of the dialogue flow may be done in a task-specific and data-driven manner without requiring additional human annotation.
  • The disclosed training method may comprise backpropagation across the steps of the method, in order to minimize the error at each step.
  • During training, the classification layer may provide error feedback to the disentanglement module.
  • In a trained natural language understanding system, each step and each component may then work together with minimal error.
  • The classification layer of a trained natural language understanding system may be conditioned on the disentanglement module of the trained system.
  • Fig. 1 shows an illustration of a vehicle 100 in accordance with an embodiment of the invention.
  • Fig. 2 shows an illustration of a natural language understanding system 102 in accordance with an embodiment of the invention.
  • Fig. 3 shows an illustration of a disentanglement module 204’ of the natural language understanding system 102 used in a training method in accordance with an embodiment of the invention.
  • In the figures, like numerals denote like parts.
  • In an embodiment, there is provided a driving companion comprising a microphone to detect utterances from at least one occupant of a vehicle.
  • The driving companion further comprises a natural language understanding system.
  • The natural language understanding system comprises a disentanglement module and a classification layer.
  • The disentanglement module is configured to: receive a current utterance detected by the microphone; determine a historical sub-sequence, comprised in historical utterances, that has a highest probability of relevance to the current utterance; and merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance.
  • The classification layer is configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category. A minimal sketch of this pipeline is given below.
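The following is a minimal, illustrative Python sketch of this pipeline. It is not the patented implementation; the class names, dimensions and the bilinear relevance scorer are assumptions made for the example.

```python
import torch
import torch.nn as nn

class DisentanglementModule(nn.Module):
    """Scores K historical sub-sequences against the current utterance and
    merges the most relevant one into a context-dependent feature vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Bilinear(dim, dim, 1)  # hypothetical relevance scorer
        self.merge = nn.Linear(2 * dim, dim)    # merges utterance with chosen context

    def forward(self, utterance: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # utterance: (dim,); history: (K, dim) sub-sequence feature vectors
        scores = self.scorer(history, utterance.expand_as(history)).squeeze(-1)
        probs = torch.softmax(scores, dim=0)    # relevance distribution over 1..K
        context = history[probs.argmax()]       # most relevant sub-sequence
        return self.merge(torch.cat([utterance, context]))

class NLUSystem(nn.Module):
    def __init__(self, dim: int, n_categories: int):
        super().__init__()
        self.disentangle = DisentanglementModule(dim)
        self.classify = nn.Linear(dim, n_categories)  # classification layer

    def forward(self, utterance: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        context_dependent = self.disentangle(utterance, history)
        return self.classify(context_dependent)       # logits over categories
```

With, say, dim 256 and four categories, `NLUSystem(256, 4)(torch.randn(256), torch.randn(5, 256))` returns category logits for one current utterance scored against five historical sub-sequences.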
  • Fig. 1 shows a vehicle 100 in accordance with an embodiment of the invention.
  • The vehicle 100 comprises a microphone 104.
  • The microphone 104 may be positioned in a suitable location to detect utterances from at least one occupant of the vehicle 100. There may be one or more microphones 104 positioned at suitable locations in the vehicle 100.
  • The vehicle comprises a driving companion.
  • The driving companion may be a system comprising the one or more microphones 104 and a natural language understanding (NLU) system 102.
  • The driving companion may also be referred to as an electronic companion or a digital assistant.
  • The driving companion may be designed to accomplish tasks indicated or expressed by a user of the companion, such as an occupant, driver or passenger of the vehicle.
  • The driving companion may be designed to perform instructions, such as recommending a restaurant at the most convenient location, or to execute vehicle-related instructions, e.g. changing the radio station.
  • The driving companion may be designed to chat with one or more occupants of the vehicle.
  • The driving companion may be designed to proactively provide information, such as in response to a conversation between the driving companion and one or more occupants, or a conversation between occupants.
  • The driving companion may be designed to participate in a conversation between the driving companion and one or more occupants, or a conversation between occupants.
  • The driving companion may be designed to respond upon detecting a trigger, such as a spoken wake word or a physical input, e.g. a button press.
  • The driving companion may be a computing device, such as a general computing device or an embedded system.
  • The driving companion may be integrated in the vehicle, e.g. as an OEM part or an in-vehicle digital assistant, or integrated into a vehicle system such as an in-vehicle navigation system.
  • Alternatively, the driving companion may be implemented in a separate device, e.g. a user device such as a smartphone.
  • In that case, the microphone 104 may be the microphone of that device.
  • The separate device then communicates with the vehicle to perform vehicle-related actions and/or to make use of vehicle systems.
  • The NLU system 102 may be a computing device, such as a general computing device or an embedded system.
  • The NLU system 102 may be a computing device separate from the driving companion. In that case, the NLU system 102 may be integrated in the vehicle or implemented in a separate device. Alternatively, the NLU system 102 may be part of the computing device of the driving companion.
  • In an example, a user may utter a command.
  • The microphone 104 may detect such an utterance from the user.
  • The command may, for example, be a command to change the radio station that is playing on the vehicle’s infotainment system.
  • Before the current command, the user may have uttered a command to recommend a restaurant at a convenient location.
  • To determine an appropriate action for the current command, the driving companion may draw a conclusion from previous commands relating to radio stations or, more broadly, from previous commands relating to infotainment.
  • The driving companion may draw a conclusion from the immediately preceding command, the restaurant recommendation, to determine an appropriate action.
  • The driving companion may draw a conclusion from all previous commands to determine an appropriate action.
  • However, as not all previous commands are relevant to the current command of changing the radio station, the NLU system 102 may determine which previous command or commands have the highest probability of relevance to the current command.
  • Each command may be referred to as a sub-sequence.
  • Depending on how the user utters a command, e.g. in one go or as multiple utterances for the one command, a sub-sequence may comprise one utterance or multiple utterances.
  • An utterance may be defined as a group of words or sounds detected in one time step, as determined by the microphone or by the NLU system 102, depending on the design of the driving companion.
  • The user may be having a conversation with the driving companion or with one or more other occupants of the vehicle.
  • The term “conversation” may be used synonymously with the term “dialogue”, particularly with respect to a conversation between the driving companion and the user and/or one or more other occupants.
  • The microphone 104 may detect such a conversation.
  • The conversation may cover subjects such as the music that the user likes.
  • The portion of the conversation relating to the music that the user likes may be referred to as a sub-sequence.
  • A portion of the conversation relating to another subject may be referred to as another sub-sequence.
  • A change of sub-sequence may be referred to as a dialogue turn.
  • A dialogue turn may also include a change from the user uttering one command to the user uttering another command.
  • Each dialogue turn may comprise a user utterance and a companion-generated audio reply. Consequently, the conversation may comprise a plurality of sub-sequences.
  • Referring to the example above, a potentially relevant sub-sequence may include the sub-sequence relating to the music that the user likes.
  • The driving companion may draw a conclusion from previous sub-sequences to determine an appropriate action. The driving companion may determine which previous sub-sequence or sub-sequences have the highest probability of relevance to the current command or utterance.
  • The utterances detected by the microphone 104 may be converted into audio signals that are transmitted to the NLU system 102.
  • The NLU system 102 is illustrated in Fig. 2 in accordance with an embodiment of the invention.
  • The NLU system 102 comprises a disentanglement module 204 and a classification layer 212.
  • The disentanglement module 204 is configured to receive a current utterance 201 detected by the microphone 104.
  • The current utterance 201 may be in the form of audio signals received from the microphone 104.
  • The NLU system 102 may further comprise an utterance encoder 202 configured to encode a current utterance 201 detected by the microphone 104 into a feature vector 203.
  • The audio signals transmitted to the NLU system 102 may be received by the utterance encoder 202.
  • The audio signals transmitted to the NLU system 102 may be encoded by the utterance encoder 202 into a feature vector 203. A sketch of such an encoder follows.
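As an illustration only, the utterance encoder 202 could be realized as a small recurrent network over acoustic frames; the frame features, dimensions and the GRU choice below are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Encodes a sequence of acoustic frames into a single feature vector."""
    def __init__(self, in_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (1, T, in_dim), e.g. log-mel features of the detected audio
        _, h = self.rnn(frames)
        return h[-1, 0]  # final hidden state as the utterance feature vector 203

encoder = UtteranceEncoder()
feature_203 = encoder(torch.randn(1, 120, 80))  # 120 audio frames -> 256-d vector
```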
  • The vehicle or the driving companion may further comprise a camera 106 to detect gestures from the occupant. Additionally, or alternatively, the camera 106 may be configured to detect facial expressions of the occupant. Additionally, the camera may be configured to detect other movements of the occupant, such as those indicating drowsiness, drunkenness, attentiveness and/or disease.
  • The camera 106 may be positioned in a suitable location to detect such gestures, facial movements and/or body movements from at least one occupant of the vehicle 100. There may be one or more cameras 106 positioned at suitable locations in the vehicle 100.
  • The image or video frame captured by the camera 106 may be converted into a feature vector by an image processing module (not shown in the figures).
  • The image processing module may be part of the camera 106 or of the driving companion.
  • The image processing module may include an image encoder configured to encode the image or video frame into a feature vector.
  • The utterance encoder 202 may be further configured to receive the feature vector encoded from an image of a gesture from the occupant obtained from the camera 106.
  • The vehicle or the driving companion may further comprise a display 108 to display information for viewing by occupant(s) of the vehicle 100.
  • The display 108 may be a human-machine interface.
  • The display 108 may be a display screen or an in-vehicle display screen.
  • A user may interact with the display 108 by touching or pressing an option displayed on the display 108.
  • The display 108 may allow the user to select letters displayed on the display 108 to formulate a query.
  • A user may interact with the display 108 by pressing one or more buttons and/or turning one or more dials to select option(s) displayed on the display 108.
  • The display 108 may be connected to input devices (not shown in the figures), such as a keyboard or touchpad, to enable the user to type in a query.
  • The user may provide input via the display 108 for the driving companion to consider.
  • The display 108 may be positioned in a suitable location for viewing by occupant(s) of the vehicle 100. There may be one or more displays 108 positioned at suitable locations in the vehicle 100.
  • The input or text received by the display 108 may be converted into a feature vector by the display 108 or by the driving companion.
  • The display 108 or the driving companion may include a text encoder configured to encode the input or text received by the display 108 into a feature vector.
  • The utterance encoder 202 may be further configured to receive the feature vector encoded from input received by the display 108.
  • The encoders disclosed herein may comprise a machine learning model. Suitable machine learning models include neural networks and recurrent neural networks.
  • The encoders disclosed herein may reduce a set of data, such as audio, text or image data, into representations, in the form of feature vectors, for feature detection.
  • The utterance encoder 202 may be configured to combine more than one feature vector into a fused feature vector.
  • The feature vectors derived from different modes, e.g. audio, text or images, may be input into the utterance encoder 202 via different input channels.
  • The input into the utterance encoder 202 may therefore be multi-modal.
  • The feature vectors from the different input channels may be combined, or fused, into a unified feature representation or fused feature vector, as sketched below.
  • The feature vector 203 or, where the input is multi-modal, the fused feature vector 203 may be fed into the disentanglement module 204.
  • The current utterance received by the disentanglement module 204 may be a current utterance feature vector 203.
  • The current utterance feature vector 203 may be a fused feature vector.
  • The disentanglement module 204 may be configured to receive a current utterance feature vector 203 or a fused feature vector 203.
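One simple way to fuse the per-modality vectors, shown purely as an illustration (the patent does not fix a fusion method; concatenation followed by a linear projection is an assumption):

```python
import torch
import torch.nn as nn

audio = torch.randn(256)  # from the utterance encoder 202
image = torch.randn(128)  # from the image encoder (gesture / facial expression)
text = torch.randn(64)    # from the text encoder (display 108 input)

fuse = nn.Linear(256 + 128 + 64, 256)              # hypothetical fusion layer
fused_203 = fuse(torch.cat([audio, image, text]))  # unified fused feature vector
```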
  • The driving companion or the NLU system 102 may be a computing device that typically comprises, among other components, one or more processors connected to computer-readable storage media.
  • The historical utterances may be stored in storage media of the NLU system 102 or in storage media of the driving companion.
  • The disentanglement module 204 may be configured to retrieve the historical utterances from the storage media.
  • The historical utterances may be stored in a database in the storage media.
  • The historical utterances may be maintained in the storage media as sub-sequences.
  • The historical utterances may be grouped as sub-sequences and stored in the storage media.
  • A sub-sequence may comprise one or more utterances related to each other.
  • Historical utterances may be stored for up to a set time period, depending on the design of the driving companion.
  • Historical utterances may be stored for each conversation.
  • Historical utterances may be stored for up to a session, e.g. for each time the driving companion is turned on.
  • Historical utterances may be stored in non-transitory computer-readable storage media and may persist in storage even after the driving companion is turned off. A minimal sketch of such a history store follows.
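A minimal sketch of a history store that groups utterances into sub-sequences, assuming a bounded per-session buffer; the grouping policy and the size limit are illustrative assumptions:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SubSequence:
    """One or more related utterances grouped as a dialogue sub-sequence."""
    utterances: list = field(default_factory=list)

class HistoryStore:
    """Keeps historical utterances grouped as sub-sequences for one session."""
    def __init__(self, max_subsequences: int = 50):
        self.subsequences = deque(maxlen=max_subsequences)

    def add(self, utterance: str, new_turn: bool) -> None:
        # A dialogue turn (e.g. a topic change) starts a new sub-sequence.
        if new_turn or not self.subsequences:
            self.subsequences.append(SubSequence())
        self.subsequences[-1].utterances.append(utterance)

store = HistoryStore()
store.add("recommend a restaurant nearby", new_turn=True)
store.add("something Italian", new_turn=False)
store.add("change the radio station", new_turn=True)
```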
  • The disentanglement module 204 may be configured to determine, in step 206, the historical sub-sequence, out of the historical utterances (or historical sub-sequences) retrieved from the storage media, that has the highest probability of relevance to the current utterance 201.
  • The determination step 206 may comprise: determining the conditional probability of relevance to the current utterance 201 of each historical sub-sequence comprised in the historical utterances, and determining the historical sub-sequence with the maximum conditional probability.
  • The most relevant historical sub-sequence may be inferred from the dialogue history, or historical utterances, based on a conditional probability distribution conditioned on the original dialogue flow and on the historical utterances concatenated to each sub-sequence.
  • The selection may be modelled as a variable taking a value from 1 to K, given the current utterance, the K sub-sequences and the original dialogue flow.
  • The conditional probability may be determined using a softmax operator, as sketched below.
  • The determination step 206 creates a task-oriented disentanglement of the dialogue flow to facilitate effective understanding of the dialogue.
  • The reorganization of the dialogue flow, i.e. the non-sequential analysis of the dialogue flow, in a task-specific and data-driven manner reduces or eliminates the need for additional human annotation.
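A small numeric illustration of the determination step 206; the relevance scores are made up, and only the softmax-then-select pattern is taken from the description:

```python
import torch

# Hypothetical relevance scores of K = 4 historical sub-sequences
# with respect to the current utterance (higher = more relevant).
scores = torch.tensor([0.2, 1.7, -0.5, 0.9])

probs = torch.softmax(scores, dim=0)  # conditional probability distribution
# probs ~ [0.13, 0.56, 0.06, 0.25]; the selection variable takes a value in 1..K
k_star = int(probs.argmax()) + 1      # hard selection: k* = 2

# Soft alternative: a weighted combination of all K sub-sequences,
# e.g. context = probs @ history, where history has shape (K, dim).
```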
  • The disentanglement module 204 may be configured to merge, in step 208, the current utterance 201 with the historical sub-sequence determined to have the highest probability of relevance.
  • The merging step 208 results in a context-dependent current utterance 209.
  • The merging step 208 may therefore be referred to as a context-encoding step.
  • The feature vector 203 of the current utterance, or the fused feature vector 203, may be merged with the feature vector of the most relevant historical sub-sequence to generate a context-dependent feature vector 209, for example as sketched below.
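As one possible realization of the merging step 208 (concatenation followed by a linear projection is an assumption; the description only requires that the two vectors be merged):

```python
import torch
import torch.nn as nn

dim = 256
merge = nn.Linear(2 * dim, dim)  # hypothetical context-encoding projection

current_203 = torch.randn(dim)   # feature vector of the current utterance
context = torch.randn(dim)       # feature vector of the selected sub-sequence

context_dependent_209 = merge(torch.cat([current_203, context]))
```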
  • Prior NLU systems or models encode the dialogue contexts based on sequential input, which contrasts with the present disclosure.
  • Although non-sequential models such as attention or transformer models provide non-sequential encoding processes, they are not able to discover the underlying relation between dialogue turns, in contrast with the present disclosure. Existing NLU systems or models are therefore unable to disentangle the dialogue flow to effectively facilitate understanding.
  • The historical sub-sequence determined to have the highest probability of relevance to the current utterance 201 may provide the most relevant contextual information for the current utterance 201. The most relevant historical sub-sequence may therefore be the link between dialogue turns, thereby facilitating effective understanding of the dialogue and supporting the choice of appropriate dialogue actions by the driving companion.
  • The NLU system 102 may be required to extract information from the dialogue, such as the user’s intention, emotion, topic and constraints.
  • The context of the dialogue may be necessary to interpret such information accurately. For example, assume that Mary Lee and Mary Chan are both found in the user’s address book. When the user asks the driving companion to “call Mary’s cell phone” in the first turn, the context does not indicate which Mary is intended; hence there is ambiguity. However, if Mary Chan was mentioned in a historical utterance, that historical utterance may be considered the most relevant to the current utterance “call Mary’s cell phone”, depending on the other historical sub-sequences.
  • Information extracted from a dialogue may include intent.
  • Intent refers to an overall purpose of the dialogue or the task to be performed by the driving companion, such as instructions like placing a phone call, questions like searching for a restaurant or knowing more about a place, or responses.
  • Information extracted from a dialogue may include a topic, such as calling someone to schedule a meeting, eating at a restaurant at a convenient location between meetings, or exploring a new town.
  • Information extracted from a dialogue may include an emotion, which may supplement the determination of an appropriate action to be performed by the driving companion.
  • The context-dependent current utterance 209 may be fed into the classification layer 212 of the NLU system 102.
  • The classification layer 212 of the NLU system 102 may be configured to classify the current utterance, or the context-dependent current utterance 209, into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category.
  • Since the context-dependent current utterance 209 already includes the most relevant historical sub-sequence, other historical utterances or historical sub-sequences need not be fed into the classification layer 212.
  • The classification layer 212 may advantageously require only the context-dependent current utterance 209 in order to determine a category.
  • Alternatively, the context-dependent current utterance 209, along with a plurality of historical utterances, e.g. the next most relevant historical utterances, may be fed into the classification layer 212.
  • The classification layer 212 may then be configured to classify the context-dependent current utterance 209 and a plurality of historical utterances into a predetermined category.
  • Predetermined categories may be selected to assist in interpreting a dialogue or conversation, or to extract constituents of a dialogue or conversation. The predetermined categories may assist in determining an appropriate action for the driving companion, given the context of the dialogue or conversation. The predetermined categories may be selected from intention, emotion or topic, and may be selected depending on a framework for determining an appropriate action to be performed by the driving companion. A category may be assigned to each current utterance 201. If multiple categories are assigned, the NLU system 102 may combine the multiple outputs of the classification layer 212 into a feature vector 213 for input into the framework. The output 213 of the classification layer 212, or of the NLU system 102, may be fed into the framework for the driving companion to determine and/or perform an appropriate action.
  • The framework may be any suitable one, for example based on predefined rules or learned from data using an action model.
  • Rules may be defined based on the categories, for example intention, topic and emotion.
  • An action model may be a machine learning model that learns from data what an optimal action would be, given the input from the NLU system 102.
  • NLU systems typically fill in slots predefined for a certain task. Slots may accept many types of data, such as strings, numbers, or pointers to other slots.
  • The classification layer 212 may output a probability distribution over the predetermined categories (e.g., intentions, emotions or topics) in order to select the predetermined category that fits the context-dependent current utterance 209.
  • The classifying step of the classification layer 212 may comprise: determining a probability distribution over the predetermined categories; and determining the most probable predetermined category, based on the probability distribution, for the driving companion to determine an appropriate action responsive to the classified category.
  • The classified category or categories may be converted, or encoded, into a feature vector 213.
  • The feature vector 213 may be the output of the NLU system 102 for the driving companion to determine and/or perform an appropriate action responsive to the classified category.
  • The output 213 of the classification layer 212 or of the NLU system 102 may be fed into downstream module(s), such as the framework disclosed above, in order to determine and/or perform an appropriate action for the driving companion. A sketch of the classifying step follows.
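An illustrative sketch of the classifying step; the category names and vector size are invented for the example:

```python
import torch
import torch.nn as nn

categories = ["change_radio", "call_contact", "find_restaurant", "chit_chat"]

classifier = nn.Linear(256, len(categories))  # classification layer 212
context_dependent_209 = torch.randn(256)      # from the merging step 208

probs = torch.softmax(classifier(context_dependent_209), dim=0)
predicted = categories[int(probs.argmax())]   # most probable predetermined category
```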
  • In an embodiment, a method of training a natural language understanding system of a driving companion to determine appropriate actions comprises: determining a historical sub-sequence, comprised in a database of historical utterances, that has a highest probability of relevance to a test utterance; merging the determined historical sub-sequence with the test utterance, thereby providing a context-dependent test utterance; and classifying the context-dependent test utterance into a predetermined category to optimize the action that the driving companion has to determine responsive to the classified category.
  • The disclosed method may be performed by the driving companion or by the NLU system disclosed herein.
  • The disclosed method may also be performed by other computing devices, e.g. in a computer lab.
  • The database of historical utterances may be stored in a computer-readable storage medium, for example in the storage media of the computing device on which the method is performed.
  • Test utterances may be obtained from any suitable source, such as commercially obtained databases or testing in a vehicle. Test utterances obtained during testing may be detected as described herein. The test utterances may be stored in the storage media together with the database of historical utterances, and may be included in the database of historical utterances. The test utterances may be converted into audio signals that are input into the NLU system.
  • The NLU system may be one described herein, such as the NLU system 102 illustrated in Fig. 2.
  • The test utterance 201 may be encoded into a feature vector 203 by the NLU system 102.
  • Other test data may be obtained from any suitable source and may be provided as described above, e.g. via a camera or a display.
  • Test data may be provided to supplement understanding of the test utterance.
  • The test utterance 201 and the other test data, if included, may be combined into a fused feature vector.
  • The test utterance feature vector 203 or the fused feature vector 203 may be fed into a disentanglement module 204’, illustrated in Fig. 3 in accordance with an embodiment of the invention.
  • A historical sub-sequence comprised in the database of historical utterances that is most relevant to the test utterance may then be determined.
  • The determination step 206’ may be performed by the disentanglement module 204’.
  • The determination step 206’ may comprise: determining the conditional probability of relevance to the test utterance 201 of each historical sub-sequence comprised in the historical utterances, determining a distribution of the conditional probabilities of the historical sub-sequences, and determining the most relevant historical sub-sequence based on the distribution.
  • The conditional probability may be as disclosed herein.
  • The selection may be modelled as a variable taking a value from 1 to K, given the test utterance, the K sub-sequences and the original flow of the test utterances.
  • The conditional probability may be determined using a softmax operator.
  • A sub-sequence may be sampled from the conditional distribution before proceeding to process the next test utterance.
  • Sampling may be performed in the step of determining a distribution of the conditional probabilities of the historical sub-sequences.
  • The distribution may provide an indication of whether the determined conditional probabilities are plausible rather than anomalies. Further advantageously, backpropagation through the flow of test utterances is not hindered by the sampling of the relevance probabilities of the historical sub-sequences.
  • The distribution may be a Gumbel distribution, and sampling may comprise computing a Gumbel approximation.
  • The sampling process may thereby be turned into a differentiable approximation that does not hinder backpropagation.
  • The most relevant historical sub-sequence may then be determined based on the distribution.
  • The end-to-end training of the NLU system 102, including the disentanglement module 204 and the classification layer 212, may be enabled by re-parameterizing the distribution over the discrete variables of historical utterances or historical sub-sequences, thereby disentangling or reorganizing the flow of the utterances. A sketch of this sampling step follows.
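The Gumbel-softmax (re-parameterization) step can be sketched with PyTorch's built-in `gumbel_softmax`; the scores and temperature below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Relevance logits for K = 4 historical sub-sequences (requires_grad so
# that the error can be backpropagated into the relevance scorer).
scores = torch.tensor([0.2, 1.7, -0.5, 0.9], requires_grad=True)

# hard=True yields a one-hot sample in the forward pass while gradients
# flow through the soft Gumbel-softmax relaxation (straight-through).
sample = F.gumbel_softmax(scores, tau=1.0, hard=True)

history = torch.randn(4, 256)  # feature vectors of the K sub-sequences
context = sample @ history     # selects one row, yet remains differentiable
```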
  • The most relevant historical sub-sequence may then be merged, in step 208 as described herein, with the test utterance, thereby providing a context-dependent test utterance.
  • In other words, the test utterance may be encoded with a context feature vector to yield the context-dependent test utterance.
  • The context-dependent test utterance may be fed into the classification layer 212 of the NLU system 102 as described herein.
  • The classification layer 212 may be configured to classify the test utterance, or the context-dependent test utterance, into a predetermined category as described herein.
  • The predetermined categories may be selected from intention, emotion or topic.
  • The classification layer 212 may output a probability distribution over the predetermined categories in order to select the predetermined category that fits the context-dependent test utterance.
  • The classifying step of the classification layer 212 may comprise: determining a probability distribution over the predetermined categories; and determining the most probable predetermined category, based on the probability distribution, to optimize the action that the driving companion has to determine responsive to the classified category.
  • The NLU system 102 may comprise a machine learning model, such as a model based on modern neural networks.
  • The disentanglement module 204’ and the classification layer 212 may each comprise a machine learning model, such as a model based on modern neural networks.
  • The model may be based on modern neural networks, with all components of the NLU system 102 being differentiable so that gradients can be computed.
  • The determination of the conditional probability of relevance, with respect to context, to the test utterance 201 of each historical sub-sequence comprised in the historical utterances, and the determination of the distribution of the conditional probabilities of the historical sub-sequences, are facilitated by the softmax operator and the Gumbel distribution.
  • This Gumbel-softmax trick in the determination step 206’ enables the updating of each component of the NLU system 102 by backpropagation.
  • The model yields a prediction on a label of interest.
  • An error is then computed, which represents the difference between the prediction and the true label (human annotation). This error is usually measured by a loss function that is differentiable with respect to the parameters of all the components.
  • The model updates each parameter of each component to minimize this error.
  • The output of the NLU system 102 may be fed into downstream module(s) in order to determine and/or optimize the action that the driving companion has to perform.
  • The NLU system 102 trained according to the disclosed method may be incorporated into the driving companion. A sketch of one training step, tying the above together, follows.
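A minimal sketch of one end-to-end training step, reusing the hypothetical `NLUSystem` from the earlier pipeline sketch. For gradients to reach the relevance scorer, the hard argmax selection in that sketch would be replaced during training by the Gumbel-softmax sample shown above; the optimizer, learning rate and loss are illustrative choices.

```python
import torch
import torch.nn as nn

model = NLUSystem(dim=256, n_categories=4)  # from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()             # loss differentiable across all components

def train_step(utterance, history, true_label):
    logits = model(utterance, history)               # prediction on the label of interest
    loss = loss_fn(logits.unsqueeze(0), true_label)  # error vs. the human annotation
    optimizer.zero_grad()
    loss.backward()   # backpropagation across classification layer and disentanglement module
    optimizer.step()  # update each parameter to minimize the error
    return float(loss)

loss = train_step(torch.randn(256), torch.randn(5, 256), torch.tensor([2]))
```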

Abstract

There is provided a driving companion comprising: a microphone to detect utterances from at least one occupant of a vehicle; and a natural language understanding system comprising: a disentanglement module configured to: receive a current utterance detected by the microphone; determine a historical sub-sequence comprised in historical utterances that has a highest probability of relevance to the current utterance; merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance; and a classification layer configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category. There is also provided a method of training a natural language understanding system of a driving companion to determine appropriate actions.

Description

DRIVING COMPANION COMPRISING A NATURAL LANGUAGE UNDERSTANDING SYSTEM AND METHOD FOR TRAINING THE NATURAL LANGUAGE UNDERSTANDING SYSTEM
Field of Invention
[001] This invention generally relates to a driving companion and specifically relates to natural language understanding systems of a driving companion.
Background of Invention
[002] Driving a vehicle can be a lonesome and boring task, especially on long-distance trips. Electronic companions conventionally used for domestic purposes have been proposed to assist drivers on such long-distance trips.
[003] Electronic companions may be built to help the driver accomplish certain tasks and for social interaction (e.g., chit-chat or exhibit empathy). During any interaction with the electronic companion, it is common that a conversation might become too long and complicated for the electronic companion to understand and provide an appropriate action. It is, therefore, necessary to discover the most relevant contextual information in the conversation as well as discover the correlation between dialogue turns, in order for the electronic companion to effectively facilitate its understanding and to thereby choose appropriate dialogue actions.
[004] Natural language understanding (NLU) systems assess users’ input and parse the input into a structural form (e.g., recognizing intention and recognizing entities within the input). Therefore, NLU systems are a core component of an electronic companion or an in-car assistant to understand a user’s intention of the dialogue. A challenge for NLU systems is that the user’s intention may be conveyed not only by language but also by other less observable states, such as emotion, given the context of a conversation. On top of that, further complications can happen during multi-party conversations, where users interrupt one another during conversation or switch topics, thereby making it difficult for the NLU system to analyse the correlation between dialogue turns or analyse the context of the conversation.
[005] Most existing NLU systems or models analyse dialogue based on sequential input. The correlation between dialogue turns is therefore constrained by the original sequence of utterances, which oversimplifies conversational behaviours of human beings by failing to consider complexities such as discourse relation. Although non-sequential models, such as an attention model or transformer model, might somewhat relieve the problem by non-sequential analyses, they are not able to discover the underlying relation between dialogue turns.
[006] Accordingly, there is a need to provide an electronic companion that overcomes or at least ameliorates one or more of the disadvantages discussed above and other disadvantages.
Summary
[007] It is an object to provide a driving companion to address the problems discussed above. In particular, it is an object to provide a driving companion that is able to disentangle a dialogue flow and ultimately choose an appropriate dialogue action, thereby addressing the problems discussed above.
[008] To accomplish this and other objects, there is provided, in an aspect, a driving companion comprising: a microphone to detect utterances from at least one occupant of a vehicle; and a natural language understanding system comprising: a disentanglement module configured to: receive a current utterance detected by the microphone; determine a historical sub-sequence comprised in historical utterances that has a highest probability of relevance to the current utterance; merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance; and a classification layer configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category.
[009] In another aspect, there is provided a method of training a natural language understanding system of a driving companion to determine appropriate actions, the method comprising: determining a historical sub-sequence comprised in a database of historical utterances that has a highest probability of relevance to a test utterance; merging the determined historical sub-sequence with the test utterance, thereby providing a context-dependent test utterance; classifying the context-dependent test utterance into a predetermined category to optimize the action that the driving companion has to determine responsive to the classified category.
[010] Known solutions analyse past utterances to obtain the context of a current utterance. However, the past utterances used in such analyses may not be the best or the most relevant utterance that should be considered for determining the context of the current utterance. It is therefore an advantage of the present disclosure that the most relevant historical sub-sequence, in particular with respect to context, to the current utterance or test utterance is selected to provide the context to the current utterance or test utterance respectively. By identifying the historical sub-sequence most relevant to the current or test utterance, the correlation between dialogue turns may advantageously be discovered. Thus, the present disclosure provides an improved natural language understanding system that is better able to facilitate effective understanding of a dialogue. The present disclosure also provides an improved driving companion that is better able to determine and subsequently perform an appropriate action responsive to an utterance.
[011] The historical sub-sequence that has the highest probability of relevance with respect to context to the current utterance is referred to herein as the “most relevant historical sub-sequence”. Thus, in the context of the present disclosure, relevance refers to being contextually relevant to a current utterance or test utterance, unless otherwise specified. The term “most relevant historical sub-sequence” not only includes one historical sub-sequence that is the most relevant or has the highest probability of relevance, but may also include a historical sub-sequence that has a highest probability of relevance relative to the other historical sub-sequences that are being analysed.
[012] The determined historical sub-sequence may not necessarily be the sub-sequence immediately preceding the current or test utterance. The context-dependent current or test utterance may comprise the current or test utterance and the determined historical sub-sequence. The context-dependent current or test utterance may then be fed into the classification layer. In known solutions, a current instance and a plurality of historical instances immediately preceding the current instance are typically classified. However, as mentioned above, the choice of historical instances used may not be the best or the most relevant for classification of a current instance. It is therefore an advantage of the present disclosure that the current utterance, along with the most relevant historical sub-sequence that is chosen, is fed into the classification layer. Merging the most relevant historical sub-sequence with the current utterance effectively disentangles the flow of a dialogue. The analysis of the dialogue may therefore not be constrained by its sequence or flow. The flow of the dialogue may effectively be reorganized or disentangled in relation to the current or test utterance. The correlation found between the current utterance and historical utterances advantageously provides the context needed for the current utterance. The determined historical sub-sequence may advantageously be used as a contextual feature to represent the current utterance. The disclosed context-dependent current utterance may additionally be represented, tagged or annotated by other known methods, such as by neural net-based embeddings, e.g. semantic tagging, or by human-designed feature templates.
[013] The natural language understanding system may further comprise an utterance encoder configured to encode a current or test utterance into a feature vector. The current or test utterance may be an audible utterance detected by a microphone. The natural language understanding system may obtain other inputs from the occupant to supplement understanding of the dialogue. The additional inputs may be converted into a feature vector that is received by the natural language understanding system, the disentanglement module or the utterance encoder. The utterance encoder may be configured to combine more than one feature vector into a fused feature vector. The fused feature vector may be used in the determination of the most relevant historical sub-sequence. Advantageously, inputs, including less observable states such as gestures and facial expressions, that are useful for the determination of the most relevant historical sub-sequence may also be considered. A more complete context to the current utterance may result, leading to a more appropriate action being determined and/or performed by the driving companion.
[014] The determination step may comprise determining the conditional probability of relevance to the current or test utterance of each historical sub-sequence comprised in the historical utterances. In the disclosed training method, the determination step may further comprise determining a distribution of the conditional probabilities of the historical sub-sequences and determining the most relevant historical sub-sequence based on the distribution. The most relevant historical sub-sequence may be determined by selecting the historical sub-sequence with the highest conditional probability, for example based on the distribution. Alternatively, the most relevant historical sub-sequence may be determined as a weighted combination of all historical sub-sequences. Advantageously, the reorganization of the dialogue flow may be done in a task-specific and data-driven manner without requiring additional human annotation. The disclosed training method may comprise backpropagation across the steps of the method, in order to minimize any error in each step. During training, the classification layer may provide error feedback to the disentanglement module. Advantageously, each step executed by a trained natural language understanding system, or each component of a trained natural language understanding system, may work together with minimal error. The classification layer of a trained natural language understanding system may be conditioned on the disentanglement module of the trained system.
Brief Description of Drawings
[015] Fig. 1 shows an illustration of a vehicle 100 in accordance with an embodiment of the invention.
[016] Fig. 2 shows an illustration of a natural language understanding system 102 in accordance with an embodiment of the invention.
[017] Fig. 3 shows an illustration of a disentanglement module 204’ of the natural language understanding system 102 used in a training method in accordance with an embodiment of the invention.
[018] In the figures, like numerals denote like parts.
Detailed Description
[019] Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. The detailed description of this invention will be provided for the purpose of explaining the principles of the invention and its practical application, thereby enabling a person skilled in the art to understand the invention for various exemplary embodiments and with various modifications as are suited to the particular use contemplated. The detailed description is not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Modifications and equivalents will be apparent to practitioners skilled in this art and are encompassed within the spirit and scope of the appended claims.
[020] In an embodiment, there is provided a driving companion. The driving companion comprises a microphone to detect utterances from at least one occupant of a vehicle. The driving companion further comprises a natural language understanding system. The natural language understanding system comprises a disentanglement module and a classification layer. The disentanglement module is configured to receive a current utterance detected by the microphone; determine a historical sub-sequence comprised in historical utterances that has a highest probability of relevance to the current utterance; and merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance. The classification layer is configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category.
[021] The embodiment is illustrated in Fig. 1, which shows a vehicle 100 in accordance with an embodiment of the invention. The vehicle 100 comprises a microphone 104. The microphone 104 may be positioned in a suitable location to detect utterances from at least one occupant of the vehicle 100. There may be one or more microphones 104 positioned at suitable locations in the vehicle 100. The vehicle comprises a driving companion. The driving companion may be a system comprising the one or more microphones 104 and a natural language understanding (NLU) system 102.
[022] The driving companion may be referred to as an electronic companion or a digital assistant. The driving companion may be designed to accomplish tasks indicated or expressed by a user of the companion, such as an occupant, driver or passenger of the vehicle. The driving companion may be designed to perform instructions, such as to recommend a restaurant at the most convenient location or to execute vehicle-related instructions e.g. to change radio station. Alternatively, or additionally, the driving companion may be designed to chat with one or more occupants of the vehicle. The driving companion may be designed to proactively provide information, such as in response to a conversation between the driving companion and one or more occupants, or a conversation between occupants. The driving companion may be designed to participate in a conversation between the driving companion and one or more occupants, or a conversation between occupants. The driving companion may be designed to respond upon detecting a trigger, such as a spoken wake word or a physical input e.g. a button.
[023] The driving companion may be a computing device, such as a general computing device or an embedded system. The driving companion may be integrated in the vehicle, e.g. an OEM part or an in-vehicle digital assistant or integrated into a vehicle system such as an in-vehicle navigation system. Alternatively, the driving companion may be implemented in a separate device, e.g. a user device such as a smartphone. In such case, the microphone 104 may be the microphone of such device. In such case, the separate device communicates with the vehicle to perform vehicle-related actions and/or to make use of vehicle systems.
[024] The NLU system 102 may be a computing device, such as a general computing device or an embedded system. The NLU system 102 may be a computing device separate from the driving companion. In such case, the NLU system 102 may be integrated in the vehicle or may be implemented in a separate device. Alternatively, the NLU system 102 may be part of the computing device of the driving companion.
[025] In an example, a user may be uttering a command. The microphone 104 may detect such an utterance from the user. The command may be, for example, a command to change the radio station that is playing on the vehicle's infotainment system. Before the current command, the user may have uttered a command to recommend a restaurant at a convenient location. To determine an appropriate action for the current command, the driving companion may draw a conclusion from previous commands relating to radio stations or, more broadly, previous commands relating to infotainment. The driving companion may draw a conclusion from the immediately preceding command of restaurant recommendation, or from all previous commands, to determine an appropriate action. However, as not all previous commands have relevance to the current command of changing the radio station, the NLU system 102 may determine which previous command, or commands, have the highest probability of relevance to the current command. Each command may be referred to as a sub-sequence. Depending on how the user utters a command, e.g. whether the user utters a command in one go or provides multiple utterances for the one command, a sub-sequence may comprise one utterance or multiple utterances. An utterance may be defined as a group of words or sounds detected in one time step, as determined by the microphone or by the NLU system 102, depending on the design of the driving companion.
[026] The user may be having a conversation with the driving companion or with one or more other occupants of the vehicle. The term "conversation" may be synonymous with the term "dialogue", particularly with respect to a conversation between the driving companion and the user and/or one or more other occupants. The microphone 104 may detect such a conversation. The conversation may comprise subjects such as music that the user likes. The portion of the conversation relating to the music that the user likes may be referred to as a sub-sequence; a portion of the conversation relating to another subject may be referred to as another sub-sequence. A change of sub-sequence may be referred to as a dialogue turn. A dialogue turn may also include a change from a user uttering a command to the user uttering another command. Each dialogue turn may comprise a user utterance and a companion-generated audio reply. Consequently, the conversation may comprise a plurality of sub-sequences. Referring to the example above, a potentially relevant sub-sequence may include the sub-sequence relating to the music that the user likes. The driving companion may draw a conclusion from previous sub-sequences to determine an appropriate action. The driving companion may determine which previous sub-sequence or sub-sequences have the highest probability of relevance to the current command or utterance.
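To make the notions of utterance, sub-sequence and dialogue history concrete, the following minimal Python sketch shows one plausible in-memory representation; the variable names and grouping are illustrative assumptions, not structures prescribed by this disclosure.

```python
# Illustrative (hypothetical) data shapes for a dialogue history that has
# been grouped into sub-sequences; each inner list holds related utterances.
dialogue_history = [
    # sub-sequence 1: restaurant recommendation
    ["recommend a restaurant at a convenient location", "something Italian please"],
    # sub-sequence 2: music the user likes (a later dialogue turn)
    ["what music do you like?", "I like rock music"],
]
current_utterance = "change the radio station"
# The disentanglement step described below picks the sub-sequence most
# relevant to current_utterance (here, plausibly sub-sequence 2).
```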
[027] The utterances detected by the microphone 104 may be converted into audio signals that are transmitted to the NLU system 102.
[028] The NLU system 102 is illustrated in Fig. 2 in accordance with an embodiment of the invention. The NLU system 102 comprises a disentanglement module 204 and a classification layer 212. The disentanglement module 204 is configured to receive a current utterance 201 detected by the microphone 104. The current utterance 201 may be in the form of audio signals received from the microphone 104. The NLU system 102 may further comprise an utterance encoder 202 configured to encode a current utterance 201 detected by the microphone 104 into a feature vector 203. The audio signals transmitted to the NLU system 102 may be received by the utterance encoder 202 and encoded into a feature vector 203.

[029] The vehicle or the driving companion may further comprise a camera 106 to detect gestures from the occupant. Additionally, or alternatively, the camera 106 may be configured to detect facial expressions of the occupant. Additionally, the camera may be configured to detect other movements of the occupant, such as movements indicating drowsiness, drunkenness, attentiveness and/or disease. The camera 106 may be positioned in a suitable location to detect such gestures, facial movements and/or body movements from at least one occupant of the vehicle 100. There may be one or more cameras 106 positioned at suitable locations in the vehicle 100. The image or video frame captured by the camera 106 may be converted into a feature vector by an image processing module (not shown in the figures). The image processing module may be part of the camera 106 or the driving companion. The image processing module may include an image encoder configured to encode the image or video frame into a feature vector. The utterance encoder 202 may be further configured to receive the feature vector encoded from an image of a gesture from the occupant obtained from the camera 106.
[030] The vehicle or the driving companion may further comprise a display 108 to display information for viewing by occupant(s) of the vehicle 100. The display 108 may be a human-machine interface. The display 108 may be a display screen or an in-vehicle display screen. A user may interact with the display 108 by touching or pressing an option displayed on the display 108. The display 108 may allow the user to select letters displayed on the display 108 to formulate a query. Alternatively, or additionally, a user may interact with the display 108 by pressing one or more buttons and/or turning one or more dials to select option(s) displayed on the display 108. The display 108 may be connected to input devices (not shown in the figures), such as a keyboard or touchpad, to enable the user to type in a query. The user may provide input via the display 108 for the driving companion to consider. The display 108 may be positioned in a suitable location for viewing by occupant(s) of the vehicle 100. There may be one or more displays 108 positioned at suitable locations in the vehicle 100. The input or text received by the display 108 may be converted into a feature vector by the display 108 or the driving companion. The display 108 or the driving companion may include a text encoder configured to encode the input or text received by the display 108 into a feature vector. The utterance encoder 202 may be further configured to receive the feature vector encoded from input received by the display 108.
[031] The encoders disclosed herein, e.g. the utterance encoder 202 or the text encoder or the image encoder, may comprise a machine learning model. Suitable machine learning models may include neural networks or recurrent neural networks. The encoders disclosed herein may reduce a set of data, such as the set of audio or text or image data, into representations, in the form of feature vectors, for feature detection.
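As one possible realization of such an encoder, the sketch below uses a small recurrent network in PyTorch to reduce a sequence of feature frames to a single feature vector; the GRU architecture and the dimensions are assumptions for illustration and are not specified by the disclosure.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Recurrent encoder reducing a sequence of feature frames (audio,
    text or image features) to one fixed-size feature vector. The GRU
    and the dimensions are illustrative assumptions."""
    def __init__(self, in_dim: int = 40, hidden: int = 16):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, in_dim); the final hidden state serves
        # as the utterance-level feature vector.
        _, h = self.rnn(frames)
        return h.squeeze(0)          # (batch, hidden)

vec = UtteranceEncoder()(torch.randn(1, 50, 40))   # 50 frames of 40-dim features
print(vec.shape)                                   # torch.Size([1, 16])
```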
[032] The utterance encoder 202 may be configured to combine more than one feature vector into a fused feature vector. The feature vectors derived from different modes, e.g. audio, text or images, may be input into the utterance encoder 202 by different input channels. The input into the utterance encoder 202 may therefore be multi-modal. The feature vectors from different input channels may be combined or fused into a unified feature representation or fused feature vector.
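A minimal sketch of such multi-modal fusion follows, assuming per-modality feature vectors are already available and using simple concatenation followed by a learned projection; both choices are illustrative assumptions, as the disclosure does not fix a fusion mechanism.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Fuses feature vectors from different input channels (e.g. audio,
    image, text) into one unified feature representation. Concatenation
    plus projection is an assumed, illustrative fusion scheme."""
    def __init__(self, audio_dim=32, image_dim=64, text_dim=48, out_dim=16):
        super().__init__()
        self.proj = nn.Linear(audio_dim + image_dim + text_dim, out_dim)

    def forward(self, audio_vec, image_vec, text_vec):
        fused_in = torch.cat([audio_vec, image_vec, text_vec], dim=-1)
        return torch.relu(self.proj(fused_in))   # fused feature vector

fused = FusionEncoder()(torch.randn(32), torch.randn(64), torch.randn(48))
print(fused.shape)   # torch.Size([16])
```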
[033] The feature vector 203 or, where the input is multi-modal, the fused feature vector 203 may be fed into the disentanglement module 204. The current utterance received by the disentanglement module 204 may thus be a current utterance feature vector 203, which may be a fused feature vector. The disentanglement module 204 may be configured to receive a current utterance feature vector 203 or a fused feature vector 203.

[034] The driving companion or the NLU system 102 may be a computing device that typically comprises, among other components, one or more processors connected to computer-readable storage media. The historical utterances may be stored in storage media of the NLU system 102 or storage media of the driving companion. The disentanglement module 204 may be configured to retrieve the historical utterances from the storage media. The historical utterances may be stored in a database in the storage media and may be grouped and maintained in the storage media as sub-sequences. As mentioned above, a sub-sequence may comprise one or more utterances related to each other. Historical utterances may be stored for up to a time period, depending on the design of the driving companion. Historical utterances may be stored for each conversation, or for up to a session, e.g. for each time the driving companion is turned on. Historical utterances may be stored in non-transitory computer-readable storage media and may persist in storage even after the driving companion is turned off.
[035] The disentanglement module 204 may be configured to determine, in step 206, a historical sub-sequence, out of the historical utterances or historical sub-sequences retrieved from the storage media, that has the highest probability of relevance to the current utterance 201.
[036] The determination step 206 may comprise: determining the conditional probability of relevance to the current utterance 201 of each historical sub-sequence comprised in the historical utterances, and determining the historical sub-sequence with the maximum conditional probability. The most relevant historical sub-sequence may be inferred from the dialogue history or historical utterances, based on a conditional probability distribution conditioned on the original dialogue flow and on the historical utterances concatenated to each sub-sequence. The conditional probability distribution may be defined over values from 1 to K, given the current utterance, the K sub-sequences and the original dialogue flow. The conditional probability may be determined using a softmax operator.
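One way the determination step 206 could be realized is sketched below: each of the K historical sub-sequence vectors is concatenated with the current utterance vector and scored, and a softmax turns the scores into a conditional probability distribution over 1..K. The linear scoring layer is a hypothetical stand-in; the disclosure only fixes the softmax operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceScorer(nn.Module):
    """Computes a conditional probability distribution over K historical
    sub-sequences, conditioned on the current utterance. The linear
    scoring layer is an illustrative assumption."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores [sub-sequence ; utterance]

    def forward(self, utterance: torch.Tensor, subseqs: torch.Tensor) -> torch.Tensor:
        # utterance: (dim,), subseqs: (K, dim)
        K = subseqs.size(0)
        pairs = torch.cat([subseqs, utterance.unsqueeze(0).expand(K, -1)], dim=-1)
        logits = self.score(pairs).squeeze(-1)   # (K,)
        return F.softmax(logits, dim=-1)         # P(relevant = k), k in 1..K

scorer = RelevanceScorer(dim=16)
p = scorer(torch.randn(16), torch.randn(4, 16))  # K = 4 sub-sequences
k_star = int(p.argmax())                         # maximum conditional probability
```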
[037] Obtaining the historical sub-sequence that has the maximum conditional probability of relevance to the current utterance 201 enables the discovery of the underlying relation between dialogue turns. The determination step 206 creates a task-oriented disentanglement of the dialogue flow to facilitate effective understanding of the dialogue. The reorganization of the dialogue flow, i.e. the non-sequential analysis of the dialogue flow, in a task-specific and data-driven manner reduces or eliminates the need for additional human annotation.
[038] In step 208, the disentanglement module 204 may be configured to merge the current utterance 201 with the historical sub-sequence determined to have the highest probability of relevance. The merging step 208 results in a context-dependent current utterance 209 and may therefore be referred to as a context encoding step. With the reorganized dialogue flow, the feature vector 203 of the current utterance or the fused feature vector 203 may be merged with the feature vector of the most relevant historical sub-sequence to generate a context-dependent feature vector 209. In contrast with the present disclosure, prior NLU systems or models encode dialogue contexts based on sequential input. Although attention or transformer models may provide non-sequential encoding processes, they are not able to discover the underlying relation between dialogue turns. Existing NLU systems or models are therefore unable to disentangle the dialogue flow to effectively facilitate understanding. The historical sub-sequence determined to have the highest probability of relevance to the current utterance 201 may provide the most relevant contextual information to the current utterance 201. The most relevant historical sub-sequence may therefore be the link between dialogue turns, thereby facilitating effective understanding of the dialogue and supporting the choice of appropriate dialogue actions by the driving companion.
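A minimal sketch of the merging step 208 follows, assuming both inputs are already feature vectors; concatenation followed by a learned projection is one plausible context-encoding scheme, not the one mandated by the disclosure.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Merges the current utterance vector with the most relevant
    historical sub-sequence vector into a context-dependent feature
    vector. The fusion layer is an illustrative assumption."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, utterance, relevant_subseq):
        merged = torch.cat([utterance, relevant_subseq], dim=-1)
        return torch.tanh(self.fuse(merged))   # context-dependent vector 209

ctx = ContextEncoder(16)(torch.randn(16), torch.randn(16))
print(ctx.shape)   # torch.Size([16])
```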
[039] In order to determine or generate proper responses, the NLU system 102 may be required to extract information from the dialogue, such as the user's intention, emotion, topic and constraints. In some scenarios, the context of the dialogue may be necessary to interpret such information accurately. For example, assume that Mary Lee and Mary Chan are both found in the user's address book. When the user asks the driving companion to "call Mary's cell phone" in the first turn, the context does not indicate which Mary is intended; hence there is ambiguity. However, if Mary Chan was mentioned in a historical utterance, such a historical utterance may be considered the most relevant to the current utterance of "call Mary's cell phone", depending, of course, on the other historical sub-sequences. Hence, the merging step 208 or context encoding may provide disambiguation to the dialogue.

[040] Information extracted from a dialogue may include intent. Intent refers to an overall purpose of the dialogue or the task to be performed by the driving companion, such as instructions like placing a phone call, questions like searching for a restaurant or learning more about a place, or responses. Information extracted from a dialogue may include a topic, such as calling someone to schedule a meeting, eating at a restaurant at a convenient location between meetings, or exploring a new town. Information extracted from a dialogue may include an emotion, which may supplement the determination of an appropriate action to be performed by the driving companion.
[041] The context-dependent current utterance 209 may be fed into classification layer 212 of the NLU system 102. The classification layer 212 of the NLU system 102 may be configured to classify the current utterance or the context-dependent current utterance 209 into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category. As the context-dependent current utterance 209 includes the most relevant historical sub-sequence, other historical utterances or historical sub-sequences may not be required to be fed into the classification layer 212. Thus, the classification layer 212 may advantageously require only the context-dependent current utterance 209 in order to determine a category. In other embodiments, the context-dependent current utterance 209 along with a plurality of historical utterances, e.g. the next most relevant historical utterances, may be fed into the classification layer 212. In such embodiments, the classification layer 212 may be configured to classify the context-dependent current utterance 209 and a plurality of historical utterances into a predetermined category.
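As a sketch of the classification layer 212, a linear head followed by a softmax over the predetermined categories would suffice; the head, the category count and the hypothetical intention labels are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical intention categories; the real label set depends on the
# framework the driving companion uses to choose actions.
CATEGORIES = ["play music", "place call", "find food"]
head = nn.Linear(16, len(CATEGORIES))        # dim-16 context vector in

ctx = torch.randn(16)                        # context-dependent utterance 209
probs = torch.softmax(head(ctx), dim=-1)     # distribution over categories
print(CATEGORIES[int(probs.argmax())])       # most probable category
```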
[042] Predetermined categories may be selected to assist in interpreting a dialogue or conversation, or to extract constituents of a dialogue or conversation. The predetermined categories may assist in determining an appropriate action for the driving companion, given the context of the dialogue or conversation. The predetermined categories may be selected from intention, emotion or topic. The predetermined categories may be selected depending on a framework for determining an appropriate action to be performed by the driving companion. A category may be assigned to each current utterance 201. If multiple categories are assigned, the NLU system 102 may combine the multiple outputs from the classification layer 212 into a feature vector 213 for input into the framework. The output 213 of the classification layer 212 or the NLU system 102 may be fed into the framework for the driving companion to determine and/or perform an appropriate action. The framework may be any suitable one, for example one based on predefined rules or one learned from data using an action model; a rule-based variant is sketched below. For predefined rules, rules may be defined based on the categories, for example intention, topic and emotion. For action models, an action model may be a machine learning model that learns from data what an optimal action would be, given the input from the NLU system 102.
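For the predefined-rules variant of the framework, a simple lookup from classified categories to actions might look as follows; the rule table, slot values and action names are hypothetical and merely anticipate the slot example in the next paragraph.

```python
# Hypothetical rule table mapping (intention, topic, emotion) slots to an
# action; entries and action names are illustrative, not from the disclosure.
RULES = {
    ("play music", "entertainment", "happy"): "play_happy_song",
    ("place call", "communication", None):    "open_phonebook",
    ("find food",  "navigation",    None):    "search_restaurants_on_route",
}

def choose_action(intention, topic, emotion=None):
    # Fall back to an emotion-agnostic rule if no exact match exists.
    return (RULES.get((intention, topic, emotion))
            or RULES.get((intention, topic, None), "ask_for_clarification"))

print(choose_action("play music", "entertainment", "happy"))  # play_happy_song
```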
[043] NLU systems typically fill in slots predefined for a certain task. Slots may allow many types of data to be filled in, such as strings, numbers, or pointers to other slots. The predetermined categories may each have a slot which will be filled in after the NLU system 102 classifies the context-dependent current utterance 209. For example, a query to call a person may specify that slots for intent and content be filled in. In another example, a query to play music may specify that slots for intent, content and emotion be filled in. In this example, where the slots are filled in as follows: "intention=play music", "topic=entertainment", and "emotion=happy", the driving companion will determine and/or perform the action of playing a happy song for the user.

[044] The classification layer 212 may output a probability distribution over the predetermined categories (e.g. intentions, emotions or topics) in order to select a predetermined category that fits the context-dependent current utterance 209. The classifying step of the classification layer 212 may comprise: determining a probability distribution of each of the predetermined categories; and determining the most probable predetermined category, based on the probability distribution, for the driving companion to determine an appropriate action responsive to the classified category. The classified category or categories may be converted or encoded into a feature vector 213. The feature vector 213 may be the output of the NLU system 102 for the driving companion to determine and/or perform an appropriate action responsive to the classified category. The output 213 of the classification layer 212 or the NLU system 102 may be fed into downstream module(s), such as the framework disclosed above, in order to determine and/or perform an appropriate action for the driving companion.

[045] In an embodiment, there is provided a method of training a natural language understanding system of a driving companion to determine appropriate actions. The method comprises: determining a historical sub-sequence comprised in a database of historical utterances that has a highest probability of relevance to a test utterance; merging the determined historical sub-sequence with the test utterance, thereby providing a context-dependent test utterance; and classifying the context-dependent test utterance into a predetermined category to optimize the action that the driving companion has to determine responsive to the classified category.
[046] The disclosed method may be performed by the driving companion or the NLU system disclosed herein. The disclosed method may be performed by other computing devices, e.g. in a computer lab. The database of historical utterances may be stored in a computer-readable storage medium, for example in the storage media of the computing device the method is performed on.
[047] Test utterances may be obtained from any suitable source, such as from databases obtained commercially or from testing in a vehicle. Test utterances obtained during testing may be detected as described herein. The test utterances may be stored in the storage media together with the database of historical utterances, and may be included in the database of historical utterances. The test utterances may be converted into audio signals that are input into the NLU system. The NLU system may be one described herein, such as the NLU system 102 illustrated in Fig. 2. The test utterance 201 may be encoded into a feature vector 203 by the NLU system 102. Other test data may be obtained from any suitable source and may be provided as described above, e.g. audio, image, video or text data, to add to the robustness of the disclosed method and to supplement understanding of the test utterance. The test utterance 201 and the other test data, if included, may be combined into a fused feature vector. The test utterance feature vector 203 or the fused feature vector 203 may be fed into a disentanglement module 204' as illustrated in Fig. 3 in accordance with an embodiment of the invention.

[048] With respect to a test utterance, a historical sub-sequence comprised in the database of historical utterances that is most relevant to the test utterance may be determined. The determination step 206' may be performed by the disentanglement module 204'. The determination step 206' may comprise: determining the conditional probability of relevance to the test utterance 201 of each historical sub-sequence comprised in the historical utterances, determining a distribution of the conditional probabilities of the historical sub-sequences, and determining the most relevant historical sub-sequence based on the distribution.

[049] The conditional probability may be as disclosed herein. The conditional probability distribution may be defined over values from 1 to K, given the test utterance, the K sub-sequences and the original flow of the test utterances. The conditional probability may be determined using a softmax operator.

[050] As part of training the NLU system 102, the method or determination step 206' may comprise sampling the conditional probability of relevance of each historical sub-sequence comprised in the historical utterances to the test utterance. The conditional probability may be sampled before proceeding to process the next test utterance. Sampling may be performed by the step of determining a distribution of the conditional probabilities of the historical sub-sequences. Advantageously, the distribution may provide an indication of whether the determined conditional probabilities are correct or are not anomalies. Further advantageously, backpropagation through the flow of test utterances may not be hindered by the sampling of the relevance probabilities of the historical sub-sequences. The distribution may be a Gumbel distribution, and sampling may comprise computing a Gumbel approximation. The sampling process may therefore be turned into a differentiable approximation that does not hinder backpropagation. The most relevant historical sub-sequence may then be determined based on the distribution. The end-to-end training of the NLU system 102, including the disentanglement module 204 and the classification layer 212, may be enabled by re-parameterizing the distribution over the discrete variables of historical utterances or historical sub-sequences, thereby disentangling or reorganizing the flow of the utterances.

[051] The most relevant historical sub-sequence may be merged in step 208, as described herein, with the test utterance, i.e. the test utterance may be encoded with a context feature vector, thereby providing a context-dependent test utterance.
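A minimal PyTorch sketch of the differentiable sampling described in paragraph [050] follows, using the Gumbel-softmax estimator available as torch.nn.functional.gumbel_softmax; the logits and dimensions are illustrative values, not from the disclosure.

```python
import torch
import torch.nn.functional as F

# Relevance logits for K = 4 historical sub-sequences (illustrative values).
logits = torch.randn(4, requires_grad=True)

# Gumbel-softmax: one-hot sample in the forward pass (hard=True), smooth
# differentiable approximation in the backward pass, so backpropagation
# through the discrete selection is not hindered.
weights = F.gumbel_softmax(logits, tau=0.5, hard=True)

subseqs = torch.randn(4, 16)       # encoded historical sub-sequences
selected = weights @ subseqs       # differentiable pick of one sub-sequence
selected.sum().backward()          # gradients reach the relevance logits
print(logits.grad is not None)     # True
```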
[052] The context-dependent test utterance may be fed into classification layer 212 of the NLU system 102 as described herein. The classification layer 212 may be configured to classify the test utterance or the context-dependent test utterance into a predetermined category as described herein. For example, the predetermined categories may be selected from intention, emotion or topic.
[053] The classification layer 212 may output a probability distribution of the predetermined categories in order to select a predetermined category that fits the context-dependent test utterance. The classifying step of the classification layer 212 may comprise: determining a probability distribution of each of the predetermined categories; and determining the most probable predetermined category, based on the probability distribution, to optimize the action that the driving companion has to determine responsive to the classified category.
[054] In some implementations, the NLU system 102 may comprise a machine learning model, such as a model based on modern neural networks. The disentanglement module 204' and the classification layer 212 may each comprise a machine learning model, such as a model based on modern neural networks. The model may be based on modern neural networks with all components of the NLU system 102 being differentiable in order to compute gradients. The determining of the conditional probability of relevance with respect to context to the test utterance 201 of each historical sub-sequence comprised in the historical utterances, and the determining of the distribution of the conditional probabilities of the historical sub-sequences, are facilitated by the softmax operator and the Gumbel distribution. Together, this Gumbel-softmax trick in the determination step 206' enables the updating of each component of the NLU system 102 by backpropagation. Specifically, in a forward pass through all components of the NLU system 102, the model yields a prediction on a label of interest. An error is then computed, which represents the difference between the prediction and the true label (human annotation). This error is usually measured by a loss function which is differentiable across the parameters of all the components. Based on the gradients, the model then updates each parameter of each component to minimize this error.
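Putting the pieces together, one end-to-end training step could look like the sketch below, reusing the hypothetical RelevanceScorer and ContextEncoder classes from the earlier sketches; the optimizer, loss and dimensions are illustrative assumptions rather than the disclosed training procedure.

```python
import torch
import torch.nn.functional as F

dim, K, n_categories = 16, 4, 3
scorer = RelevanceScorer(dim)               # hypothetical, defined above
context = ContextEncoder(dim)               # hypothetical, defined above
classifier = torch.nn.Linear(dim, n_categories)
params = (list(scorer.parameters()) + list(context.parameters())
          + list(classifier.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

u, subseqs = torch.randn(dim), torch.randn(K, dim)   # test utterance + history
label = torch.tensor([1])                            # human-annotated category

p = scorer(u, subseqs)                               # relevance distribution (K,)
weights = F.gumbel_softmax(p.log(), tau=0.5, hard=True)  # differentiable pick
ctx = context(u, weights @ subseqs)                  # context-dependent vector
loss = F.cross_entropy(classifier(ctx).unsqueeze(0), label)
opt.zero_grad(); loss.backward(); opt.step()         # updates all components
```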
[055] The output of the NLU system 102 may be fed into downstream module(s) in order to determine and/or optimize the action that the driving companion has to perform.

[056] The NLU system 102 trained according to the disclosed method may be incorporated into the driving companion.

Claims

1. A driving companion comprising: a microphone to detect utterances from at least one occupant of a vehicle; and a natural language understanding system comprising: a disentanglement module configured to: receive a current utterance detected by the microphone; determine a historical sub-sequence comprised in historical utterances that has a highest probability of relevance with respect to context to the current utterance; merge the determined historical sub-sequence with the current utterance, thereby providing a context-dependent current utterance; and a classification layer configured to classify the context-dependent current utterance into a predetermined category for the driving companion to determine an appropriate action responsive to the classified category.
2. The driving companion of claim 1, wherein the determination step of the disentanglement module comprises: determining conditional probability of relevance with respect to context to the current utterance of each historical sub-sequence comprised in the historical utterances, determining the historical sub-sequence with a maximum conditional probability with respect to context.
3. The driving companion of claim 2, wherein the conditional probability is determined using a softmax operator.
4. The driving companion of any preceding claim, wherein the predetermined categories are selected from intention, emotion or topic.
5. The driving companion of any preceding claim, wherein the classifying step of the classification layer comprises: determining a probability distribution of each of the predetermined categories; determining the most probable predetermined category, based on the probability distribution, for the driving companion to determine an appropriate action responsive to the classified category.
6. The driving companion of any preceding claim, wherein the natural language understanding system further comprises: an utterance encoder configured to encode a current utterance detected by the microphone into a feature vector.
7. The driving companion of claim 6, wherein the utterance encoder is further configured to receive a feature vector encoded from an image of a gesture from the occupant obtained from a camera.
8. The driving companion of claim 6 or 7, wherein the utterance encoder is further configured to receive a feature vector encoded from input received by an in-vehicle display screen.
9. The driving companion of any one of claims 6-8, wherein the utterance encoder is configured to combine more than one feature vector into a fused feature vector.
10. The driving companion of claim 6 or 9, wherein the disentanglement module is configured to receive the current utterance feature vector or the fused feature vector.
11. A method of training a natural language understanding system of a driving companion to determine appropriate actions, the method comprising: determining a historical sub-sequence comprised in a database of historical utterances that has a highest probability of relevance with respect to context to a test utterance; merging the determined historical sub-sequence with the test utterance, thereby providing a context-dependent test utterance; classifying the context-dependent test utterance into a predetermined category to optimize the action that the driving companion has to determine responsive to the classified category.
12. The method of claim 11, wherein the determination step comprises: determining conditional probability of relevance with respect to context to the test utterance of each historical sub-sequence comprised in the historical utterances, determining a distribution of the conditional probabilities of the historical sub-sequences, determining the most relevant historical sub-sequence with respect to context based on the distribution.
13. The method of claim 12, wherein the conditional probability is determined using a softmax operator.
14. The method of any one of claims 12-13, wherein the distribution is a Gumbel distribution.
15. The method of any one of claims 11-14, wherein the predetermined categories are selected from intention, emotion or topic.
16. The method of any one of claims 11-15, wherein the classifying step comprises: determining a probability distribution of each of the predetermined categories; determining the most probable predetermined category, based on the probability distribution, to optimize the action that the driving companion has to determine responsive to the classified category.