WO2020159693A1 - Conversational speech agent - Google Patents

Conversational speech agent

Info

Publication number
WO2020159693A1
Authority
WO
WIPO (PCT)
Prior art keywords
response
audio
user model
caller
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2020/013160
Other languages
English (en)
French (fr)
Inventor
Anthony Scodary
Alex Barron
David Cohen
Evan MAC MILLAN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gridspace Inc Korea
Gridspace Inc
Original Assignee
Gridspace Inc Korea
Gridspace Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gridspace Inc Korea, Gridspace Inc
Priority to JP2021544347A (patent JP7313455B2)
Priority to EP20748597.0A (patent EP3918508A4)
Publication of WO2020159693A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/091 Active learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/487 Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M3/493 Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
    • H04M3/527 Centralised call answering arrangements not requiring operator intervention
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/39 Electronic components, circuits, software, systems or apparatus used in telephone systems using speech synthesis

Definitions

  • FIG. 1 illustrates a system 100 in accordance with one embodiment.
  • FIG. 2 illustrates a method 200 in accordance with one embodiment.
  • FIG. 3 illustrates a system 300 in accordance with one embodiment.
  • FIG. 4 illustrates a method 400 in accordance with one embodiment.
  • FIG. 5 illustrates a process 500 in accordance with one embodiment.
  • FIG. 6 illustrates a supervised training phase 600 in accordance with one embodiment.
  • FIG. 7 illustrates a supervised training phase 700 in accordance with one embodiment.
  • FIG. 8 illustrates a supervised training phase 800 in accordance with one embodiment.
  • FIG. 9 illustrates a convolutional neural network 900 in accordance with one embodiment.
  • FIG. 10 illustrates convolutional neural network layers 1000 in accordance with one embodiment.
  • FIG. 11 illustrates a VGG net 1100 in accordance with one embodiment.
  • FIG. 12 illustrates a convolution layer filtering 1200 in accordance with one embodiment.
  • FIG. 13 illustrates a pooling layer function 1300 in accordance with one embodiment.
  • FIG. 14 illustrates a diagram 1400 in accordance with one embodiment.
  • a partially automated member services representative (MSR) bot may hand off seamlessly to a human MSR when he or she is available.
  • the MSR bot may include the capability to summarize the automated portion of a call and tools to monitor and review bot behaviors.
  • the recorded MSR calls may be utilized to train the speech synthesizer engine to sound natural and the conversational agent to accurately react to caller requests.
  • the MSR bot may be trained using un-sanitized conversational speech rather than clean, performed speech.
  • the MSR bot may be able to model tone, prosody, and dialect of individual MSRs.
  • the MSR bot may be trained using speech recognition transcripts rather than human transcripts.
  • CNNs are particularly well suited to classifying features in data sets modelled in two or three dimensions. This makes CNNs popular for image classification, because images can be represented in computer memories in three dimensions (two dimensions for width and height, and a third dimension for pixel features like color components and intensity).
  • a color JPEG image of size 480 x 480 pixels can be modelled in computer memory using an array that is 480 x 480 x 3, where each of the values of the third dimension is a red, green, or blue color component intensity for the pixel ranging from 0 to 255.
  • Image classification is the task of taking an input image and outputting a class (a cat, dog, etc.) or a probability of classes that best describes the image.
  • CNNs input the data set, pass it through a series of convolutional transformations, nonlinear activation functions (e.g., RELU), and pooling operations (downsampling, e.g., maxpool), and an output layer (e.g., softmax) to generate the classifications.
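The CNN stages named above (convolution, a nonlinear activation such as RELU, a pooling operation such as maxpool, and a softmax output layer) can be illustrated with a minimal pure-Python sketch. It is not the network of the embodiments: the single-channel image, the kernel values, and the `weights` matrix of the final linear scoring step are all hypothetical and chosen only to show how data flows through the stages.

```python
import math


def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Nonlinear activation: negative values are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def maxpool2x2(fmap):
    """Downsample by taking the maximum of each non-overlapping 2x2 window."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image, kernels, weights):
    """conv -> ReLU -> maxpool per kernel, then linear scores -> softmax.

    `weights` holds one row of (hypothetical) coefficients per class.
    """
    features = []
    for kernel in kernels:
        pooled = maxpool2x2(relu(conv2d(image, kernel)))
        features.extend(v for row in pooled for v in row)
    scores = [sum(w * f for w, f in zip(ws, features)) for ws in weights]
    return softmax(scores)
```

A real color image such as the 480 x 480 x 3 example would add a channel dimension to both the image and each kernel; the single-channel case is used here only to keep the sketch short.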
  • a method of operating a speech synthesizing conversation agent involves operating an audio interface to receive a caller audio signal during a call session.
  • the method generates an audio transcript comprising a sentiment score from the caller audio signal through operation of a sentiment analysis engine configured by a sentiment model.
  • the method communicates the audio transcript to a user interface switch configured to receive inputs from a user model.
  • the method communicates a response control from the user interface switch to a speech synthesizer engine trained with historical conversation data from the user model.
  • the method then operates the speech synthesizer engine.
  • the speech synthesizer engine generates a response signal for the caller audio signal and the audio transcript through operation of a response logic engine configured by historical conversation data.
  • the speech synthesizer engine generates a synthesized audio response comprising an ambient signal and a synthesized user model response from the response signal through operation of a speech synthesis model configured by the historical conversation data.
  • the method then communicates the synthesized audio response responsive to the caller audio signal through the audio interface during the call session.
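The method steps above can be read as a simple control flow. The following is a minimal sketch of that flow, assuming each engine is supplied as a plain callable; the names `handle_caller_audio`, `transcribe_with_sentiment`, `user_takes_over`, `response_logic`, and `synthesize` are hypothetical stand-ins, and strings stand in for audio signals.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AudioTranscript:
    text: str
    sentiment_score: float  # produced by the sentiment analysis engine


@dataclass
class SynthesizedAudioResponse:
    speech: str   # stand-in for the synthesized user model response
    ambient: str  # stand-in for the ambient signal


def handle_caller_audio(
    caller_audio: bytes,
    transcribe_with_sentiment: Callable[[bytes], AudioTranscript],
    user_takes_over: Callable[[AudioTranscript], bool],
    response_logic: Callable[[bytes, AudioTranscript], str],
    synthesize: Callable[[str], SynthesizedAudioResponse],
) -> Optional[SynthesizedAudioResponse]:
    """Hypothetical orchestration of one turn of a call session."""
    # 1. Sentiment analysis engine: caller audio -> transcript with sentiment score.
    transcript = transcribe_with_sentiment(caller_audio)
    # 2. User interface switch: the user model may choose to answer personally.
    if user_takes_over(transcript):
        return None
    # 3. Response logic engine: caller audio + transcript -> response signal.
    response_signal = response_logic(caller_audio, transcript)
    # 4. Speech synthesis model: response signal -> synthesized audio response.
    return synthesize(response_signal)
```

Returning `None` when the user model takes over is a design choice of this sketch only; it signals that no synthesized audio response is produced for that turn.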
  • the method of operating the speech synthesizing conversation agent may involve operating the speech synthesis model to generate a synthesized speech as the synthesized user model response for the caller audio signal, in response to receiving a text to speech response from the response logic engine.
  • the method of operating the speech synthesizing conversation agent may involve operating the speech synthesis model to generate listening response cues as the synthesized user model response for the caller audio signal, in response to receiving a non-verbal response from the response logic engine.
  • the method of operating the speech synthesizing conversation agent may involve receiving a user model input through the user interface switch from the user model, in response to the receiving the caller audio signal through the audio interface.
  • the method may then store the audio transcript, the caller audio signal, and the user model audio response as historical conversation data in a controlled memory data structure.
  • the method operates the speech synthesis model to generate the ambient signal from the background noise of user model responses in the historical conversation data.
  • the method of operating the speech synthesizing conversation agent may involve operating the speech synthesizer engine during a supervised training phase to receive call summary transcripts for a plurality of call sessions with the user model.
  • the call summary transcripts may comprise identified entities, call session intent, a sentiment score, and user model responses through the training interface of the response logic engine.
  • the method may identify a response state and generate response options with certainty scores through operation of the response logic engine, in response to receiving a response audit from the user model.
  • the method may receive a feedback control from the user model responsive to the response state, the response options, and the certainty scores.
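The response audit described above can be pictured as ranking candidate responses by certainty and then adjusting them after feedback. The source does not state how certainty scores are computed; the sketch below assumes a softmax over hypothetical raw scores, and `apply_feedback` is an illustrative stand-in for the effect of a feedback control, not the training procedure of the embodiments.

```python
import math


def response_options(raw_scores):
    """Return (option, certainty) pairs sorted by descending certainty.

    Certainty is assumed here to be a softmax over raw scores.
    """
    m = max(raw_scores.values())
    exps = {k: math.exp(v - m) for k, v in raw_scores.items()}
    total = sum(exps.values())
    return sorted(
        ((k, e / total) for k, e in exps.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )


def apply_feedback(raw_scores, chosen, step=1.0):
    """Nudge the raw score of the option the reviewer selected (hypothetical)."""
    return {k: (v + step if k == chosen else v) for k, v in raw_scores.items()}
```

For example, a reviewer who sees "apologize" ranked below "ask for account number" could select "apologize", after which its certainty score rises relative to the other options.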
  • a non-transitory computer-readable storage medium including instructions that, when executed by a computer, cause the computer to operate an audio interface to receive a caller audio signal during a call session.
  • the computer may generate an audio transcript comprising a sentiment score from the caller audio signal through operation of a sentiment analysis engine.
  • the computer may communicate the audio transcript to a user interface switch configured to receive inputs from a user model.
  • the computer may communicate a response control from the user interface switch to a speech synthesizer engine trained with historical conversation data from the user model.
  • the computer may operate the speech synthesizer engine.
  • the speech synthesizer engine may generate a response signal for the caller audio signal and the audio transcript through operation of a response logic engine configured by historical conversation data.
  • the speech synthesizer engine may generate a synthesized audio response comprising an ambient signal and a synthesized user model response from the response signal through operation of a speech synthesis model configured by the historical conversation data.
  • the computer may then communicate the synthesized audio response responsive to the caller audio signal through the audio interface during the call session.
  • the instructions further configure the computer to operate the speech synthesis model to generate a synthesized speech as the synthesized user model response for the caller audio signal, in response to receiving a text to speech response from the response logic engine.
  • the instructions may further configure the computer to operate the speech synthesis model to generate listening response cues as the synthesized user model response for the caller audio signal, in response to receiving a non-verbal response from the response logic engine.
  • the instructions may further configure the computer to receive a user model input through the user interface switch from the user model, in response to the receiving the caller audio signal through the audio interface.
  • the computer may communicate a user model audio response, responsive to the caller audio signal, to the audio interface, the user model audio response comprising response audio and background noise.
  • the computer may then store the audio transcript, the caller audio signal, and the user model audio response as historical conversation data in a controlled memory data structure.
  • the instructions may configure the computer to operate the speech synthesizer engine during a supervised training phase.
  • the instructions may configure the computer to receive call summary transcripts for a plurality of call sessions with the user model, the call summary transcripts comprising identified entities, call session intent, the sentiment score, and user model responses through the training interface of the response logic engine.
  • the computer may identify a response state and generate response options with certainty scores through operation of the response logic engine, in response to receiving a response audit from the user model.
  • the computer may receive a feedback control from the user model responsive to the response state, the response options, and the certainty scores.
  • a computing apparatus may include a processor and a memory storing instructions that, when executed by the processor, configure the apparatus to operate an audio interface to receive a caller audio signal during a call session.
  • the apparatus may generate an audio transcript comprising a sentiment score from the caller audio signal through operation of a sentiment analysis engine.
  • the apparatus may communicate the audio transcript to a user interface switch configured to receive inputs from a user model.
  • the apparatus may communicate a response control from the user interface switch to a speech synthesizer engine trained with historical conversation data from the user model.
  • the apparatus may operate the speech synthesizer engine.
  • the speech synthesizer engine may generate a response signal for the caller audio signal and the audio transcript through operation of a response logic engine configured by the historical conversation data.
  • the speech synthesizer engine may generate a synthesized audio response comprising an ambient signal and a synthesized user model response from the response signal through operation of a speech synthesis model configured by the historical conversation data.
  • the apparatus may communicate the synthesized audio response responsive to the caller audio signal through the audio interface during the call session.
  • the instructions may further configure the apparatus to operate the speech synthesis model to generate a synthesized speech as the synthesized user model response for the caller audio signal, in response to receiving a text to speech response from the response logic engine.
  • the instructions may further configure the apparatus to operate the speech synthesis model to generate listening response cues as the synthesized user model response for the caller audio signal, in response to receiving a non-verbal response from the response logic engine.
  • the instructions may further configure the apparatus to receive a user model input through the user interface switch from the user model, in response to the receiving the caller audio signal through the audio interface.
  • the apparatus may communicate a user model audio response, responsive to the caller audio signal, to the audio interface, the user model audio response comprising response audio and background noise.
  • the apparatus may store the audio transcript, the caller audio signal, and the user model audio response as the historical conversation data in a controlled memory data structure.
  • the instructions may further configure the apparatus to operate the speech synthesis model to generate the ambient signal from the background noise of user model responses in the historical conversation data.
  • the instructions may further configure the apparatus to operate the speech synthesizer engine during a supervised training phase.
  • the apparatus may receive call summary transcripts for a plurality of call sessions with the user model, the call summary transcripts comprising identified entities, call session intent, the sentiment score, and user model responses through the training interface of the response logic engine.
  • the apparatus may identify a response state and generate response options with certainty scores through operation of the response logic engine, in response to receiving a response audit from the user model.
  • the apparatus may receive a feedback control from the user model responsive to the response state, the response options, and the certainty scores.
  • an example of an audio transcript generated by the sentiment analysis engine may appear as follows:
  • "connection_id": "69ffe872b5dbf929", "start_timestamp_ms": 1531345706000,
  • the audio transcript shows timings for the duration of each spoken word and the offset between words. Furthermore, the audio transcript illustrates a sentiment index score of 0.9.
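Because only a fragment of the example transcript survives, the following sketch shows one possible shape for such a record. The `connection_id` and `start_timestamp_ms` values and the sentiment score of 0.9 come from the text; every other field name (`sentiment_score`, `words`, `start_ms`, `duration_ms`) and the word timings are hypothetical.

```python
# Hypothetical transcript record; only connection_id, start_timestamp_ms,
# and the 0.9 sentiment value are taken from the source text.
transcript = {
    "connection_id": "69ffe872b5dbf929",
    "start_timestamp_ms": 1531345706000,
    "sentiment_score": 0.9,
    "words": [
        {"word": "hello", "start_ms": 0, "duration_ms": 400},
        {"word": "there", "start_ms": 550, "duration_ms": 350},
        {"word": "thanks", "start_ms": 1000, "duration_ms": 500},
    ],
}


def pauses_ms(words):
    """Offset between the end of each word and the start of the next one."""
    return [
        nxt["start_ms"] - (cur["start_ms"] + cur["duration_ms"])
        for cur, nxt in zip(words, words[1:])
    ]


def absolute_word_times(record):
    """Absolute start time of each word in epoch milliseconds."""
    base = record["start_timestamp_ms"]
    return [(w["word"], base + w["start_ms"]) for w in record["words"]]
```

With the hypothetical timings above, `pauses_ms(transcript["words"])` gives the gaps between consecutive words, which is the kind of timing information the sentiment model is described as consuming.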
  • the MSR bot may be able to perform a style transfer between multiple MSRs in a single model.
  • the MSR bot may be able to implicitly model sounds such as line noise, non-speech vocalizations, and a realistic call center and telephony acoustic environment.
  • A caller's emotions and sentiment during a call session may be utilized to condition text to speech (TTS) models to add realistic dynamics to the TTS prosody (for example, the synthesized voice may modulate its tone in response to anger or concerns detected during the call session).
  • the conversational agent or MSR bot
  • the conversational agent may be designed to only handle portions of a call and may summarize the caller’s intent and any resolution achieved in the automated portion of the call.
  • the conversation agent may produce a short summary artifact that would assist a human MSR in a handoff.
  • the summarization system may include in part or in full a dashboard that provides the transcript of the preceding bot conversation.
  • This dashboard may be a part of tooling utilized to train and interpret the conversational agent. Design of the bot management dashboard may allow an opportunity to explain why the bot chose a certain response and provide a mechanism to correct behavior. Labelling special tokens like proper nouns may also allow for structured learning. Additionally, this dashboard may provide a reasonable summary of the conversation up to the hand off point in the call session.
  • a synthetic summarization may be built up on a compound summary from smaller conclusions about the call. This compound summary may work as the basis of the call summarizations.
  • the TTS portion of the conversational agent may have reasonably high data
  • the conversational agent may not rely on audio, with the exception of modelling emotion and sentiment as a conditioning signal to the TTS.
  • any generative model that is included in the architecture may be trained on the largest possible corpus of call transcripts.
  • the synthetic summary portion of the conversational agent may rely on small validation datasets; however, if classifiers are trained on calls to detect intent, then the dataset may include several thousand call transcripts per classifier.
  • training data from previous conversations may be searched using search terms from a caller and search phrase modifiers as follows:
  • a single tilde may be utilized to match similar forms like plurals or conjugates. For instance:
  • Two tildes may be utilized to match synonymous words. For instance:
  • Three tildes may be utilized to match related phrasings. For instance:
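The original examples for the tilde modifiers did not survive extraction, so the exact syntax is unknown. The sketch below assumes the tildes are written as a prefix on the search term (for example `~~card`) and uses small hypothetical lookup tables for word forms and synonyms; matching related phrasings (three tildes) would need a semantic model and is deliberately left unimplemented.

```python
from enum import IntEnum


class MatchLevel(IntEnum):
    EXACT = 0
    SIMILAR_FORMS = 1      # one tilde: plurals, conjugates
    SYNONYMS = 2           # two tildes: synonymous words
    RELATED_PHRASINGS = 3  # three tildes: related phrasings


def parse_term(term):
    """Split a term into (word, MatchLevel), assuming tildes are a prefix."""
    word = term.lstrip("~")
    tildes = len(term) - len(word)
    if tildes > 3:
        raise ValueError("at most three tildes are described in the text")
    return word, MatchLevel(tildes)


def expand(term, forms, synonyms):
    """Return the set of literals a term may match at its level.

    `forms` and `synonyms` are hypothetical lookup tables.
    """
    word, level = parse_term(term)
    candidates = {word}
    if level >= MatchLevel.SIMILAR_FORMS:
        candidates |= set(forms.get(word, ()))
    if level >= MatchLevel.SYNONYMS:
        candidates |= set(synonyms.get(word, ()))
    return candidates
```

Treating each level as a superset of the previous one is an assumption of this sketch; the text describes what each modifier matches but not whether the levels nest.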
  • Phrase Operators: To search within one speech segment for two things, a user may combine search terms with the operators 'near', 'or', or 'then'. For example:
  • a user may combine search terms with the operators 'and', 'or', or 'later'.
  • the and operator looks for a conversation that contains both literals.
  • the new card should arrive in one to two weeks.
  • the or operator looks for a conversation that contains either literal or both. Its use is determined by context relative to the phrase scanner.
  • the new card should arrive in five days.
  • the later operator looks for a conversation that contains both literals in order.
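The three conversation-level operators can be sketched over a conversation represented as a list of utterance strings. Case-insensitive substring matching and the function names are assumptions of this sketch, as is the reading of "in order" as "the second literal occurs in a later utterance than the first"; the source does not say whether both literals may fall in the same utterance.

```python
def contains(conversation, literal):
    """True if any utterance contains the literal (case-insensitive)."""
    needle = literal.lower()
    return any(needle in utterance.lower() for utterance in conversation)


def op_and(conversation, a, b):
    """Both literals appear somewhere in the conversation."""
    return contains(conversation, a) and contains(conversation, b)


def op_or(conversation, a, b):
    """Either literal, or both, appears in the conversation."""
    return contains(conversation, a) or contains(conversation, b)


def op_later(conversation, a, b):
    """Literal `a` appears, and `b` appears in some later utterance."""
    for index, utterance in enumerate(conversation):
        if a.lower() in utterance.lower():
            # Checking after the first occurrence of `a` is sufficient.
            return contains(conversation[index + 1:], b)
    return False
```

Given a conversation such as `["I lost my card", "The new card should arrive in five days."]`, `op_later` with "lost my card" followed by "new card" holds, while the reversed order does not.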
  • Additional modifiers may be placed to the left of a segment to restrict it to a certain property or modify it in some other way.
  • Compound Queries: Much more complex queries may be built using parentheses. Inner scanners are evaluated and then combined with outer scanners.
  • modifiers may be stacked too (although order can affect meaning).
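The inner-then-outer evaluation of parenthesized queries can be shown with a small recursive parser and evaluator. The concrete query syntax is not given in the text; this sketch assumes double-quoted literals joined by `and` / `or`, applies operators left to right, and ignores modifiers entirely.

```python
import re

# Tokens: parentheses, double-quoted literals, or bare operator words.
TOKEN = re.compile(r'\(|\)|"[^"]*"|\w+')


def parse(text):
    """Parse an assumed query syntax into nested (op, left, right) tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()  # inner scanner is parsed first
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if not tok.startswith('"'):
            raise ValueError("expected a quoted literal, got " + tok)
        return tok.strip('"')

    def expr():
        nonlocal pos
        node = atom()
        while pos < len(tokens) and tokens[pos] in ("and", "or"):
            op = tokens[pos]
            pos += 1
            node = (op, node, atom())
        return node

    return expr()


def evaluate(query, conversation):
    """Evaluate inner nodes first, then combine them with the outer operator."""
    if isinstance(query, str):
        return any(query.lower() in u.lower() for u in conversation)
    op, left, right = query
    lhs, rhs = evaluate(left, conversation), evaluate(right, conversation)
    if op == "and":
        return lhs and rhs
    if op == "or":
        return lhs or rhs
    raise ValueError("unknown operator: " + op)
```

For instance, `parse('("lost card" or "stolen card") and "new card"')` yields a tree whose inner `or` node is evaluated before being combined by the outer `and`.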
  • Extractors are special phrases in curly braces "{ }" that represent a concept. By default, all extractors are treated as if they have two tildes, so the tildes may be omitted.
  • Some extractors currently in the scanner include:
  • a user may put time constraints on scanners.
  • the start and end of the call may be specified with special extractors for placing time constraints against the start or end of the call.
  • {end} can indicate the end of the call.
  • a user may also place constraints on call session metadata like the date, start time, duration, or user-provided metadata.
  • the metadata queries may be performed first, and then the scanner may be performed on the resulting subset.
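The two-stage order described above (metadata constraints first, then the scanner on the surviving subset) can be sketched as follows. The `Call` record, its field names, and the `search` function are hypothetical; the filter and scanner are passed in as plain callables.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class Call:
    call_date: date        # call session metadata
    duration_s: int        # call session metadata
    utterances: List[str]  # transcript text to be scanned


def search(
    calls: List[Call],
    metadata_filter: Callable[[Call], bool],
    scanner: Callable[[List[str]], bool],
) -> List[Call]:
    """Apply the cheap metadata query first, then scan only the subset."""
    subset = [call for call in calls if metadata_filter(call)]
    return [call for call in subset if scanner(call.utterances)]
```

Running the metadata query first means the text scanner, which is typically the more expensive step, only touches calls that already satisfy the date, time, or duration constraints.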
  • FIG. 1 illustrates a system 100 operating a speech synthesizing conversation agent for generating synthesized audio responses modeled after a member support representative (MSR) during a call session.
  • the system 100 comprises an audio interface 110, a sentiment analysis engine 106, a user interface switch 116, a controlled memory data structure 118, and a speech synthesizer engine 114.
  • an individual calls the call center to speak with an MSR.
  • the caller's audio signals are received through the audio interface 110 as a caller audio signal 120.
  • the caller audio signal 120 comprises audio of the caller stating that they have a problem with their account.
  • the sentiment analysis engine 106 comprises a transcription service 102 and a sentiment model 104.
  • the transcription service 102 transcribes the caller audio signal 120 into an audio transcript 108, noting the timing of each spoken word and pause between each spoken word in the caller audio signal 120.
  • the sentiment model 104 is a trained language processing model that is utilized by the sentiment analysis engine 106 to determine a sentiment score 112 from the timings of the audio transcript 108.
  • the audio transcript 108 with the sentiment score 112 is communicated to the user interface switch 116.
  • the user interface switch 116 allows the MSR (user model 136) to review the audio transcript 108 with the sentiment score 112 and determine whether to respond to the caller personally or allow the speech synthesizer engine 114 to generate a synthesized audio response 130.
  • the sentiment score 112 provides the user with insight into the prosodic features of the caller audio signal 120.
  • the user model 136 determines that they would like to handle further interactions and communicates a user model input to the user interface switch 116. The user model 136 may then vocalize a response such as "I'm sorry to hear that, let me see what I can do." that is communicated to the audio interface 110 as the user model audio response 128.
  • the user model audio response 128 comprises the user model 136's vocalized response as the response audio 140 and includes background noise 142 from the call center environment.
  • the user model audio response 128, the caller audio signal 120, and the audio transcript 108 comprising the sentiment score 112 are communicated to the controlled memory data structure 118 and stored as historical conversation data 132.
  • the user model 136 determines that the speech synthesizer engine 114 should handle the response to the caller audio signal 120.
  • the speech synthesizer engine 114 comprises a speech synthesis model 124 and a response logic engine 122 for generating a synthesized audio response 130.
  • the response logic engine 122 utilizes a conversational response model trained using the historical conversation data 132 to determine appropriate responses to the caller audio signal 120.
  • the response logic engine 122 utilizes the caller audio signal 120 and the audio transcript 108 comprising the sentiment score 112 to generate a response signal, which the speech synthesis model 124 uses as a basis for the synthesized audio response 130.
  • the speech synthesis model 124 is modelled using the historical conversation data 132 of the user model 136.
  • the speech synthesis model 124 allows the speech synthesizer engine 114 to generate a synthesized user model response 138 from the response signal with
  • the synthesized audio response 130 includes an ambient signal 134 modeled after the background noise 142 detected in the user model audio response 128, adding a layer of authenticity to the synthesized audio response 130.
  • the response logic engine 122 may determine different conversational/response states during the call session. These conversational/response states include an active response state and a passive response state.
  • An active response state may be a state during the call session that the response logic engine 122 determines that synthesized speech response is appropriate for the caller audio signal 120.
  • the synthesized speech response may be appropriate as a response to the sentiment score 112.
  • In the active response state, the response logic engine 122 generates a text to speech response 126 (TTS response) for the speech synthesis model 124, comprising text to be converted to speech by the speech synthesizer engine 114.
  • the synthesized speech response may be appropriate for an identified intent in the conversation such as questions or specific statements identified in the caller audio signal 120.
  • a passive response state is a conversational state where the speech synthesizer engine 114 may determine that a non-verbal response 146 is appropriate for the caller audio signal 120.
  • the passive response state may include instances in a conversation when the response logic engine 122 determines that an appropriate response is to continue listening to the caller while indicating that MSR side of the conversation is still listening.
  • the speech synthesizer engine 114 utilizes non-verbal response 146 and the speech synthesis model 124 to generate a synthesized audio response 130 comprising the ambient signal 134 and listening response cues as the synthesized user model response 138 to indicate that the MSR side of the call session is listening.
  • the user model 136 may decide that the caller may require a personalized response and communicate to the user interface switch 116 that they would like to handle further interactions with the caller. Due to the synthesized speech modeled after the user model 136 and the background noise in the ambient signal, the transition from the synthesized audio response 130 to the user model audio response 128 may appear seamless to the caller.
  • the speech synthesizer engine 114 is an artificial intelligence (A.I.) model trained to receive caller sentiment data and a response signal as determined by a response logic engine 122 and synthesize speech using the voice pattern of a user model to communicate the synthesized audio response 130 with the appropriate emotional modulation to be matched with the sentiment score 112.
  • a method 200 for operating a speech synthesizing conversation agent operates an audio interface to receive caller audio signal during a call session (block 202).
  • the method 200 generates an audio transcript comprising a sentiment score from the caller audio signal through operation of a sentiment analysis engine configured by a sentiment model.
  • the method 200 communicates the audio transcript to a user interface switch configured to receive inputs from a user model.
  • the method 200 communicates a response control from the user interface switch to a speech synthesizer engine trained with historical conversation data from the user model.
  • the method 200 operates the speech synthesizer engine.
  • In subroutine block 212, the speech synthesizer engine generates a response signal for the caller audio signal and the audio transcript through operation of a response logic engine configured by the historical conversation data. In subroutine block 214, the speech synthesizer engine generates a synthesized audio response comprising an ambient signal and a synthesized user model response from the response signal through operation of the speech synthesis model configured by the historical conversation data. In block 216, method 200 communicates the synthesized audio response responsive to the caller audio signal through the audio interface during the call session.
  • FIG. 3 illustrates a system 300 showing the information received and generated by a speech synthesizer engine 326.
  • the system 300 comprises a controlled memory data structure 324, a sentiment analysis engine 302, and the speech synthesizer engine 326.
  • the speech synthesizer engine 326 comprises a response logic engine 334 and a speech synthesis model 336.
  • the controlled memory data structure 324 functions as the storage location for historical conversation data 304 comprising scored conversation transcripts 322, such as audio transcripts with sentiment scores, and conversation audio 340, such as caller audio signals corresponding to the user model audio responses from a plurality of call sessions.
  • the historical conversation data 304, and in particular the conversation audio 340 comprising the user model audio responses, may be provided to the speech synthesizer engine 326 to generate/improve the speech synthesis model 336.
  • the caller audio signal 308 may be provided to the sentiment analysis engine 302, the response logic engine 334, and the controlled memory data structure 324 as a wav sample 328.
  • a wav sample 328 refers to audio in the waveform audio file format (.wav), an uncompressed waveform audio format that facilitates utilization by the sentiment analysis engine 302 and the speech synthesizer engine 326.
  • the sentiment analysis engine 302 comprises a transcription service 338 and a sentiment model 320.
  • the sentiment analysis engine 302 generates an audio transcript 306 with a sentiment score 312 from the caller audio signal 308 that is communicated to the response logic engine 334.
  • the response logic engine 334 receives the audio transcript 306 with the sentiment score 312 and the caller audio signal 308 and, in response, generates a response signal 344.
  • the speech synthesis model 336 is configured with the response signal 344 to generate the synthesized audio response 318.
  • the synthesized audio response 318 comprises a synthesized user model response 342 and an ambient signal 316.
  • the response logic engine 334 may generate a non-verbal response 332 or a text to speech response 330 as the response signal 344 to be communicated to the speech synthesis model 336.
  • If an active conversational state is identified, the response logic engine 334 generates a text to speech response 330 to be utilized by the speech synthesis model 336 to generate synthesized speech 310. If a passive conversation state is identified, the response logic engine 334 generates a non-verbal response 332 to be utilized by the speech synthesis model 336 to generate listening response cues 314.
  • the system 300 may be operated in accordance with the processes described in Figure 2, Figure 4, and Figure 5.
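The choice between a text to speech response and a non-verbal response can be sketched as a small dispatch function. This is only an illustration: the application describes a trained conversational response model, whereas the rules, the names `ResponseState`, `ResponseSignal`, and `choose_response`, and the sentiment scale of -1.0 to 1.0 used here are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseState(Enum):
    ACTIVE = "active"    # a spoken (text to speech) response is appropriate
    PASSIVE = "passive"  # keep listening and emit a listening cue


@dataclass
class ResponseSignal:
    state: ResponseState
    kind: str     # "text_to_speech" or "non_verbal"
    payload: str  # text to synthesize, or the name of a listening cue


def choose_response(transcript: str, sentiment_score: float) -> ResponseSignal:
    """Toy stand-in for the response logic engine (every rule is an assumption)."""
    asks_question = transcript.strip().endswith("?")
    strongly_negative = sentiment_score < -0.5  # assumed scale: -1.0 to 1.0
    if strongly_negative:
        reply = "I'm sorry to hear that, let me see what I can do."
        return ResponseSignal(ResponseState.ACTIVE, "text_to_speech", reply)
    if asks_question:
        reply = "Let me look into that for you."
        return ResponseSignal(ResponseState.ACTIVE, "text_to_speech", reply)
    # Nothing calls for speech yet: signal that the MSR side is still listening.
    return ResponseSignal(ResponseState.PASSIVE, "non_verbal", "mm-hmm")
```

In this sketch the returned `ResponseSignal` plays the role of the response signal handed to the speech synthesis model, which would render either synthesized speech or a listening response cue.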
  • a method 400 for operating a speech synthesizing conversation agent receives the caller audio signal through the audio interface (block 402).
  • the method 400 receives a user model input through the user interface switch from the user model, in response to receiving the caller audio signal through the audio interface.
  • the method 400 communicates a user model audio response, responsive to the caller audio signal, to the audio interface, the user model audio response comprising response audio and background noise.
  • method 400 stores the audio transcript, caller audio signal, and the user model audio response as historical conversation data in a controlled memory data structure.
  • the method 400 operates the speech synthesis model to generate the ambient signal from the background noise of user model responses in the historical conversation data.
  • Figure 5 illustrates a process 500 for operating the speech synthesizing conversation agent.
  • the process 500 receives the caller audio through an audio interface, such as a telephonic switch board.
  • the process 500 determines the caller's intent and sentiment.
  • the determination of intent and sentiment of the caller may be accomplished by the sentiment analysis engine.
  • the determination of the intent may be accomplished as a result of the transcription of the caller audio and the context of the transcribed words.
  • the determination of the sentiment of the caller may be accomplished through the use of the sentiment model.
  • the determination of the intent and sentiment of the call may be communicated to a user interface switch as an audio transcript with sentiment score.
  • the caller intent/sentiment may be communicated to a user interface switch which prompts the user (MSR) for instructions on how to proceed (block 506).
  • the instructions from the user may be provided back to the user interface switch in the form of a user model input.
  • the user may indicate instructions (decision block 508). If the user does not indicate an action, the speech synthesizing conversation agent may take over and perform a default set of actions (block 510). In some instances, the default set of actions performed by the speech synthesizing conversation agent may be to respond to the caller audio with a synthesized audio response.
  • the instructions may indicate approval by the user to allow the speech synthesizing conversation agent (bot) to handle the response to the caller audio (decision block 512). If the user indicates that they do not want to allow the bot to handle the response to the caller audio, the user may respond to the caller audio while the bot remains idle (block 514). If the user indicates that they do want to allow the bot to handle the response to the caller audio, the bot may start its call management sequence for generating the synthesized audio response (block 516).
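The branching in decision blocks 508 and 512 can be summarized in a few lines. The sketch below is a hypothetical rendering: the names `Handler` and `route_response`, and the encoding of the user model input as `True`, `False`, or `None`, are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Handler(Enum):
    USER = "user"  # the MSR answers the caller while the bot stays idle (block 514)
    BOT = "bot"    # the agent starts its call management sequence (block 516)


def route_response(user_model_input: Optional[bool],
                   default: Handler = Handler.BOT) -> Handler:
    """Decide who answers the caller.

    user_model_input is None when the user gave no instruction, True when the
    user approves the bot, and False when the user wants to respond personally.
    """
    if user_model_input is None:   # decision block 508: no action indicated
        return default             # block 510: default set of actions
    if user_model_input:           # decision block 512: bot approved
        return Handler.BOT
    return Handler.USER
```

The default of `Handler.BOT` mirrors the instance described above in which the agent responds with a synthesized audio response when the user gives no instruction.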
  • Figure 6 illustrates a supervised training phase 600 for the response logic engine 604.
  • the supervised training phase 600 may prompt the response logic engine 604 for proposed responses 608 to a caller transcript.
  • the user model 602 may then evaluate the proposed responses 608 and communicate response scoring 606 to the training interface 610 of the response logic engine 604.
  • Figure 7 illustrates a supervised training phase 700 for the response logic engine 706.
  • the user model 702 may act as an expert user that trains the response logic engine 706 through a training interface 704.
  • the user model 702 may communicate a plurality of call summary transcripts 724 to the response logic engine 706 through the training interface 704.
  • the training interface 704 may receive call session intent and sentiment score 710 as well as identified entities 712 in the call summary transcripts 724 from call sessions involving the user model 702.
  • the call session intent and sentiment score 710 and the identified entities 712 may be identified within the call summary transcripts 724 as annotations, manually entered by the user model 702.
  • the annotation style utilized by the user model 702 may be particularly configured for understanding by the response logic engine 706.
  • the audio transcript may include transcribed user model responses 718 from the user model 702.
  • the response logic engine 706 may utilize the call session intent and sentiment score 710, the identified entities 712, and the user model responses 718 to build a conversational model for generating caller audio responses.
  • the user model 702 may communicate a response audit 714 for a portion of a call transcript to the response logic engine 706.
  • the response logic engine 706 may respond with the identified response state 716 of the current conversation, response options 722, and certainty scores 708 for the response options.
  • the user model 702 may provide a feedback control 720 for validating or adjusting the response options.
  • FIG. 8 illustrates a supervised training phase 800 for the speech synthesizing conversation agent.
  • the audio interface 808 receives a caller audio signal 818 that is communicated to the sentiment analysis engine 810 comprising the sentiment model 804 and the transcription service 802.
  • the transcription service 802 of the sentiment analysis engine 810 generates an audio transcript 814 comprising a sentiment score 816.
  • the audio transcript 814 is communicated to the user model 702 which communicates it to the training interface 704 of the response logic engine 706.
  • the response logic engine 706 identifies an identified response state 716 for the conversation as well as response options 722 and certainty scores 708 that are communicated to the user model 702.
  • the speech synthesizer engine 812 may utilize the generated responses from the speech synthesizer engine 812 with the speech synthesis model 806 to generate synthesized audio responses for the caller audio signal 818.
  • the user model 702 may provide a feedback control 720 to improve responses communicated to the audio interface 808.
  • FIG. 9 illustrates an exemplary convolutional neural network 900.
  • the convolutional neural network 900 arranges its neurons in three dimensions (width, height, depth), as visualized in convolutional layer 904. Every layer of the convolutional neural network 900 transforms a 3D volume of inputs to a 3D output volume of neuron activations.
  • the input layer 902 encodes the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).
  • the convolutional layer 904 further transforms the outputs of the input layer 902, and the output layer 906 transforms the outputs of the convolutional layer 904 into one or more classifications of the image content.
  • Figure 10 illustrates exemplary convolutional neural network layers 1000 in more detail.
  • An example subregion of the input layer region 1004 of an input layer region 1002 region of an image is analyzed by a set of convolutional layer subregion 1008 in the
  • the input layer region 1002 is 32x32 neurons long and wide (e.g., 32x32 pixels), and three neurons deep (e.g., three color channels per pixel).
  • Each neuron in the convolutional layer 1006 is connected only to a local region in the input layer region 1002 spatially (in height and width), but to the full depth (i.e. all color channels if the input is an image). Note, there are multiple neurons (5 in this example) along the depth of the convolutional layer subregion 1008 that analyzes the subregion of the input layer region 1004 of the input layer region 1002, in which each neuron of the convolutional layer subregion 1008 may receive inputs from every neuron of the subregion of the input layer region 1004.
  • Figure 11 illustrates a popular form of a convolutional neural network (CNN) known as a VGG net 1100.
  • the initial convolution layer 1102 stores the raw image pixels and the final pooling layer 1120 determines the class scores.
  • Each of the intermediate convolution layers (convolution layer 1106, convolution layer 1112, and convolution layer 1116) and rectifier activations (RELU layer 1104, RELU layer 1108, RELU layer 1114, and RELU layer 1118) and intermediate pooling layers (pooling layer 1110, pooling layer 1120) along the processing path is shown as a column.
  • the VGG net 1100 replaces the large single-layer filters of basic CNNs with multiple 3x3 sized filters in series. With a given receptive field (the effective area size of input image on which output depends), multiple stacked smaller size filters may perform better at image feature classification than a single layer with a larger filter size, because multiple non-linear layers increase the depth of the network which enables it to learn more complex features. In a VGG net 1100 each pooling layer may be only 2x2.
  • FIG. 12 illustrates a convolution layer filtering 1200 that connects the outputs from groups of neurons in a convolution layer 1202 to neurons in a next layer 1206.
  • a receptive field is defined for the convolution layer 1202, in this example sets of 5x5 neurons.
  • the collective outputs of each neuron in the receptive field are weighted and mapped to a single neuron in the next layer 1206.
  • This weighted mapping is referred to as the filter 1204 for the convolution layer 1202 (or sometimes referred to as the kernel of the convolution layer 1202).
  • the filter 1204 depth is not illustrated in this example (i.e., the filter 1204 is actually a cubic volume of neurons in the convolution layer 1202, not a square as illustrated).
  • FIG. 12 shows how the filter 1204 is stepped to the right by 1 unit (the "stride"), creating a slightly offset receptive field from the top one, and mapping its output to the next neuron in the next layer 1206.
  • the stride can be and often is other numbers besides one, with larger strides reducing the overlaps in the receptive fields, and hence further reducing the size of the next layer 1206. Every unique receptive field in the convolution layer 1202 that can be defined in this stepwise manner maps to a different neuron in the next layer 1206.
  • the next layer 1206 need only be 28x28x1 neurons to cover all the receptive fields of the convolution layer 1202. This is referred to as an activation map or feature map. There is thus a reduction in layer complexity from the filtering. There are 784 different ways that a 5 x 5 filter can uniquely fit on a 32 x 32 convolution layer 1202, so the next layer 1206 need only be 28 x 28. The depth of the convolution layer 1202 is also reduced from 3 to 1 in the next layer 1206.
  • the number of total layers to use in a CNN is an example of a "hyperparameter" of the CNN.
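The 28 x 28 size of the next layer follows from counting how many positions the filter can occupy along each axis. A minimal sketch of that arithmetic, assuming no padding around the input (the function name `conv_output_size` is illustrative and does not come from the application):

```python
def conv_output_size(input_size: int, filter_size: int, stride: int = 1) -> int:
    """Number of filter positions along one spatial axis, without padding."""
    if filter_size > input_size:
        raise ValueError("filter does not fit inside the input")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return (input_size - filter_size) // stride + 1


# A 5x5 filter stepped by a stride of 1 across a 32x32 layer.
side = conv_output_size(32, 5, stride=1)
positions = side * side  # one neuron in the next layer per receptive field

# A larger stride reduces overlap and shrinks the next layer further.
side_stride_3 = conv_output_size(32, 5, stride=3)
```

Here `side` is 28 and `positions` is 784, matching the figures in the text, while a stride of 3 leaves only 10 positions per axis.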
  • FIG. 13 illustrates a pooling layer function 1300 with a 2x2 receptive field and a stride of two.
  • the pooling layer function 1300 is an example of the maxpool pooling technique.
  • the outputs of all the neurons in a particular receptive field of the input layer 1302 are replaced by the maximum valued one of those outputs in the pooling layer 1304.
  • Other options for pooling layers are average pooling and L2-norm pooling. The reason to use a pooling layer is that once a specific feature is recognized in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features.
  • Pooling layers can drastically reduce the spatial dimension of the input layer 1302 from that point forward in the neural network (the length and the width change but not the depth). This serves two main purposes. The first is that the amount of parameters or weights is greatly reduced thus lessening the computation cost. The second is that it will control overfitting. Overfitting refers to when a model is so tuned to the training examples that it is not able to generalize well when applied to live data sets.
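The maxpool operation with a 2x2 receptive field and a stride of two can be written out directly. The following is a plain-Python sketch for a single depth slice; the function name `max_pool_2x2` and the sample values are assumptions chosen for illustration.

```python
from typing import List


def max_pool_2x2(grid: List[List[float]]) -> List[List[float]]:
    """Max pooling with a 2x2 window and a stride of two on one depth slice.

    Any trailing odd row or column is dropped.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1])
            pooled_row.append(max(window))  # keep only the strongest activation
        pooled.append(pooled_row)
    return pooled


activations = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 8],
]
pooled = max_pool_2x2(activations)  # [[4, 5], [3, 8]]
```

The 4x4 input shrinks to 2x2, so the width and height are halved while each retained value still records that a strong activation occurred somewhere in its window.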
  • Figure 14 illustrates a diagram 1400 visualizing a WaveNet stack and its receptive fields.
  • the speech synthesizer engine may be configured from a WaveNet autoregressive model.
  • the diagram 1400 shows an input layer 1402 at the bottom feeding into a hidden layer 1404 with a dilation of one.
  • the hidden layer 1404 feeds into the hidden layer 1406 with a dilation of two.
  • the hidden layer 1406 feeds into the hidden layer 1408 with a dilation of four.
  • the hidden layer 1408 feeds into the output layer 1410 with a dilation of eight.
  • Raw audio data is typically very high-dimensional (e.g. 16,000 samples per second for 16kHz audio), and contains complex, hierarchical structures spanning many thousands of time steps, such as words in speech or melodies in music. Modelling such long-term dependencies with standard causal convolution layers would require a very deep network to ensure a sufficiently broad receptive field. WaveNet avoids this constraint by using dilated causal convolutions, which allow the receptive field to grow exponentially with depth.
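The effect of dilation on the receptive field can be checked with simple arithmetic. The sketch below assumes a filter width of two at every layer, which the application does not state, and the function name `receptive_field` is illustrative.

```python
from typing import Sequence


def receptive_field(dilations: Sequence[int], kernel_size: int = 2) -> int:
    """Input samples seen by one output of a stack of causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Dilations of one, two, four, and eight, as in the diagram 1400.
dilated = receptive_field([1, 2, 4, 8])

# The same four layers without dilation cover far fewer samples.
undilated = receptive_field([1, 1, 1, 1])
```

Under these assumptions the dilated stack covers 16 input samples with four layers, against 5 for the undilated stack, and each added layer that doubles the dilation doubles the coverage.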
  • WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). These consist of layers of interconnected nodes somewhat analogous to a brain’s neurons.
  • the CNN takes a raw signal as an input and synthesizes an output one sample at a time. It does so by sampling from a softmax (i.e. categorical)
  • CTC loss function refers to connectionist temporal classification, a type of neural network output and associated scoring function, for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable.
  • a CTC network has a continuous output (e.g. softmax), which is fitted through training to model the probability of a label.
  • CTC does not attempt to learn boundaries and timings: label sequences are considered equivalent if they differ only in alignment, ignoring blanks. Equivalent label sequences can occur in many ways, which makes scoring a non-trivial task. Fortunately there is an efficient forward-backward algorithm for that. CTC scores can then be used with the back-propagation algorithm to update the neural network weights.
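The notion of label sequences that differ only in alignment can be made concrete with the usual CTC collapsing rule: merge consecutive repeated labels, then drop blanks. The sketch below illustrates only that rule, not the forward-backward scoring, and the choice of "-" as the blank symbol is an assumption.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol for this illustration


def ctc_collapse(path: str, blank: str = BLANK) -> str:
    """Map a frame-level label path to its label sequence.

    Consecutive repeats are merged first, then blanks are removed, so a blank
    between two identical labels keeps them as separate labels.
    """
    merged = (label for label, _ in groupby(path))
    return "".join(label for label in merged if label != blank)


# Two different alignments of the same labels collapse to the same sequence.
first = ctc_collapse("hh-e-ll-lo")
second = ctc_collapse("h-eel-loo")
```

Both paths collapse to "hello", which is why a CTC score has to sum over every path that collapses to the target sequence.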
  • GRU refers to a Gated Recurrent Unit.
  • HMM refers to a hidden Markov model.
  • Beam search refers to a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic. But in beam search, only a predetermined number of best partial solutions are kept as candidates. It is thus a greedy algorithm. Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number, b, of best states at each level (called the beam width).
  • Beam search is not optimal (that is, there is no guarantee that it will find the best solution). In general, beam search returns the first solution found. Beam search for machine translation is a different case: once reaching the configured maximum search depth (i.e. translation length), the algorithm will evaluate the solutions found during search at various depths and return the best one (the one with the highest probability).
  • the beam width can either be fixed or variable. One approach that uses a variable beam width starts with the width at a minimum. If no solution is found, the beam is widened and the procedure is repeated.
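The level-by-level procedure described above can be sketched as follows. This is a minimal illustration with a fixed beam width; the function signature and the toy spelling problem are assumptions and are not part of the application.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

State = TypeVar("State")


def beam_search(start: State,
                successors: Callable[[State], Iterable[State]],
                cost: Callable[[State], float],
                is_goal: Callable[[State], bool],
                beam_width: int,
                max_depth: int = 50) -> Optional[State]:
    """Breadth-first search that keeps only the beam_width best states per level."""
    level: List[State] = [start]
    for _ in range(max_depth + 1):
        for state in level:
            if is_goal(state):
                return state  # first solution found, not necessarily the best
        # Generate all successors of the current level, sort them by
        # increasing heuristic cost, and prune to the beam width.
        expanded = [nxt for state in level for nxt in successors(state)]
        level = sorted(expanded, key=cost)[:beam_width]
        if not level:
            return None
    return None


# Toy problem: spell TARGET one letter at a time.
TARGET = "cab"


def spell_successors(prefix: str) -> List[str]:
    return [prefix + ch for ch in "abc"] if len(prefix) < len(TARGET) else []


def mismatches(prefix: str) -> int:
    return sum(a != b for a, b in zip(prefix, TARGET))


found = beam_search("", spell_successors, mismatches,
                    lambda s: s == TARGET, beam_width=2)
```

Because states outside the beam are discarded for good, a misleading heuristic can prune the only path to a solution, which is the sense in which beam search is not optimal. The variable-width approach mentioned above would wrap this call in a loop that widens `beam_width` whenever `None` is returned.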
  • Adam optimizer refers to an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively based on training data.
  • Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training.
  • With Adam, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.
  • Adam combines the advantages of two other extensions of stochastic gradient descent: the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp).
  • Rather than adapting the parameter learning rates from a running average of the squared gradients alone, as in RMSProp, Adam uses running averages of both the gradients (the first moment, the mean) and the squared gradients (the second moment, the uncentered variance). Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages. The initial value of the moving averages and beta1 and beta2 values close to 1.0 (recommended) result in a bias of moment estimates towards zero. This bias is overcome by first calculating the biased estimates before then calculating bias-corrected estimates.
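The moving averages and the bias correction can be written out for a single scalar parameter. The sketch below uses the commonly cited default hyperparameters; the function name `adam_step` and the quadratic test function are assumptions made for illustration.

```python
import math
from typing import Tuple


def adam_step(x: float, grad: float, m: float, v: float, t: int,
              alpha: float = 0.001, beta1: float = 0.9,
              beta2: float = 0.999, eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter x; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad * grad  # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)               # bias-corrected second moment
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# Three steps on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t, alpha=0.1)
```

Without the bias correction, the first update would be scaled by the near-zero initial averages; with it, the first step has a magnitude close to `alpha` regardless of the size of the gradient.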
  • references to “one embodiment” or “an embodiment” do not necessarily refer to the same embodiment, although they may.
  • the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively, unless expressly limited to a single one or multiple ones.
  • the words “herein,” “above,” “below,” and words of similar import, when used in this application, refer to this application as a whole and not to any particular portions of this application.

PCT/US2020/013160 2019-01-29 2020-01-10 Conversational speech agent Ceased WO2020159693A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021544347A JP7313455B2 (ja) 2019-01-29 2020-01-10 発話エージェント
EP20748597.0A EP3918508A4 (en) 2019-01-29 2020-01-10 CONVERSATIONAL LANGUAGE AGENT

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/929,095 US10770059B2 (en) 2019-01-29 2019-01-29 Conversational speech agent
US15/929,095 2019-01-29

Publications (1)

Publication Number Publication Date
WO2020159693A1 (en) 2020-08-06

Family

ID=71732630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/013160 Ceased WO2020159693A1 (en) 2019-01-29 2020-01-10 Conversational speech agent

Country Status (4)

Country Link
US (1) US10770059B2
EP (1) EP3918508A4
JP (1) JP7313455B2
WO (1) WO2020159693A1

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030215066A1 (en) 2002-05-16 2003-11-20 Craig Shambaugh Method and apparatus for agent optimization using speech synthesis and recognition
US20150066479A1 (en) * 2012-04-20 2015-03-05 Maluuba Inc. Conversational agent

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018711A (en) * 1998-04-21 2000-01-25 Nortel Networks Corporation Communication system user interface with animated representation of time remaining for input to recognizer
US20040162724A1 (en) * 2003-02-11 2004-08-19 Jeffrey Hill Management of conversations
JP2005062240A (ja) * 2003-08-13 2005-03-10 Fujitsu Ltd 音声応答システム
US20080086690A1 (en) * 2006-09-21 2008-04-10 Ashish Verma Method and System for Hybrid Call Handling
US20180012595A1 (en) * 2016-07-07 2018-01-11 Intelligently Interactive, Inc. Simple affirmative response operating system
US9812151B1 (en) * 2016-11-18 2017-11-07 IPsoft Incorporated Generating communicative behaviors for anthropomorphic virtual agents based on user's affect
KR20190121758A (ko) * 2017-02-27 2019-10-28 소니 주식회사 정보 처리 장치, 정보 처리 방법, 및 프로그램
US9865260B1 (en) * 2017-05-03 2018-01-09 Google Llc Proactive incorporation of unsolicited content into human-to-computer dialogs
KR20190004495A (ko) * 2017-07-04 2019-01-14 삼성에스디에스 주식회사 챗봇을 이용한 태스크 처리 방법, 장치 및 시스템
US10504514B2 (en) * 2017-09-29 2019-12-10 Visteon Global Technologies, Inc. Human machine interface system and method for improving user experience based on history of voice activity
US10424302B2 (en) * 2017-10-12 2019-09-24 Google Llc Turn-based reinforcement learning for dialog management
CN107943896A (zh) * 2017-11-16 2018-04-20 Baidu Online Network Technology (Beijing) Co., Ltd. Information processing method and apparatus
US10475451B1 (en) * 2017-12-06 2019-11-12 Amazon Technologies, Inc. Universal and user-specific command processing
JP7044415B2 (ja) * 2017-12-31 2022-03-30 Midea Group Co., Ltd. Method and system for controlling a home assistant device
CN108600911B (zh) * 2018-03-30 2021-05-18 Lenovo (Beijing) Co., Ltd. Output method and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3918508A4

Also Published As

Publication number Publication date
US20200243062A1 (en) 2020-07-30
EP3918508A1 (en) 2021-12-08
JP7313455B2 (ja) 2023-07-24
US10770059B2 (en) 2020-09-08
EP3918508A4 (en) 2022-11-09
JP2022523504A (ja) 2022-04-25

Similar Documents

Publication Publication Date Title
US10770059B2 (en) Conversational speech agent
Oord et al. Parallel wavenet: Fast high-fidelity speech synthesis
US11645547B2 (en) Human-machine interactive method and device based on artificial intelligence
Tokozume et al. Learning environmental sounds with end-to-end convolutional neural network
Li et al. Learning fine-grained cross modality excitement for speech emotion recognition
Sarthak et al. Spoken language identification using convnets
Pinto et al. Analysis of MLP-based hierarchical phoneme posterior probability estimator
Wu et al. Improving interpretability and regularization in deep learning
CN111144097B (zh) Modeling method and apparatus for an emotion tendency classification model for dialogue text
EP1479069B1 (en) Method for accelerating the execution of speech recognition neural networks and the related speech recognition device
KR20220098991A (ko) Apparatus and method for emotion recognition based on speech signals
CN114548423A (zh) Machine learning attention model featuring omnidirectional processing
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
KR102418260B1 (ko) Method for analyzing customer consultation records
Pieraccini AI assistants
Soliman et al. Isolated word speech recognition using convolutional neural network
May Kernel approximation methods for speech recognition
CN118197306A (zh) Voice dialogue method, system, electronic device, and storage medium
KR102159988B1 (ko) Method and system for generating a voice montage
Cong et al. Unsatisfied customer call detection with deep learning
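CN119152838A (zh) Speech synthesis method, apparatus, and computer device
CN115101091B (zh) Sound data classification method, terminal, and medium based on weighted fusion of multidimensional features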
US12277382B2 (en) Method and system to modify speech impaired messages utilizing neural network audio filters
Karanasou et al. I-vectors and structured neural networks for rapid adaptation of acoustic models
Renkens et al. Incrementally learn the relevance of words in a dictionary for spoken language acquisition

Legal Events

Date Code Title Description
ENP Entry into the national phase Ref document number: 2021544347; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase Ref country code: DE
ENP Entry into the national phase Ref document number: 2020748597; Country of ref document: EP; Effective date: 20210830