WO2019103569A1 - Method for improving performance of voice recognition on basis of context, computer apparatus, and computer-readable recording medium - Google Patents
- Publication number
- WO2019103569A1 (PCT/KR2018/014680)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- present
- user
- stt
- text
- module
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Definitions
- the present invention relates to an interactive AI agent system, and more particularly to a method for improving the performance of speech recognition based on context.
- Korean Patent Laid-Open Publication No. 10-2013-0031231 discloses a technology for presenting a user with a plurality of text conversion results for a voice input so that the user can directly select the accurate text conversion result.
- Korean Patent Laid-Open Publication No. 10-2017-0099917 discloses a technology for proposing a plurality of responses, based on context information, for each of a plurality of text conversion results for a speech input.
- one service provider may provide the interactive AI agent system as a whole, while some functions are serviced through an external, specialized server.
- a function of converting a user's voice into text can be provided in the form of an API, a representative example being the Google Speech API.
- STT (Speech-To-Text)
- when such a service is received from an external STT server, the system transmits a voice input, or transmits the voice file together with its file format and a syntax hint, and receives at least one text conversion value associated with the transmitted voice input.
- a syntax hint is information that aids in the processing of a given audio, and may be a specific word or phrase.
- the external STT server can improve the accuracy of voice recognition of the transmitted voice file by using the syntax hint.
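- as an illustrative sketch only (the field names such as `speech_contexts` and `phrases` are assumptions loosely modeled on request shapes used by services like the Google Speech API, not the actual schema), a request carrying a voice file, its format, and syntax hints might be packaged as follows:

```python
import base64

def build_stt_request(audio_bytes, encoding, sample_rate_hz, hint_phrases):
    """Package a voice input, its file format, and syntax hints into one
    request payload.  Field names are illustrative only; a real STT service
    defines its own schema, but the same information is carried."""
    return {
        "config": {
            "encoding": encoding,                 # audio file format
            "sample_rate_hertz": sample_rate_hz,  # sampling rate of the file
            # syntax hints: words/phrases likely to occur in this audio
            "speech_contexts": [{"phrases": list(hint_phrases)}],
        },
        # audio content is commonly sent base64-encoded
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }

request = build_stt_request(b"\x00\x01\x02", "LINEAR16", 16000,
                            ["return inquiry", "brand inquiry"])
```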
- an interactive AI agent system receives speech in free-speech form and provides services based on the contexts of various domains.
- the interactive AI agent system builds a hierarchical conversation flow management model containing sufficient dialog management knowledge, for example sequential conversation flow patterns for providing the corresponding service, and manages it so as to provide appropriate information when converting recognized speech into text.
- an interactive AI agent system that can more easily grasp the user's intention based on accurate user speech recognition and provide an appropriate response can be provided.
- FIG. 1 is a schematic diagram of a system environment in which an interactive AI agent system may be implemented, according to one embodiment of the present invention.
- FIG. 2 is a functional block diagram that schematically illustrates the functional configuration of the user terminal 102 of FIG. 1, in accordance with one embodiment of the present invention.
- FIG. 3 is a functional block diagram that schematically illustrates the functional configuration of the interactive AI agent server 106 of FIG. 1, according to one embodiment of the present invention.
- FIG. 4 is an exemplary operational flow diagram performed by the STT auxiliary module of FIG. 3, in accordance with an embodiment of the present invention.
- " module " or " module " means a functional part that performs at least one function or operation, and may be implemented by hardware or software or a combination of hardware and software. Also, a plurality of "modules” or “sub-modules” may be integrated into at least one software module and implemented by at least one processor, except for "module” or “sub-module” have.
- the 'interactive AI agent system' is an information processing system that receives natural-language input from a user (e.g., commands, statements, requests, questions, etc.) through conversational interaction via natural language in voice and/or text form, determines the user's intent, and performs the necessary operations based on that intent, i.e., provides an appropriate conversational response and/or performs a task; it is not limited to any particular form of information processing system.
- the interactive AI agent system may be for providing a predetermined service, wherein the service may comprise a plurality of sub-task categories (e.g., product inquiries, brand inquiries, design inquiries, price inquiries, return inquiries, etc.).
- the operations performed by the 'interactive AI agent system' include, for example, providing an interactive response and/or performing a task, each of which is performed according to the user's intention within a sequential flow of sub-task categories.
- the interactive response provided by the 'interactive AI agent system' may be in various visual, auditory, and/or tactile forms (e.g., voice, sound, text, video, image, symbol, emoticon, hyperlink, animation, various notices, motion, haptic feedback, and the like).
- the task performed by the 'interactive AI agent system' may include various kinds of tasks such as, for example, searching for information, proceeding with the purchase of goods, composing a message, composing an email, dialing, playing music, photographing, and destination/navigation services (including, but not limited to, these examples).
- the 'interactive AI agent system' may include a chatbot system based on a messenger platform, for example a chatbot system that exchanges messages with a user on a messenger and provides various information desired by the user, but it should be understood that the present invention is not limited thereto.
- FIG. 1 is a schematic diagram of a system environment 100 in which an interactive AI agent system may be implemented, in accordance with one embodiment of the present invention.
- the system environment 100 includes a plurality of user terminals 102a-102n, a communication network 104, an interactive AI agent server 106, an external service server 108, and an external STT server 110.
- each of the plurality of user terminals 102a-102n may be any user electronic device having wired or wireless communication capability.
- each of the user terminals 102a-102n may be any of a variety of wired or wireless communication terminals, including, for example, a smart speaker, a music player, a game console, a digital TV, a set-top box, a smart phone, a tablet PC, a desktop, and a laptop; it is to be understood that the invention is not limited to any particular form.
- each of the user terminals 102a-102n can communicate with the interactive AI agent server 106 via the communication network 104, that is, send and receive necessary information.
- each of the user terminals 102a-102n can communicate with the external service server 108 through the communication network 104, that is, send and receive necessary information.
- each of the user terminals 102a-102n may receive user input in the form of voice and/or text from the outside, and may provide the user with an operation result corresponding to that input (e.g., providing a specific conversation response and/or performing a specific task) obtained through communication with the interactive AI agent server 106 and/or the external service server 108 via the communication network 104 (and/or through processing within the user terminals 102a-102n).
- a conversation response provided by the user terminals 102a-102n as a result of an operation corresponding to a user input may be provided, for example, in accordance with the conversation flow pattern of the sub-task category corresponding to the user input at that point in a sequential flow of sub-task categories.
- each of the user terminals 102a-102n may provide a conversation response, as a result of an operation corresponding to a user input, in various visual, audible, and/or tactile forms (e.g., voice, sound, text, video, images, symbols, emoticons, hyperlinks, animations, various notices, motion, haptic feedback, and the like).
- task execution as an operation corresponding to a user input may include performing various kinds of tasks such as, for example, searching for information, proceeding with the purchase of goods, composing a message, composing an email, dialing, playing music, photographing, and destination/navigation services.
- the communication network 104 may include any wired or wireless communication network, e.g., a TCP / IP communication network.
- the communication network 104 may include, for example, a Wi-Fi network, a LAN network, a WAN network, an Internet network, and the like, and the present invention is not limited thereto.
- the communication network 104 may be implemented using any of various wired or wireless communication protocols such as Ethernet, GSM, EDGE, CDMA, TDMA, OFDM, Bluetooth, VoIP, and Wi-Fi.
- the interactive AI agent server 106 may communicate with the user terminals 102a-102n via the communication network 104.
- the interactive AI agent server 106 sends and receives necessary information to and from the user terminals 102a-102n via the communication network 104, and can provide the user with an operation result corresponding to the user input, i.e., matching the user's intention.
- the interactive AI agent server 106 receives user natural-language input in voice form from the user terminals 102a-102n, for example via the communication network 104, and can convert it into user natural-language input in text form. According to an embodiment of the present invention, the interactive AI agent server 106 transmits the user voice input received from the user terminals 102a-102n to the external STT server 110, and can receive from the external STT server 110 at least one text data corresponding to the user input in voice form. According to one embodiment of the present invention, the interactive AI agent server 106 receives at least one text data from the external STT server 110, performs an evaluation on each of them based on the STT conversion assistance database described below, and can output the at least one text data together with the evaluation result.
- the interactive AI agent server 106 receives user natural-language input in the form of speech and/or text from the user terminals 102a-102n, for example via the communication network 104, and can process the received natural-language input based on previously prepared models to determine the intent of the user.
- the interactive AI agent server 106 may communicate with the external service server 108 via the communication network 104, as described above.
- the external service server 108 may be, for example, a messaging service server, an online consultation center server, an online shopping mall server, an information search server, a map service server, a navigation service server, and the like.
- it should be noted that an interactive response based on the user's intent, transmitted from the interactive AI agent server 106 to the user terminals 102a-102n, may include content obtained from the external service server 108.
- although the interactive AI agent server 106 is shown as a separate physical server configured to communicate with the external service server 108 via the communication network 104, the present disclosure is not limited thereto. According to another embodiment of the present invention, the interactive AI agent server 106 may be included as part of various service servers, such as an online consultation center server or an online shopping mall server.
- the interactive AI agent server 106 collects conversation logs (which may include, e.g., a plurality of user and/or system utterance records) over various paths, automatically analyzes the collected conversation logs, and creates and/or updates a conversation flow management model based on the analysis results.
- the interactive AI agent server 106 classifies each utterance record into one of the predetermined task categories, for example through keyword analysis on the collected conversation logs, and can probabilistically analyze the sequential flow between the categories.
- the external STT server 110 receives a voice input of a user through a communication module, converts the received voice input into text data in at least one character form, and can transmit it.
- the external STT server 110 may receive the user's speech input and related syntax hints and convert the user's speech input into text data in at least one character form based thereon.
- the user terminal 102 includes a user input receiving module 202, a sensor module 204, a program memory module 206, a processing module 208, a communication module 210, and a response output module 212.
- the user input receiving module 202 may receive various forms of input from a user, for example natural-language input such as voice input and/or text input (and additionally other forms of input, such as touch input).
- the user input receiving module 202 includes, for example, a microphone and an audio circuit, and can acquire a user audio input signal through a microphone and convert the obtained signal into audio data.
- the user input receiving module 202 may include various forms of input devices, such as various pointing devices (e.g., a mouse, a joystick, and a trackball), a keyboard, a touch panel, and a touch screen, and can acquire text input and/or touch input signals entered by the user through these input devices.
- the user input received at the user input receiving module 202 may be associated with performing a predetermined task, such as executing a predetermined application or searching for certain information, but is not limited thereto.
- the user input received at the user input receiving module 202 may require only a simple conversation response, regardless of the execution of a predetermined application or retrieval of information.
- the user input received at the user input receiving module 202 may relate to a simple statement for unilateral communication.
- the sensor module 204 includes one or more sensors of different types, through which status information of the user terminal 102 (e.g., the physical state, software and/or hardware status of the user terminal 102, or information regarding the environmental conditions around the user terminal 102) can be obtained.
- the sensor module 204 may include an optical sensor, for example, and may sense the ambient light condition of the user terminal 102 through the optical sensor.
- the sensor module 204 includes, for example, a movement sensor and can detect whether the corresponding user terminal 102 is moving through the movement sensor.
- the sensor module 204 includes, for example, a velocity sensor and a GPS sensor, and through these sensors, the position and / or orientation of the corresponding user terminal 102 can be detected.
- the sensor module 204 may include other various types of sensors, including temperature sensors, image sensors, pressure sensors, touch sensors, and the like.
- the program memory module 206 may be any storage medium that stores various programs that may be executed on the user terminal 102, such as various application programs and related data.
- the program memory module 206 may include, for example, a telephone dialer application, an email application, an instant messaging application, a camera application, a music playback application, a video playback application, an image management application, and the like, and data related to the execution of these programs.
- the program memory module 206 may be configured to include various forms of volatile or non-volatile memory, such as DRAM, SRAM, DDR RAM, ROM, magnetic disk, optical disk, and the like.
- the processing module 208 may communicate with each component module of the user terminal 102 and perform various operations on the user terminal 102. According to one embodiment of the present invention, the processing module 208 can launch and execute various application programs on the program memory module 206. According to one embodiment of the present invention, the processing module 208 may receive signals from the user input receiving module 202 and the sensor module 204, if necessary, and perform appropriate processing on these signals. According to one embodiment of the present invention, the processing module 208 may, if necessary, perform appropriate processing on signals received from the outside via the communication module 210.
- the communication module 210 provides for the user terminal 102 to communicate with the interactive AI agent server 106 and/or the external service server 108 via the communication network 104 of FIG. 1.
- the communication module 210 may transmit signals received from, for example, the user input receiving module 202 and the sensor module 204, via the communication network 104 in accordance with a predetermined protocol, to the interactive AI agent server 106 and/or the external service server 108.
- the communication module 210 may receive various signals transmitted from the interactive AI agent server 106 and/or the external service server 108 via the communication network 104, e.g., a response signal including a natural-language response in voice and/or text form, or various control signals, and perform appropriate processing according to a predetermined protocol.
- the response output module 212 may output a response corresponding to a user input in various forms, such as visual, auditory, and/or tactile.
- the response output module 212 may include various display devices, such as a touch screen based on technology such as LCD, LED, OLED, or QLED, and can present visual responses corresponding to user input, such as text, symbols, video, images, hyperlinks, animations, and various notices, to the user through these display devices.
- the response output module 212 may include, for example, a speaker or a headset, and can provide an audible response corresponding to user input, e.g., a voice and/or acoustic response, to the user.
- the response output module 212 includes a motion/haptic feedback generator, through which a tactile response, e.g., motion/haptic feedback, can be provided to the user.
- the response output module 212 may simultaneously provide any combination of two or more of a text response, a voice response, and a motion / haptic feedback corresponding to a user input.
- FIG. 3 is a functional block diagram that schematically illustrates the functional configuration of the interactive AI agent server 106 of FIG. 1, according to one embodiment of the present invention.
- the interactive AI agent server 106 includes a communication module 310, a Speech-To-Text (STT) auxiliary module 320, a Natural Language Understanding (NLU) module 330, a Text-To-Speech (TTS) module 340, a storage module 350, and a conversation flow management model building/updating module 360.
- the communication module 310 provides for the interactive AI agent server 106 to communicate with the user terminal 102, the external service server 108, and/or the external STT server 110 via the communication network 104, in accordance with any wired or wireless communication protocol.
- the communication module 310 can receive voice input and / or text input from the user, transmitted from the user terminal 102 via the communication network 104.
- the communication module 310 may receive, with or without voice input and/or text input from the user, status information of the user terminal 102 transmitted from the user terminal 102 via the communication network 104.
- the status information may include various status information associated with the user terminal 102 at the time of the voice and/or text input from the user (e.g., the physical state of the user terminal 102, the software and/or hardware status of the user terminal 102, environmental status information around the user terminal 102, etc.).
- the communication module 310 may also transmit an interactive response generated in response to the user input (e.g., a natural-language response in voice and/or text form) and/or control signals to the user terminal 102 via the communication network 104.
- the STT auxiliary module 320 can receive the voice input among the user input received through the communication module 310 and transmit the received voice input to the external STT server 110. According to one embodiment of the present invention, the STT auxiliary module 320 can transmit the voice input received through the communication module 310 together with information related to that voice input to the external STT server 110. According to one embodiment of the present invention, the STT auxiliary module 320 receives through the communication module 310 at least one text data converted from the transmitted voice input, and can evaluate the conversion accuracy for each of the at least one text data based on the STT conversion assistance database of the storage module 350.
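- one simple form the evaluation above could take, sketched here under the assumption that the assistance database supplies context keywords for the current conversation (the keyword-overlap scoring is an illustrative assumption, not the disclosed algorithm):

```python
def evaluate_candidates(candidates, context_keywords):
    """Score each candidate text conversion by how many context keywords it
    contains, returning (text, score) pairs with the best fit first.
    Keyword overlap stands in for the database-driven evaluation."""
    scores = []
    for text in candidates:
        words = set(text.lower().split())
        hits = sum(1 for kw in context_keywords if kw.lower() in words)
        scores.append((text, hits))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores

# two acoustically similar candidates; context favors the first
ranked = evaluate_candidates(
    ["I want to return this item", "I want to turn this item"],
    {"return", "refund"},
)
```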
- the NLU module 330 may receive text input from the communication module 310 or the STT auxiliary module 320.
- the text input received at the NLU module 330 may be, for example, a user text input received from the user terminal 102 via the communication network 104 and the communication module 310, or at least one text conversion result (e.g., a sequence of words) received from the external STT server via the STT auxiliary module 320 for a user voice input.
- the NLU module 330 may receive, together with a text input or thereafter, status information associated with the corresponding user input, e.g., the status information of the user terminal 102 at the time of the user input.
- the status information may include various status information associated with the user terminal 102 at the time of the user input (e.g., the physical state of the user terminal 102, the software and/or hardware status of the user terminal 102, environmental status information around the user terminal 102, etc.).
- the NLU module 330 may map the received text input to one or more user intents, where a user intent can be associated with a series of operation(s) that can be understood and performed by the interactive AI agent server 106 in accordance with that intent. According to one embodiment of the present invention, the NLU module 330 may refer to the status information described above in associating the received text input with one or more user intents.
- the TTS module 340 may receive an interactive response that is generated to be transmitted to the user terminal 102.
- the interactive response received at the TTS module 340 may be a natural-language sentence or a sequence of words having a textual form.
- the TTS module 340 may convert the input of the above received text form into speech form according to various types of algorithms.
- the storage module 350 may include various databases. According to one embodiment of the present invention, the storage module 350 may include a user database 352, a conversation understanding knowledge base 354, a conversation log database 356, and a conversation flow management model 358.
- the user database 352 may be a database for storing and managing characteristic data for each user.
- the user database 352 may include various user-specific information, for example the user's previous conversation history, the user's pronunciation feature information, the user's lexical preferences, the user's location, and the like.
- the conversation understanding knowledge base 354 may include, for example, a predefined ontology model.
- an ontology model can be represented, for example, as a hierarchical structure between nodes, where each 'intention' node corresponding to a user intent is linked to 'attribute' nodes (sub-attribute nodes linked directly to the 'intention' node, or linked in turn to another 'attribute' node of that 'intention' node).
- an 'intention' node and the 'attribute' nodes directly or indirectly linked to that 'intention' node may constitute one domain, and the ontology may be composed of a set of such domains.
- the conversation understanding knowledge base 354 may be configured to include domains corresponding to all intents that the interactive AI agent system can understand and for which it can perform the corresponding actions.
- the ontology model can be dynamically changed by addition or deletion of nodes or modification of relations between nodes.
- the intention nodes and attribute nodes of each domain in the ontology model may be associated with words and / or phrases associated with corresponding user intents or attributes, respectively.
- the conversation understanding knowledge base 354 may be implemented, for example, in the form of a lexical dictionary comprising an ontology model including a hierarchy of nodes and the set of words and/or phrases associated with each node, and the STT auxiliary module 320 can determine the user's intention based on the ontology model implemented in the lexical dictionary form.
- upon receipt of a text input or a sequence of words, the STT auxiliary module 320 can determine which of the domains in the ontology model the respective words in the sequence are associated with, and can determine the corresponding domain, i.e., the user intention, based on such a determination.
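- the domain determination described above can be sketched as follows, with a hypothetical lexical dictionary standing in for the ontology model (the domain names and vocabularies are invented for illustration):

```python
# Hypothetical lexical dictionary: each domain's intention/attribute nodes
# are associated with a set of words, as in the ontology model above.
LEXICON = {
    "product_inquiry": {"product", "stock", "available"},
    "return_inquiry": {"return", "refund", "exchange"},
}

def determine_domain(word_sequence):
    """Count which domain's vocabulary the input words fall into, and take
    the best-matching domain as the user intention.  Returns None when no
    word matches any domain (simplified sketch)."""
    counts = {domain: sum(word in vocab for word in word_sequence)
              for domain, vocab in LEXICON.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```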
- the conversation log database 356 may be a database that classifies, stores, and manages conversation logs collected in any of various ways according to predetermined criteria. According to an embodiment of the present invention, the conversation log database 356 may store, for example, words, phrases, sentences, and various other forms of user input frequently used by users of the service domain, in association with their frequency of use.
- the conversation flow management model 358 may include a probabilistic distribution model for the sequential flow between the plurality of sub-task categories needed for providing a service in relation to a given service domain.
- the conversation flow management model 358 may represent, for example, the sequential flow between the sub-task categories belonging to the service domain in the form of a probability graph.
- the conversation flow management model 358 may include, for example, the probability of occurrence obtained for each of the various sequential flows that may occur between the sub-task categories.
- the conversation flow management model 358 may also include a library of dialog patterns belonging to each task category.
- each database included in the storage module 350 may, for example, reside at the user terminal 102, or be distributed across the user terminal 102 and the interactive AI agent server 106.
- the conversation flow management model building/updating module 360 automatically analyzes each conversation log, collected by any of a variety of methods and stored in the conversation log database 356, and can build and/or update the conversation flow management model.
- the conversation flow management model building/updating unit 360 can classify each utterance record into one of the predetermined sub-task categories, for example through keyword analysis on the conversation logs stored in the conversation log database 356, and group the utterance records of the same sub-task category.
- the conversation flow management model building/updating unit 360 can grasp, for example, the sequential flow between the groups, i.e., between the sub-task categories, as a probabilistic distribution.
- the conversation flow management model building/updating unit 360 can represent the sequential flow between the sub-task categories on the service domain, for example, in the form of a probability graph.
- the conversation flow management model building/updating unit 360 may determine, for example, for all sequential flows that may occur between the sub-task categories, the probability of occurrence of the flow between each pair of task categories, and thereby obtain a probabilistic distribution of each sequential flow between the sub-task categories.
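- the probabilistic-distribution step above can be sketched as follows, assuming the utterance records have already been tagged with sub-task categories (the category names are illustrative):

```python
from collections import Counter, defaultdict

def flow_probabilities(tagged_sessions):
    """Estimate, from conversation logs whose utterances are already tagged
    with a sub-task category, the probability of each sequential flow
    between categories (a sketch of the build/update step above)."""
    transitions = defaultdict(Counter)
    for session in tagged_sessions:
        # count each adjacent (previous category -> next category) pair
        for prev, nxt in zip(session, session[1:]):
            transitions[prev][nxt] += 1
    # normalize counts into per-category probability distributions
    return {
        prev: {nxt: count / sum(counter.values())
               for nxt, count in counter.items()}
        for prev, counter in transitions.items()
    }

model = flow_probabilities([
    ["product", "brand"], ["product", "price"],
    ["product", "brand"], ["product", "brand"],
])
```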
- the conversation flow management model building/updating unit 360 performs keyword analysis on the conversation logs collected in any of various ways, and can classify and tag each utterance record in the conversation log as one of the predetermined task categories.
- the predetermined task classifications may be, for example, each of the sub classifications belonging to one service domain.
- the conversation flow management model building/updating unit 360 can classify and tag each utterance record with any one of, for example, the sub-task categories of product inquiry, brand inquiry, design inquiry, price inquiry, and return inquiry.
- the conversation flow management model building/updating unit 360 may select in advance relevant keywords for each of the sub-task categories and, based on the selected keywords, classify each utterance record into a sub-task category.
- the conversation flow management model building/updating unit 360 can group together the utterance data classified and tagged with the same one of the plurality of task categories.
- the utterance record groups grouped into the same category may be included in the conversation flow management model as the dialog patterns of that category.
- the dialogue flow management model construction / update unit 360 can analyze the probabilistic distribution of the time series sequential between the respective lower task categories from the dialogue logs.
- for example, where the sub-task categories of a product-purchase service domain are product inquiry, brand inquiry, design inquiry, price inquiry, and return inquiry, the first utterance of a conversation may be a product inquiry with 70% probability, a brand inquiry with 20% probability, a design inquiry with 5% probability, a price inquiry with 3% probability, and a return inquiry with 2% probability;
- after a product inquiry, a price inquiry may follow with 13% probability and a return inquiry with 1% probability, and so on;
- each of the sub-task categories can thus be layered with the probability distribution of these sequential flows.
- the dialogue flow management model construction/update unit 360 may construct the sequential flow between the sub-task categories on a service domain, for example, in the form of a probability graph. According to an embodiment of the present invention, the dialogue flow management model construction/update unit 360 can recursively determine the probabilistic relations of the sequential flow between the respective sub-task categories and thereby configure a layered sequential flow.
- the dialogue flow management model construction/update unit 360 can delete, from the analysis result of the probability distribution of the time-series sequence between the sub-task categories, any flow whose probability is below a threshold. For example, if the threshold is 2% and, in the product-purchase service domain, the probability that a return inquiry follows a product inquiry is 1%, the flow from product inquiry to return inquiry is deleted from the dialogue flow management model.
- the interactive AI agent system may be based on a client-server model between the user terminal 102 and the interactive AI agent server 106, in particular a so-called "thin client-server model" in which the user terminal provides only user input and output and delegates all other functions of the interactive AI agent system to the server; however, the present invention is not limited thereto.
- the interactive AI agent system may be implemented as a distributed application between the user terminal and the server, or as a stand-alone application installed on the user terminal.
- the functions of the interactive AI agent system may be distributed between the user terminal and the server in various ways according to embodiments of the present invention.
- the distribution of each function of the interactive AI agent system between the client and the server described herein is merely an example; it should be understood that the invention may be embodied otherwise.
- although a specific module has been described as performing certain operations for convenience, the present invention is not limited thereto; according to another embodiment of the present invention, operations described as being performed by any particular module may instead be performed by separate and distinct modules.
- FIG. 4 is an exemplary flow diagram of operations performed by the STT assistance module of FIG. 3, in accordance with an embodiment of the present invention.
- the STT assistance module 320 may receive a user's speech input including a natural language input composed of one or more words.
- the natural language input may be a voice input, e.g., received via the microphone of the user terminal 102a-102n and transmitted via the communication module 310.
- the STT assistance module 320 transmits the voice input of the user received in step 402 to the external STT server 110.
- the voice input may be in a voice file (e.g., wave file) or streaming format.
- the STT assistance module 320 may transmit, together with the user's voice input, information about that input (e.g., its file format, encoding format, and the like) and a phrase hint.
- the phrase hint may be a specific word or phrase provided as information that aids processing of the given audio.
- the STT assistance module 320 may receive, from the external STT server 110, at least one text data item corresponding to the transmitted voice file.
- the at least one text data item may include a score (probability) given by the external STT server.
- the STT assistance module 320 may evaluate the conversion accuracy for each of the at least one textual data.
- the conversion accuracy may be a probability for each of the at least one text data or a relative rank for each of the at least one text data.
- the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item according to a predetermined criterion. In one embodiment of the present invention, the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item in consideration of the score given to it by the external STT server.
- the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item based on the STT conversion assistance database.
- the STT conversion assistance database may include a user database 352 that stores and manages user-specific feature data, a conversation log database 356 in which users' existing conversation logs are analyzed and stored, a dialogue understanding knowledge base in which attributes associated with the intents included in the service domain are stored, and a dialogue flow management model 358, which is a probabilistic distribution model of the sequential flow between the plurality of sub-task categories needed to provide the service in the service domain.
- the STT assistance module 320 may evaluate the conversion accuracy based on the number of occurrences of the words contained in each of the at least one text conversion result.
- the number of occurrences of words can be calculated based on the conversation log database, in which per-domain word occurrence counts are stored. For example, if the domain is "finance" and the received text candidates are "one time" and "Japan", and the occurrence count stored for that domain is 7,200 for "one time" and 10 for "Japan", the conversion accuracy of "one time" can be determined to be higher than that of "Japan".
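The domain-frequency ranking in the example above can be sketched as follows. The dictionary contents, function names, and the assumption that candidates are compared by a simple count lookup are all illustrative, not actual database contents from the patent.

```python
# Hypothetical sketch of ranking STT candidates by per-domain word
# occurrence counts, as in the "finance" example. The counts are
# assumed values, not real database contents.

DOMAIN_WORD_COUNTS = {
    "finance": {"one time": 7200, "Japan": 10},
}

def score_candidate(domain: str, text: str) -> int:
    """Occurrence count of the candidate text in the given domain's logs."""
    return DOMAIN_WORD_COUNTS.get(domain, {}).get(text, 0)

def rank_candidates(domain: str, candidates):
    """Order candidate transcriptions from most to least frequent in domain."""
    return sorted(candidates, key=lambda t: score_candidate(domain, t), reverse=True)
```

Under these assumed counts, "one time" (7,200 occurrences) outranks "Japan" (10 occurrences) in the "finance" domain, reproducing the example's conclusion.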
- the STT assistance module 320 may evaluate the conversion accuracy based on the similarity between the sentences included in each of the at least one text conversion result and the sentences stored in the STT conversion assistance database.
- various methods can be used to calculate the similarity between sentences, including statistical methods that construct a vector from the frequency of each word in a sentence and compute the cosine similarity between the vectors, and semantic methods such as similarity based on WordNet distance.
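The statistical variant just mentioned, cosine similarity over word-frequency vectors, can be sketched directly; the tokenization by whitespace is a simplifying assumption.

```python
# Cosine similarity between the word-frequency vectors of two sentences,
# the statistical sentence-similarity method described in the text.
from collections import Counter
from math import sqrt

def cosine_similarity(sent_a: str, sent_b: str) -> float:
    """Return a value in [0, 1]; 1.0 for identical word distributions."""
    va = Counter(sent_a.lower().split())
    vb = Counter(sent_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

A text conversion result whose sentences score higher against sentences already stored in the STT conversion assistance database would, under this method, be evaluated as more accurate.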
- the STT assistance module 320 may receive at least one converted text data item from the external STT server 110 via the communication module 310, determine the user intent corresponding to the user's natural language input based on a predetermined knowledge model prepared in advance, and evaluate the conversion accuracy based on the determined intent.
- when determining the user's intent, the STT assistance module 320 may map the received text input to one or more user intents.
- the STT assistance module 320 may receive at least one converted text data item from the external STT server 110 via the communication module 310 and evaluate the conversion accuracy based on the hierarchical position of the corresponding speech input.
- the STT assistance module 320 may receive the hierarchical position information of the corresponding speech input from the dialogue flow management model construction/update module 360, which configures the sequential flow on the service domain in the form of a probability graph.
- the STT assistance module 320 outputs at least one text conversion result.
- the STT assistance module 320 may output the at least one text conversion result together with the evaluation result.
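Taken together, the evaluation signals described above (the external server's score, domain word frequency, sentence similarity, and dialogue-flow position) could be combined into one accuracy estimate before output. The following is a minimal sketch; the field names, normalization to [0, 1], and weighting scheme are assumptions for illustration, not specified by the patent.

```python
# Hypothetical sketch of combining per-candidate evaluation signals
# into a single conversion-accuracy score and outputting a ranking.
# The weights are illustrative assumptions.

def evaluate_candidates(candidates):
    """candidates: list of dicts with 'text' plus signal values in [0, 1]:
    'server_score', 'freq_score', 'similarity'. Returns candidates with
    a combined 'accuracy' value, highest first."""
    weights = {"server_score": 0.5, "freq_score": 0.3, "similarity": 0.2}
    results = []
    for cand in candidates:
        accuracy = sum(weights[k] * cand[k] for k in weights)
        results.append({"text": cand["text"], "accuracy": round(accuracy, 3)})
    results.sort(key=lambda r: r["accuracy"], reverse=True)
    return results
```

The output pairs each text conversion result with its evaluation result, as in the final step of FIG. 4.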
- a computer program according to an embodiment of the present invention may be stored in a storage medium readable by a computer processor or the like, such as a nonvolatile memory (e.g., EPROM, EEPROM, or a flash memory device), a magnetic disk (such as an internal hard disk or a removable disk), a CD-ROM disk, and the like. The program code(s) may also be implemented in assembly language or machine language. All changes and modifications that fall within the true spirit and scope of the present invention are intended to be embraced by the following claims.
Claims (9)
- 1. A method performed by a computer device, the method assisting speech-to-text conversion for an interactive AI agent system, the method comprising: constructing an STT conversion assistance database related to a predetermined service domain; receiving at least one text conversion result from an external speech-to-text (STT) server; evaluating each of the at least one text conversion result based on the STT conversion assistance database; and outputting the at least one text conversion result and the evaluation result.
- 2. The method according to claim 1, wherein the STT conversion assistance database includes at least one of: a user database that stores and manages user-specific feature data; a conversation log database in which users' existing conversation logs are analyzed and stored; a dialogue understanding knowledge base in which attributes associated with the intents included in the service domain are stored; and a dialogue flow management model storing a probabilistic distribution model of the sequential flow between a plurality of sub-task categories needed to provide the service in the service domain.
- 3. The method of claim 2, wherein the evaluating comprises evaluating in consideration of the occurrence counts, stored in the STT conversion assistance database, of the words included in each of the at least one text conversion result.
- 4. The method of claim 2, wherein the evaluating comprises evaluating based on the similarity between the sentences stored in the STT conversion assistance database and the sentences included in the at least one text conversion result.
- 5. The method of claim 2, wherein the evaluating comprises determining the user's intent based on a predetermined knowledge model prepared in advance.
- 6. The method of claim 2, wherein the evaluating comprises determining the user's hierarchical position based on a hierarchical dialogue flow management model.
- 7. A computer-readable recording medium containing one or more instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 6.
- 8. A computer apparatus configured to provide context-based speech-to-text conversion, comprising: a storage module configured to store and manage user-specific feature data, analyze and store users' existing conversation logs, store attributes associated with the intents included in a service domain, and store and manage a probabilistic distribution model of the sequential flow between a plurality of sub-task categories needed to provide the service in the service domain; a dialogue flow management model construction/update module configured to automatically analyze the conversation logs and build and/or update a dialogue flow management model according to the analysis result; and an STT assistance module configured to receive at least one text conversion result from an external STT server, evaluate each of the at least one text conversion result based on the data stored in the storage module, and output the at least one text conversion result and the evaluation result.
- 9. The computer apparatus of claim 8, wherein the computer apparatus comprises a user terminal or a server communicatively coupled to the user terminal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170159269A KR101970899B1 (en) | 2017-11-27 | 2017-11-27 | Method and computer device for providing improved speech-to-text based on context, and computer readable recording medium |
KR10-2017-0159269 | 2017-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019103569A1 true WO2019103569A1 (en) | 2019-05-31 |
Family
ID=66282142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2018/014680 WO2019103569A1 (en) | 2017-11-27 | 2018-11-27 | Method for improving performance of voice recognition on basis of context, computer apparatus, and computer-readable recording medium |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR101970899B1 (en) |
WO (1) | WO2019103569A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020218659A1 (en) | 2019-04-26 | 2020-10-29 | (주)아크릴 | Automated query answering device for insurance product sales utilizing artificial neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140062656A (en) * | 2012-11-14 | 2014-05-26 | 한국전자통신연구원 | Spoken dialog management system based on dual dialog management using hierarchical dialog task library |
KR20140111538A (en) * | 2013-03-11 | 2014-09-19 | 삼성전자주식회사 | Interactive sever, display apparatus and control method thereof |
KR20160060335A (en) * | 2014-11-20 | 2016-05-30 | 에스케이텔레콤 주식회사 | Apparatus and method for separating of dialogue |
WO2016151698A1 (en) * | 2015-03-20 | 2016-09-29 | 株式会社 東芝 | Dialog device, method and program |
KR20170088164A (en) * | 2016-01-22 | 2017-08-01 | 한국전자통신연구원 | Self-learning based dialogue apparatus for incremental dialogue knowledge, and method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20160031231A (en) | 2014-09-12 | 2016-03-22 | 엘지전자 주식회사 | An outdoor unit for a an air conditioner |
US9836452B2 (en) | 2014-12-30 | 2017-12-05 | Microsoft Technology Licensing, Llc | Discriminating ambiguous expressions to enhance user experience |
- 2017-11-27 KR KR1020170159269A patent/KR101970899B1/en active IP Right Grant
- 2018-11-27 WO PCT/KR2018/014680 patent/WO2019103569A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
KR101970899B1 (en) | 2019-04-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18881839 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18881839 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.01.2021) |
|