WO2019103569A1 - Method for improving performance of voice recognition on basis of context, computer apparatus, and computer-readable recording medium - Google Patents
- Publication number
- WO2019103569A1 (PCT/KR2018/014680)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- present
- user
- stt
- text
- module
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Definitions
- the present invention relates to an interactive AI agent system, and more particularly to a method for improving the performance of speech recognition based on context.
- Korean Patent Laid-Open Publication No. 10-2013-0031231 discloses a technology for presenting a user with a plurality of text conversion results for a voice input so that the user can directly select the accurate text conversion result.
- Korean Patent Laid-Open Publication No. 10-2017-0099917 discloses a technology for proposing a plurality of responses, based on context information, for each of a plurality of text conversion results for a speech input.
- one service provider may provide the interactive AI agent system as a whole, while some functions are serviced through an external, specialized server.
- a function of converting a user's voice into text can be provided in the form of an API, a representative example being the Google Speech API.
- STT (Speech-To-Text)
- when such a service is received from an external STT server, the system transmits a voice input, or transmits the voice file together with its file format and a syntax hint, and receives at least one text conversion value associated with the transmitted voice input.
- a syntax hint is information that aids in the processing of a given audio, and may be a specific word or phrase.
- the external STT server can improve the accuracy of voice recognition of the transmitted voice file by using the syntax hint.
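- as an illustrative sketch only (the field names such as `speech_contexts` and `phrases` are assumptions loosely modeled on request shapes used by services like the Google Speech API, not the actual schema), a request carrying a voice file, its format, and syntax hints might be packaged as follows:

```python
import base64

def build_stt_request(audio_bytes, encoding, sample_rate_hz, hint_phrases):
    """Package a voice input, its file format, and syntax hints into one
    request payload.  Field names are illustrative only; a real STT service
    defines its own schema, but the same information is carried."""
    return {
        "config": {
            "encoding": encoding,                 # audio file format
            "sample_rate_hertz": sample_rate_hz,  # sampling rate of the file
            # syntax hints: words/phrases likely to occur in this audio
            "speech_contexts": [{"phrases": list(hint_phrases)}],
        },
        # audio content is commonly sent base64-encoded
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }

request = build_stt_request(b"\x00\x01\x02", "LINEAR16", 16000,
                            ["return inquiry", "brand inquiry"])
```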
- an interactive AI agent system receives speech in free-speech form and provides services based on the contexts of various domains.
- the interactive AI agent system builds a hierarchical conversation flow management model containing sufficient dialog management knowledge, for example sequential conversation flow patterns for providing the corresponding service, and manages it so as to provide appropriate information when converting recognized speech into text.
- an interactive AI agent system that can more easily grasp the user's intention based on accurate user speech recognition and provide an appropriate response can be provided.
- FIG. 1 is a schematic diagram of a system environment in which an interactive AI agent system may be implemented, according to one embodiment of the present invention.
- FIG. 2 is a functional block diagram that schematically illustrates the functional configuration of the user terminal 102 of FIG. 1, in accordance with one embodiment of the present invention.
- FIG. 3 is a functional block diagram that schematically illustrates the functional configuration of the interactive AI agent server 106 of FIG. 1, according to one embodiment of the present invention.
- FIG. 4 is an exemplary operational flow diagram performed by the STT auxiliary module of FIG. 3, in accordance with an embodiment of the present invention.
- " module " or " module " means a functional part that performs at least one function or operation, and may be implemented by hardware or software or a combination of hardware and software. Also, a plurality of "modules” or “sub-modules” may be integrated into at least one software module and implemented by at least one processor, except for "module” or “sub-module” have.
- the 'interactive AI agent system' is an information processing system that receives natural-language input from a user (e.g., commands, statements, requests, questions, etc.) through conversational interaction via natural language in voice and/or text form, determines the user's intent, and performs the necessary operations based on that intent, i.e., provides an appropriate conversational response and/or performs a task; it is not limited to any particular form of information processing system.
- the interactive AI agent system may be for providing a predetermined service, wherein the service may comprise a plurality of sub-task categories (e.g., product inquiries, brand inquiries, design inquiries, price inquiries, return inquiries, etc.).
- the operations performed by the 'interactive AI agent system' include, for example, providing an interactive response and/or performing a task, each of which is performed according to the user's intention within a sequential flow of sub-task categories.
- the interactive response provided by the 'interactive AI agent system' may be in various visual, auditory, and/or tactile forms (e.g., voice, sound, text, video, image, symbol, emoticon, hyperlink, animation, various notices, motion, haptic feedback, and the like).
- the task performed by the 'interactive AI agent system' may include various kinds of tasks such as, for example, searching for information, proceeding with the purchase of goods, composing a message, composing an email, dialing, playing music, photographing, and destination/navigation services (including, but not limited to, these examples).
- the 'interactive AI agent system' may include a chatbot system based on a messenger platform, for example a chatbot system that exchanges messages with a user on a messenger and provides various information desired by the user, but it should be understood that the present invention is not limited thereto.
- FIG. 1 is a schematic diagram of a system environment 100 in which an interactive AI agent system may be implemented, in accordance with one embodiment of the present invention.
- the system environment 100 includes a plurality of user terminals 102a-102n, a communication network 104, an interactive AI agent server 106, an external service server 108, and an external STT server 110.
- each of the plurality of user terminals 102a-102n may be any user electronic device having wired or wireless communication capability.
- each of the user terminals 102a-102n may be any of a variety of wired or wireless communication terminals, including, for example, a smart speaker, a music player, a game console, a digital TV, a set-top box, a smart phone, a tablet PC, a desktop, and a laptop; it is to be understood that the invention is not limited to any particular form.
- each of the user terminals 102a-102n can communicate with the interactive AI agent server 106 via the communication network 104, that is, send and receive necessary information.
- each of the user terminals 102a-102n can communicate with the external service server 108 through the communication network 104, that is, send and receive necessary information.
- each of the user terminals 102a-102n may receive user input in the form of voice and/or text from the outside, and may provide the user with an operation result corresponding to that input (e.g., providing a specific conversation response and/or performing a specific task) obtained through communication with the interactive AI agent server 106 and/or the external service server 108 via the communication network 104 (and/or through processing within the user terminals 102a-102n).
- a conversation response provided by the user terminals 102a-102n as a result of an operation corresponding to a user input may be provided, for example, in accordance with the conversation flow pattern of the sub-task category corresponding to the user input at that point in a sequential flow of sub-task categories.
- each of the user terminals 102a-102n may provide a conversation response, as a result of an operation corresponding to a user input, in various visual, audible, and/or tactile forms (e.g., voice, sound, text, video, images, symbols, emoticons, hyperlinks, animations, various notices, motion, haptic feedback, and the like).
- task execution as an operation corresponding to a user input may include performing various kinds of tasks such as, for example, searching for information, proceeding with the purchase of goods, composing a message, composing an email, dialing, playing music, photographing, and destination/navigation services.
- the communication network 104 may include any wired or wireless communication network, e.g., a TCP / IP communication network.
- the communication network 104 may include, for example, a Wi-Fi network, a LAN network, a WAN network, an Internet network, and the like, and the present invention is not limited thereto.
- the communication network 104 may be implemented using any of various wired or wireless communication protocols such as Ethernet, GSM, EDGE, CDMA, TDMA, OFDM, Bluetooth, VoIP, and Wi-Fi.
- the interactive AI agent server 106 may communicate with the user terminals 102a-102n via the communication network 104.
- the interactive AI agent server 106 sends and receives necessary information to and from the user terminals 102a-102n via the communication network 104, and can provide the user with an operation result corresponding to the user input, i.e., matching the user's intention.
- the interactive AI agent server 106 receives user natural-language input in voice form from the user terminals 102a-102n, for example via the communication network 104, and can convert it into user natural-language input in text form. According to an embodiment of the present invention, the interactive AI agent server 106 transmits the user voice input received from the user terminals 102a-102n to the external STT server 110, and can receive from the external STT server 110 at least one text data corresponding to the user input in voice form. According to one embodiment of the present invention, the interactive AI agent server 106 receives at least one text data from the external STT server 110, performs an evaluation on each of them based on the STT conversion assistance database described below, and can output the at least one text data together with the evaluation result.
- the interactive AI agent server 106 receives user natural-language input in the form of speech and/or text from the user terminals 102a-102n, for example via the communication network 104, and can process the received natural-language input based on previously prepared models to determine the intent of the user.
- the interactive AI agent server 106 may communicate with the external service server 108 via the communication network 104, as described above.
- the external service server 108 may be, for example, a messaging service server, an online consultation center server, an online shopping mall server, an information search server, a map service server, a navigation service server, and the like.
- it should be noted that an interactive response based on the user's intent, transmitted from the interactive AI agent server 106 to the user terminals 102a-102n, may include content obtained from the external service server 108.
- although the interactive AI agent server 106 is shown as a separate physical server configured to communicate with the external service server 108 via the communication network 104, the present disclosure is not limited thereto. According to another embodiment of the present invention, the interactive AI agent server 106 may be included as part of various service servers, such as an online consultation center server or an online shopping mall server.
- the interactive AI agent server 106 collects conversation logs (which may include, e.g., a plurality of user and/or system utterance records) over various paths, automatically analyzes the collected conversation logs, and creates and/or updates a conversation flow management model based on the analysis results.
- the interactive AI agent server 106 classifies each utterance record into one of the predetermined task categories, for example through keyword analysis on the collected conversation logs, and can probabilistically analyze the sequential flow between the categories.
- the external STT server 110 receives a voice input of a user through a communication module, converts the received voice input into text data in at least one character form, and can transmit it.
- the external STT server 110 may receive the user's speech input and related syntax hints and convert the user's speech input into text data in at least one character form based thereon.
- the user terminal 102 includes a user input receiving module 202, a sensor module 204, a program memory module 206, a processing module 208, a communication module 210, and a response output module 212.
- the user input receiving module 202 may receive various forms of input from a user, for example natural-language input such as voice input and/or text input (and additionally other forms of input, such as touch input).
- the user input receiving module 202 includes, for example, a microphone and an audio circuit, and can acquire a user audio input signal through a microphone and convert the obtained signal into audio data.
- the user input receiving module 202 may include various forms of input devices, such as various pointing devices (e.g., a mouse, a joystick, and a trackball), a keyboard, a touch panel, and a touch screen, and can acquire text input and/or touch input signals entered by the user through these input devices.
- the user input received at the user input receiving module 202 may be associated with performing a predetermined task, such as executing a predetermined application or searching for certain information, but is not limited thereto.
- the user input received at the user input receiving module 202 may require only a simple conversation response, regardless of the execution of a predetermined application or retrieval of information.
- the user input received at the user input receiving module 202 may relate to a simple statement for unilateral communication.
- the sensor module 204 includes one or more sensors of different types, through which status information of the user terminal 102 (e.g., the physical state, software and/or hardware status of the user terminal 102, or information regarding the environmental conditions around the user terminal 102) can be obtained.
- the sensor module 204 may include an optical sensor, for example, and may sense the ambient light condition of the user terminal 102 through the optical sensor.
- the sensor module 204 includes, for example, a movement sensor and can detect whether the corresponding user terminal 102 is moving through the movement sensor.
- the sensor module 204 includes, for example, a velocity sensor and a GPS sensor, and through these sensors, the position and / or orientation of the corresponding user terminal 102 can be detected.
- the sensor module 204 may include other various types of sensors, including temperature sensors, image sensors, pressure sensors, touch sensors, and the like.
- the program memory module 206 may be any storage medium that stores various programs that may be executed on the user terminal 102, such as various application programs and related data.
- the program memory module 206 may include, for example, a telephone dialer application, an email application, an instant messaging application, a camera application, a music playback application, a video playback application, an image management application, and the like, and data related to the execution of these programs.
- the program memory module 206 may be configured to include various forms of volatile or non-volatile memory, such as DRAM, SRAM, DDR RAM, ROM, magnetic disk, optical disk, and the like.
- the processing module 208 may communicate with each component module of the user terminal 102 and perform various operations on the user terminal 102. According to one embodiment of the present invention, the processing module 208 can launch and execute various application programs on the program memory module 206. According to one embodiment of the present invention, the processing module 208 may receive signals from the user input receiving module 202 and the sensor module 204, if necessary, and perform appropriate processing on these signals. According to one embodiment of the present invention, the processing module 208 may, if necessary, perform appropriate processing on signals received from the outside via the communication module 210.
- the communication module 210 provides for the user terminal 102 to communicate with the interactive AI agent server 106 and/or the external service server 108 via the communication network 104 of FIG. 1.
- the communication module 210 may transmit signals received from, for example, the user input receiving module 202 and the sensor module 204, via the communication network 104 in accordance with a predetermined protocol, to the interactive AI agent server 106 and/or the external service server 108.
- the communication module 210 may receive various signals transmitted from the interactive AI agent server 106 and/or the external service server 108 via the communication network 104, e.g., a response signal including a natural-language response in voice and/or text form, or various control signals, and perform appropriate processing according to a predetermined protocol.
- the response output module 212 may output a response corresponding to a user input in various forms, such as visual, auditory, and/or tactile.
- the response output module 212 may include various display devices, such as a touch screen based on technology such as LCD, LED, OLED, or QLED, and can present visual responses corresponding to user input, such as text, symbols, video, images, hyperlinks, animations, and various notices, to the user through these display devices.
- the response output module 212 may include, for example, a speaker or a headset, and can provide an audible response corresponding to user input, e.g., a voice and/or acoustic response, to the user.
- the response output module 212 includes a motion/haptic feedback generator, through which a tactile response, e.g., motion/haptic feedback, can be provided to the user.
- the response output module 212 may simultaneously provide any combination of two or more of a text response, a voice response, and a motion / haptic feedback corresponding to a user input.
- FIG. 3 is a functional block diagram that schematically illustrates the functional configuration of the interactive AI agent server 106 of FIG. 1, according to one embodiment of the present invention.
- the interactive AI agent server 106 includes a communication module 310, a Speech-To-Text (STT) auxiliary module 320, a Natural Language Understanding (NLU) module 330, a Text-To-Speech (TTS) module 340, a storage module 350, and a conversation flow management model building/updating module 360.
- the communication module 310 provides for the interactive AI agent server 106 to communicate with the user terminal 102, the external service server 108, and/or the external STT server 110 via the communication network 104, in accordance with any wired or wireless communication protocol.
- the communication module 310 can receive voice input and / or text input from the user, transmitted from the user terminal 102 via the communication network 104.
- the communication module 310 may receive, with or without voice input and/or text input from the user, status information of the user terminal 102 transmitted from the user terminal 102 via the communication network 104.
- the status information may include various status information associated with the user terminal 102 at the time of the voice and/or text input from the user (e.g., the physical state of the user terminal 102, the software and/or hardware status of the user terminal 102, environmental status information around the user terminal 102, etc.).
- the communication module 310 may also transmit an interactive response generated in response to the user input (e.g., a natural-language response in voice and/or text form) and/or control signals to the user terminal 102 via the communication network 104.
- the STT auxiliary module 320 can receive the voice input among the user input received through the communication module 310 and transmit the received voice input to the external STT server 110. According to one embodiment of the present invention, the STT auxiliary module 320 can transmit the voice input received through the communication module 310 together with information related to that voice input to the external STT server 110. According to one embodiment of the present invention, the STT auxiliary module 320 receives through the communication module 310 at least one text data converted from the transmitted voice input, and can evaluate the conversion accuracy for each of the at least one text data based on the STT conversion assistance database of the storage module 350.
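- one simple form the evaluation above could take, sketched here under the assumption that the assistance database supplies context keywords for the current conversation (the keyword-overlap scoring is an illustrative assumption, not the disclosed algorithm):

```python
def evaluate_candidates(candidates, context_keywords):
    """Score each candidate text conversion by how many context keywords it
    contains, returning (text, score) pairs with the best fit first.
    Keyword overlap stands in for the database-driven evaluation."""
    scores = []
    for text in candidates:
        words = set(text.lower().split())
        hits = sum(1 for kw in context_keywords if kw.lower() in words)
        scores.append((text, hits))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores

# two acoustically similar candidates; context favors the first
ranked = evaluate_candidates(
    ["I want to return this item", "I want to turn this item"],
    {"return", "refund"},
)
```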
- the NLU module 330 may receive text input from the communication module 310 or the STT auxiliary module 320.
- the text input received at the NLU module 330 may be, for example, a user text input received from the user terminal 102 via the communication network 104 and the communication module 310, or at least one text conversion result (e.g., a sequence of words) received from the external STT server via the STT auxiliary module 320 for a user voice input.
- the NLU module 330 may receive, together with a text input or thereafter, status information associated with the corresponding user input, e.g., the status information of the user terminal 102 at the time of the user input.
- the status information may include various status information associated with the user terminal 102 at the time of the user input (e.g., the physical state of the user terminal 102, the software and/or hardware status of the user terminal 102, environmental status information around the user terminal 102, etc.).
- the NLU module 330 may map the received text input to one or more user intents, where a user intent can be associated with a series of operation(s) that can be understood and performed by the interactive AI agent server 106 in accordance with that intent. According to one embodiment of the present invention, the NLU module 330 may refer to the status information described above in associating the received text input with one or more user intents.
- the TTS module 340 may receive an interactive response that is generated to be transmitted to the user terminal 102.
- the interactive response received at the TTS module 340 may be a natural-language sentence or a sequence of words having a textual form.
- the TTS module 340 may convert the input of the above received text form into speech form according to various types of algorithms.
- the storage module 350 may include various databases. According to one embodiment of the present invention, the storage module 350 may include a user database 352, a conversation understanding knowledge base 354, a conversation log database 356, and a conversation flow management model 358.
- the user database 352 may be a database for storing and managing characteristic data for each user.
- the user database 352 may include various user-specific information, for example the user's previous conversation history, the user's pronunciation feature information, the user's lexical preferences, the user's location, and the like.
- the conversation understanding knowledge base 354 may include, for example, a predefined ontology model.
- an ontology model can be represented, for example, as a hierarchical structure between nodes, where each 'intention' node corresponding to a user intent is linked to 'attribute' nodes (sub-attribute nodes linked directly to the 'intention' node, or linked in turn to another 'attribute' node of that 'intention' node).
- an 'intention' node and the 'attribute' nodes directly or indirectly linked to that 'intention' node may constitute one domain, and the ontology may be composed of a set of such domains.
- the conversation understanding knowledge base 354 may be configured to include domains corresponding to all intents that the interactive AI agent system can understand and for which it can perform the corresponding actions.
- the ontology model can be dynamically changed by addition or deletion of nodes or modification of relations between nodes.
- the intention nodes and attribute nodes of each domain in the ontology model may be associated with words and / or phrases associated with corresponding user intents or attributes, respectively.
- the conversation understanding knowledge base 354 may be implemented, for example, in the form of a lexical dictionary comprising an ontology model including a hierarchy of nodes and the set of words and/or phrases associated with each node, and the STT auxiliary module 320 can determine the user's intention based on the ontology model implemented in the lexical dictionary form.
- upon receipt of a text input or a sequence of words, the STT auxiliary module 320 can determine which of the domains in the ontology model the respective words in the sequence are associated with, and can determine the corresponding domain, i.e., the user intention, based on such a determination.
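- the domain determination described above can be sketched as follows, with a hypothetical lexical dictionary standing in for the ontology model (the domain names and vocabularies are invented for illustration):

```python
# Hypothetical lexical dictionary: each domain's intention/attribute nodes
# are associated with a set of words, as in the ontology model above.
LEXICON = {
    "product_inquiry": {"product", "stock", "available"},
    "return_inquiry": {"return", "refund", "exchange"},
}

def determine_domain(word_sequence):
    """Count which domain's vocabulary the input words fall into, and take
    the best-matching domain as the user intention.  Returns None when no
    word matches any domain (simplified sketch)."""
    counts = {domain: sum(word in vocab for word in word_sequence)
              for domain, vocab in LEXICON.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```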
- the conversation log database 356 may be a database that classifies, stores, and manages conversation logs collected in any of various ways according to predetermined criteria. According to an embodiment of the present invention, the conversation log database 356 may store, for example, words, phrases, sentences, and various other forms of user input frequently used by users of the service domain, in association with their frequency of use.
- the conversation flow management model 358 may include a probabilistic distribution model for the sequential flow between the plurality of sub-task categories needed for providing a service in relation to a given service domain.
- the conversation flow management model 358 may represent, for example, the sequential flow between the sub-task categories belonging to the service domain in the form of a probability graph.
- the conversation flow management model 358 may include, for example, the probability of occurrence obtained for each of the various sequential flows that may occur between the sub-task categories.
- the conversation flow management model 358 may also include a library of dialog patterns belonging to each task category.
- each database included in the storage module 350 may, for example, reside at the user terminal 102, or be distributed across the user terminal 102 and the interactive AI agent server 106.
- the conversation flow management model building/updating module 360 automatically analyzes each conversation log, collected by any of a variety of methods and stored in the conversation log database 356, and can build and/or update the conversation flow management model.
- the conversation flow management model building/updating unit 360 can classify each utterance record into one of the predetermined sub-task categories, for example through keyword analysis on the conversation logs stored in the conversation log database 356, and group the utterance records of the same sub-task category.
- the conversation flow management model building/updating unit 360 can grasp, for example, the sequential flow between the groups, i.e., between the sub-task categories, as a probabilistic distribution.
- the conversation flow management model building/updating unit 360 can represent the sequential flow between the sub-task categories on the service domain, for example, in the form of a probability graph.
- the conversation flow management model building/updating unit 360 may determine, for example, for all sequential flows that may occur between the sub-task categories, the probability of occurrence of the flow between each pair of task categories, and thereby obtain a probabilistic distribution of each sequential flow between the sub-task categories.
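- the probabilistic-distribution step above can be sketched as follows, assuming the utterance records have already been tagged with sub-task categories (the category names are illustrative):

```python
from collections import Counter, defaultdict

def flow_probabilities(tagged_sessions):
    """Estimate, from conversation logs whose utterances are already tagged
    with a sub-task category, the probability of each sequential flow
    between categories (a sketch of the build/update step above)."""
    transitions = defaultdict(Counter)
    for session in tagged_sessions:
        # count each adjacent (previous category -> next category) pair
        for prev, nxt in zip(session, session[1:]):
            transitions[prev][nxt] += 1
    # normalize counts into per-category probability distributions
    return {
        prev: {nxt: count / sum(counter.values())
               for nxt, count in counter.items()}
        for prev, counter in transitions.items()
    }

model = flow_probabilities([
    ["product", "brand"], ["product", "price"],
    ["product", "brand"], ["product", "brand"],
])
```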
- the conversation flow management model building/updating unit 360 performs keyword analysis on the conversation logs collected in any of various ways, and can classify and tag each utterance record in the conversation log as one of the predetermined task categories.
- the predetermined task classifications may be, for example, each of the sub classifications belonging to one service domain.
- the conversation flow management model building/updating unit 360 can classify and tag each utterance record with any one of, for example, the sub-task categories of product inquiry, brand inquiry, design inquiry, price inquiry, and return inquiry.
- the conversation flow management model building/updating unit 360 may select in advance relevant keywords for each of the sub-task categories and, based on the selected keywords, classify each utterance record into a sub-task category.
- the conversation flow management model building/updating unit 360 can group together the utterance data classified and tagged with the same one of the plurality of task categories.
- the utterance record groups grouped into the same category may be included in the conversation flow management model as the dialog patterns of that category.
- the dialogue flow management model construction / update unit 360 can analyze the probabilistic distribution of the time series sequential between the respective lower task categories from the dialogue logs.
- for example, where the sub-task categories of a product-purchase service domain are product inquiry, brand inquiry, design inquiry, price inquiry, and return inquiry, the first utterance of a conversation may be a product inquiry with 70% probability, a brand inquiry with 20% probability, a design inquiry with 5% probability, a price inquiry with 3% probability, and a return inquiry with 2% probability;
- after a product inquiry, a price inquiry may follow with 13% probability and a return inquiry with 1% probability, and so on;
- each of the sub-task categories can thus be layered with the probability distribution of these sequential flows.
- the dialogue flow management model construction/update unit 360 may construct the sequential flow between the sub-task categories on a service domain, for example, in the form of a probability graph. According to an embodiment of the present invention, the dialogue flow management model construction/update unit 360 can recursively determine the probabilistic relations of the sequential flow between the respective sub-task categories and thereby configure a layered sequential flow.
- the dialogue flow management model construction/update unit 360 can delete, from the analysis result of the probability distribution of the time-series sequence between the sub-task categories, any flow whose probability is below a threshold. For example, if the threshold is 2% and, in the product-purchase service domain, the probability that a return inquiry follows a product inquiry is 1%, the flow from product inquiry to return inquiry is deleted from the dialogue flow management model.
- the interactive AI agent system may be based on a client-server model between the user terminal 102 and the interactive AI agent server 106, in particular a so-called "thin client-server model" in which the user terminal provides only user input and output and delegates all other functions of the interactive AI agent system to the server; however, the present invention is not limited thereto.
- the interactive AI agent system may be implemented as a distributed application between the user terminal and the server, or as a stand-alone application installed on the user terminal.
- the functions of the interactive AI agent system may be distributed between the user terminal and the server in various ways according to embodiments of the present invention.
- the distribution of each function of the interactive AI agent system between the client and the server described herein is merely an example; it should be understood that the invention may be embodied otherwise.
- although a specific module has been described as performing certain operations for convenience, the present invention is not limited thereto; according to another embodiment of the present invention, operations described as being performed by any particular module may instead be performed by separate and distinct modules.
- FIG. 4 is an exemplary flow diagram of operations performed by the STT assistance module of FIG. 3, in accordance with an embodiment of the present invention.
- the STT assistance module 320 may receive a user's speech input including a natural language input composed of one or more words.
- the natural language input may be a voice input, e.g., received via the microphone of the user terminal 102a-102n and transmitted via the communication module 310.
- the STT assistance module 320 transmits the voice input of the user received in step 402 to the external STT server 110.
- the voice input may be in a voice file (e.g., wave file) or streaming format.
- the STT assistance module 320 may transmit, together with the user's voice input, information about that input (e.g., its file format, encoding format, and the like) and a phrase hint.
- the phrase hint may be a specific word or phrase provided as information that aids processing of the given audio.
- the STT assistance module 320 may receive, from the external STT server 110, at least one text data item corresponding to the transmitted voice file.
- the at least one text data item may include a score (probability) given by the external STT server.
- the STT assistance module 320 may evaluate the conversion accuracy for each of the at least one textual data.
- the conversion accuracy may be a probability for each of the at least one text data or a relative rank for each of the at least one text data.
- the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item according to a predetermined criterion. In one embodiment of the present invention, the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item in consideration of the score given to it by the external STT server.
- the STT assistance module 320 may evaluate the conversion accuracy of each of the at least one text data item based on the STT conversion assistance database.
- the STT conversion assistance database may include a user database 352 that stores and manages user-specific feature data, a conversation log database 356 in which users' existing conversation logs are analyzed and stored, a dialogue understanding knowledge base in which attributes associated with the intents included in the service domain are stored, and a dialogue flow management model 358, which is a probabilistic distribution model of the sequential flow between the plurality of sub-task categories needed to provide the service in the service domain.
- the STT assistance module 320 may evaluate the conversion accuracy based on the number of occurrences of the words contained in each of the at least one text conversion result.
- the number of occurrences of words can be calculated based on the conversation log database, in which per-domain word occurrence counts are stored. For example, if the domain is "finance" and the received text candidates are "one time" and "Japan", and the occurrence count stored for that domain is 7,200 for "one time" and 10 for "Japan", the conversion accuracy of "one time" can be determined to be higher than that of "Japan".
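The domain-frequency ranking in the example above can be sketched as follows. The dictionary contents, function names, and the assumption that candidates are compared by a simple count lookup are all illustrative, not actual database contents from the patent.

```python
# Hypothetical sketch of ranking STT candidates by per-domain word
# occurrence counts, as in the "finance" example. The counts are
# assumed values, not real database contents.

DOMAIN_WORD_COUNTS = {
    "finance": {"one time": 7200, "Japan": 10},
}

def score_candidate(domain: str, text: str) -> int:
    """Occurrence count of the candidate text in the given domain's logs."""
    return DOMAIN_WORD_COUNTS.get(domain, {}).get(text, 0)

def rank_candidates(domain: str, candidates):
    """Order candidate transcriptions from most to least frequent in domain."""
    return sorted(candidates, key=lambda t: score_candidate(domain, t), reverse=True)
```

Under these assumed counts, "one time" (7,200 occurrences) outranks "Japan" (10 occurrences) in the "finance" domain, reproducing the example's conclusion.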
- the STT assistance module 320 may evaluate the conversion accuracy based on the similarity between the sentences included in each of the at least one text conversion result and the sentences stored in the STT conversion assistance database.
- various methods can be used to calculate the similarity between sentences, including statistical methods that construct a vector from the frequency of each word in a sentence and compute the cosine similarity between the vectors, and semantic methods such as similarity based on WordNet distance.
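The statistical variant just mentioned, cosine similarity over word-frequency vectors, can be sketched directly; the tokenization by whitespace is a simplifying assumption.

```python
# Cosine similarity between the word-frequency vectors of two sentences,
# the statistical sentence-similarity method described in the text.
from collections import Counter
from math import sqrt

def cosine_similarity(sent_a: str, sent_b: str) -> float:
    """Return a value in [0, 1]; 1.0 for identical word distributions."""
    va = Counter(sent_a.lower().split())
    vb = Counter(sent_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

A text conversion result whose sentences score higher against sentences already stored in the STT conversion assistance database would, under this method, be evaluated as more accurate.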
- the STT assistance module 320 may receive at least one converted text data item from the external STT server 110 via the communication module 310, determine the user intent corresponding to the user's natural language input based on a predetermined knowledge model prepared in advance, and evaluate the conversion accuracy based on the determined intent.
- when determining the user's intent, the STT assistance module 320 may map the received text input to one or more user intents.
- the STT assistance module 320 may receive at least one converted text data item from the external STT server 110 via the communication module 310 and evaluate the conversion accuracy based on the hierarchical position of the corresponding speech input.
- the STT assistance module 320 may receive the hierarchical position information of the corresponding speech input from the dialogue flow management model construction/update module 360, which configures the sequential flow on the service domain in the form of a probability graph.
- the STT assistance module 320 outputs at least one text conversion result.
- the STT assistance module 320 may output the at least one text conversion result together with the evaluation result.
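Taken together, the evaluation signals described above (the external server's score, domain word frequency, sentence similarity, and dialogue-flow position) could be combined into one accuracy estimate before output. The following is a minimal sketch; the field names, normalization to [0, 1], and weighting scheme are assumptions for illustration, not specified by the patent.

```python
# Hypothetical sketch of combining per-candidate evaluation signals
# into a single conversion-accuracy score and outputting a ranking.
# The weights are illustrative assumptions.

def evaluate_candidates(candidates):
    """candidates: list of dicts with 'text' plus signal values in [0, 1]:
    'server_score', 'freq_score', 'similarity'. Returns candidates with
    a combined 'accuracy' value, highest first."""
    weights = {"server_score": 0.5, "freq_score": 0.3, "similarity": 0.2}
    results = []
    for cand in candidates:
        accuracy = sum(weights[k] * cand[k] for k in weights)
        results.append({"text": cand["text"], "accuracy": round(accuracy, 3)})
    results.sort(key=lambda r: r["accuracy"], reverse=True)
    return results
```

The output pairs each text conversion result with its evaluation result, as in the final step of FIG. 4.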
- a computer program according to an embodiment of the present invention may be stored in a storage medium readable by a computer processor or the like, such as a nonvolatile memory (e.g., EPROM, EEPROM, or a flash memory device), a magnetic disk (such as an internal hard disk or a removable disk), a CD-ROM disk, and the like. The program code(s) may also be implemented in assembly language or machine language. All changes and modifications that fall within the true spirit and scope of the present invention are intended to be embraced by the following claims.
Claims (9)
- 1. A method performed by a computer device, the method assisting speech-to-text conversion for an interactive AI agent system, the method comprising: constructing an STT conversion assistance database related to a predetermined service domain; receiving at least one text conversion result from an external speech-to-text (STT) server; evaluating each of the at least one text conversion result based on the STT conversion assistance database; and outputting the at least one text conversion result and the evaluation result.
- 2. The method according to claim 1, wherein the STT conversion assistance database includes at least one of: a user database that stores and manages user-specific feature data; a conversation log database in which users' existing conversation logs are analyzed and stored; a dialogue understanding knowledge base in which attributes associated with the intents included in the service domain are stored; and a dialogue flow management model storing a probabilistic distribution model of the sequential flow between a plurality of sub-task categories needed to provide the service in the service domain.
- 3. The method of claim 2, wherein the evaluating comprises evaluating in consideration of the occurrence counts, stored in the STT conversion assistance database, of the words included in each of the at least one text conversion result.
- 4. The method of claim 2, wherein the evaluating comprises evaluating based on the similarity between the sentences stored in the STT conversion assistance database and the sentences included in the at least one text conversion result.
- 5. The method of claim 2, wherein the evaluating comprises determining the user's intent based on a predetermined knowledge model prepared in advance.
- 6. The method of claim 2, wherein the evaluating comprises determining the user's hierarchical position based on a hierarchical dialogue flow management model.
- 7. A computer-readable recording medium containing one or more instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 6.
- 8. A computer apparatus configured to provide context-based speech-to-text conversion, comprising: a storage module configured to store and manage user-specific feature data, analyze and store users' existing conversation logs, store attributes associated with the intents included in a service domain, and store and manage a probabilistic distribution model of the sequential flow between a plurality of sub-task categories needed to provide the service in the service domain; a dialogue flow management model construction/update module configured to automatically analyze the conversation logs and build and/or update a dialogue flow management model according to the analysis result; and an STT assistance module configured to receive at least one text conversion result from an external STT server, evaluate each of the at least one text conversion result based on the data stored in the storage module, and output the at least one text conversion result and the evaluation result.
- 9. The computer apparatus of claim 8, wherein the computer apparatus comprises a user terminal or a server communicatively coupled to the user terminal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170159269A KR101970899B1 (en) | 2017-11-27 | 2017-11-27 | Method and computer device for providing improved speech-to-text based on context, and computer readable recording medium |
KR10-2017-0159269 | 2017-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019103569A1 true WO2019103569A1 (en) | 2019-05-31 |
Family
ID=66282142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2018/014680 WO2019103569A1 (en) | 2017-11-27 | 2018-11-27 | Method for improving performance of voice recognition on basis of context, computer apparatus, and computer-readable recording medium |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR101970899B1 (en) |
WO (1) | WO2019103569A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020218659A1 (en) | 2019-04-26 | 2020-10-29 | (주)아크릴 | Automated query answering device for insurance product sales utilizing artificial neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140062656A (en) * | 2012-11-14 | 2014-05-26 | 한국전자통신연구원 | Spoken dialog management system based on dual dialog management using hierarchical dialog task library |
KR20140111538A (en) * | 2013-03-11 | 2014-09-19 | 삼성전자주식회사 | Interactive sever, display apparatus and control method thereof |
KR20160060335A (en) * | 2014-11-20 | 2016-05-30 | 에스케이텔레콤 주식회사 | Apparatus and method for separating of dialogue |
WO2016151698A1 (en) * | 2015-03-20 | 2016-09-29 | 株式会社 東芝 | Dialog device, method and program |
KR20170088164A (en) * | 2016-01-22 | 2017-08-01 | 한국전자통신연구원 | Self-learning based dialogue apparatus for incremental dialogue knowledge, and method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20160031231A (en) | 2014-09-12 | 2016-03-22 | 엘지전자 주식회사 | An outdoor unit for a an air conditioner |
US9836452B2 (en) | 2014-12-30 | 2017-12-05 | Microsoft Technology Licensing, Llc | Discriminating ambiguous expressions to enhance user experience |
- 2017-11-27 KR KR1020170159269A patent/KR101970899B1/en active IP Right Grant
- 2018-11-27 WO PCT/KR2018/014680 patent/WO2019103569A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
KR101970899B1 (en) | 2019-04-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18881839 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18881839 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 22.01.2021) |
|