TWI734085B - Dialogue system using intention detection ensemble learning and method thereof
- Publication number: TWI734085B
- Authority: TW (Taiwan)
Abstract
Description
The present invention relates to intent recognition technology in dialogue systems and, more specifically, to a dialogue system using semi-supervised intent detection ensemble learning and a method thereof.
In recent years, the concept of conversation as a platform has gained traction among major technology companies. Human-computer interaction is shifting from graphical interfaces to conversational interfaces, so that a wide range of human tasks can be carried out as interactive services described through dialogue, and natural language has become a key factor in the human-computer interface. Using a chatbot to answer most simple questions automatically lets customer service staff concentrate on harder problems and substantially reduces the labor cost of text-based customer service. A chatbot can also reply to users in real time, serve more users simultaneously, and operate 24 hours a day, year-round. To keep the chatbot from giving irrelevant answers, which would reduce users' willingness to use it, improving its recognition rate for the semantic intent of user questions is therefore the top priority.
Prior research shows that a task-oriented dialogue system must analyze the intent of a question and return the corresponding answer defined in a knowledge base. From the user's perspective, the more functions (intents) supported the better, and the fewer dialogue turns needed to find an answer the better. Conventional techniques mostly build the intent recognizer with hand-crafted rules or machine learning classifiers. As the number of intent types grows, the intent recognition task becomes harder, largely because of insufficient labeled data and insufficient model input features. How to make extensive use of unlabeled data and increase the diversity of input features so as to provide accurate intent recognition has therefore become a pressing problem for task-oriented dialogue systems. Moreover, given rapidly changing task requirements, fast and accurate intent updating is an indispensable capability for the next stage of chatbots.
It follows that a technique for raising the intent recognition rate of a dialogue system, in particular one that improves intent inference using current data and existing data and that further updates its decision criteria through a feedback mechanism to improve subsequent recognition results, is a solution that those skilled in the art are eager to obtain.
The object of the present invention is to establish a mechanism for accurate intent recognition and fast intent updating. The semi-supervised dialogue topic and intent ensemble recognition method proposed herein improves the performance of the dialogue system and reduces the number of dialogue turns a user needs to find an answer.
To achieve the above and other objects, the present invention provides a dialogue system using intent detection ensemble learning, comprising: a text input receiving module for receiving text content; a text preprocessing module that receives the text content and defines it as unlabeled data, and performs text preprocessing on the unlabeled data and existing labeled data to convert words into vector representations; a semi-supervised dialogue topic module having a semi-supervised dialogue topic model for producing a dialogue topic distribution, wherein the semi-supervised dialogue topic model is built from training data composed of the unlabeled-data words and vectors and the labeled-data words and vectors from the text preprocessing module; an intent recognition ensemble learning module having a sample and feature selector, intent recognizers, and an intent recognition integrator, wherein the sample and feature selector performs sample and feature selection on the training data to form multiple training data subsets, each training data subset is used to train a corresponding intent recognizer, and the intent recognition integrator combines the intent decisions of the multiple intent recognizers to produce a final intent classification decision; a knowledge base search module that queries a database using the final intent classification decision to obtain system reply content; a system reply module that sends the system reply content and receives reply data indicating whether the system reply content is correct; and a system labeling module that, when the reply data indicates that the system reply content is incorrect, labels the text content with the correct intent category and feeds it to the text preprocessing module to generate new word vectors, whereby the semi-supervised dialogue topic model is updated to optimize the output topic probability distribution and the intent category predicted by the intent recognition ensemble learning module is readjusted.
In one embodiment, the text preprocessing module further comprises: a sentence normalization unit for filtering out specific symbols or languages and performing encoding conversion; a word segmentation unit for splitting the text content into words; and a word vectorization unit for converting the segmented words into vector representations.
In another embodiment, the intent recognition ensemble learning module performs ensemble learning using bootstrap aggregating (Bagging), Boosting, or a combination thereof, so as to reduce model variance and model error.
In yet another embodiment, the intent recognition ensemble learning module further evaluates the final intent classification provided by the intent recognition integrator against the true intent category of the text content, feeds misclassified training samples back to the sample and feature selector to increase the weight with which those samples are selected, and trains iteratively until the accuracy reaches a threshold.
In addition, the system reply content includes an answer to the user's question, a continuation of the dialogue, or a request for user confirmation.
In a further embodiment, the system labeling module uses the semi-supervised dialogue topic module to find similar sentences that share the topic meaning of the erroneous data and, either to assist labeling or by means of a nearest-neighbor algorithm, finds the k closest labeled data items in the dialogue topic space, excludes the intent category that the intent recognition ensemble learning module originally predicted for the erroneous data, and labels the erroneous data with the intent category receiving the most votes.
In addition, the system labeling module labels the text content with the correct intent category and feeds it to the text preprocessing module, where it becomes part of the existing labeled data.
The present invention further provides a dialogue method using semi-supervised intent detection ensemble learning, comprising: receiving text content; defining the text content as unlabeled data, and performing text preprocessing on the unlabeled data and existing labeled data to convert words into vector representations; feeding the unlabeled data and the labeled data into a semi-supervised dialogue topic model to output a dialogue topic distribution; taking the dialogue topic distribution together with the dialogue text content as input, and combining multiple intent decisions with a method that reinforces learning on misclassified samples, to produce a final intent classification decision; determining system reply content according to the final intent classification; sending the system reply content to the user and receiving the user's feedback data to judge the correctness of the reply; and labeling and updating the text content and the feedback data accordingly, then importing the labeled data into the semi-supervised dialogue topic model for updating and learning.
In one embodiment, the text preprocessing includes sentence normalization, word segmentation, and word vectorization.
In another embodiment, producing the final intent classification decision includes performing sample and feature selection on the unlabeled-data words and vectors and the labeled-data words and vectors to form multiple training data subsets, training on each training data subset, and then combining the resulting multiple intent decisions.
In yet another embodiment, the method of reinforcing learning on misclassified samples includes evaluating the final intent classification against the true intent category of the text content, feeding back misclassified training samples to increase the weight with which those samples are selected, and training iteratively until the accuracy reaches a threshold.
In addition, the system reply content includes an answer to the user's question, a continuation of the dialogue, or a request for user confirmation.
In a further embodiment, labeling and updating the text content and the feedback data means using the semi-supervised dialogue topic model to find similar sentences that share the topic meaning of the erroneous data and, either to assist labeling or by means of a nearest-neighbor algorithm, finding the k closest labeled data items in the dialogue topic space, excluding the intent category originally predicted for the erroneous data, and labeling the erroneous data with the intent category receiving the most votes.
In addition, labeling and updating the text content and the feedback data includes labeling the text content with the correct intent category so that it becomes part of the existing labeled data.
In summary, the dialogue system using intent detection ensemble learning and the method thereof proposed by the present invention improve intent recognition and the knowledge base update mechanism by using a semi-supervised dialogue topic model. Here, semi-supervised means that a large amount of unlabeled dialogue text is used to build the dialogue topic model, while labeled data guides the unlabeled data in analyzing the topic meaning implicit in dialogue sentences, which yields more meaningful clustering results. The user's intent can thus be recognized and the dialogue handled more accurately, and intents reported through user feedback can be updated quickly, reducing the number of dialogue turns and improving the performance of the dialogue system.
1‧‧‧Dialogue system using intent detection ensemble learning
11‧‧‧Text input receiving module
12‧‧‧Text preprocessing module
1201‧‧‧Sentence normalization unit
1202‧‧‧Word segmentation unit
1203‧‧‧Word vectorization unit
121‧‧‧Unlabeled data
122‧‧‧Labeled data
123‧‧‧Sentence normalization
124‧‧‧Word segmentation and stop-word removal
125‧‧‧Word vectorization
126‧‧‧Unlabeled-data words and vectors
127‧‧‧Labeled-data words and vectors
13‧‧‧Semi-supervised dialogue topic module
131‧‧‧Bag-of-words model
132‧‧‧TF-IDF model
133‧‧‧Semi-supervised LDA dialogue topic model
134‧‧‧Dialogue topic distribution
14‧‧‧Intent recognition ensemble learning module
141‧‧‧Sample and feature selector
142‧‧‧Intent recognizer
143‧‧‧Intent recognition integrator
15‧‧‧Knowledge base search module
16‧‧‧System reply module
17‧‧‧System labeling module
S61~S67‧‧‧Steps
Figure 1 is a system architecture diagram of the dialogue system using intent detection ensemble learning of the present invention; Figure 2 is an architecture diagram of the text preprocessing module in the system; Figure 3 is an execution flowchart of the text preprocessing module in the system; Figure 4 is an execution flowchart of the intent recognition ensemble learning module in the system; Figure 5 is an execution flowchart of the semi-supervised dialogue topic module in the system; and Figure 6 is a flowchart of the steps of the dialogue method using intent detection ensemble learning of the present invention.
The technical content of the present invention is described below by way of specific embodiments; those skilled in the art can readily understand its advantages and effects from the disclosure of this specification. The present invention may also be practiced or applied through other, different embodiments.
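Figure 1 is a system architecture diagram of the dialogue system using intent detection ensemble learning of the present invention. As shown, the dialogue system 1 using intent detection ensemble learning includes a text input receiving module 11, a text preprocessing module 12, a semi-supervised dialogue topic module 13, an intent recognition ensemble learning module 14, a knowledge base search module 15, a system reply module 16, and a system labeling module 17.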
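The text input receiving module 11 receives text content entered by the user; the dialogue system 1 obtains the user's text through this module. If the user's actual input is speech, the application front end can connect to a speech recognizer, which converts the speech to text before passing it to the text input receiving module 11.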
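In a specific implementation, the text input receiving module 11 is an interface through which the user enters text. It may present an input screen through a client-server web page or a mobile application (app), where the user enters a dialogue question at the client; the text is then transmitted over the network to a server or the cloud and passed to the text preprocessing module 12 there for processing.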
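The text preprocessing module 12 receives the text content entered by the user and defines it as unlabeled data, and performs text preprocessing on the unlabeled data and existing labeled data to convert words into vector representations. In short, the text preprocessing module 12 preprocesses the user's text input and the content labeled by the system labeling module 17. Specifically, as shown in Figure 2, the text preprocessing module 12 may include, without limitation, a sentence normalization unit 1201, a word segmentation unit 1202, and a word vectorization unit 1203: the sentence normalization unit 1201 filters out specific symbols or languages and performs encoding conversion, the word segmentation unit 1202 splits the user's text into words, and the word vectorization unit 1203 converts the segmented words into vector representations. In a specific implementation, the text preprocessing module 12 may be a text preprocessing program running on a server or in the cloud.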
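The semi-supervised dialogue topic module 13 has a semi-supervised dialogue topic model for producing a dialogue topic distribution, wherein the model is built from training data composed of the unlabeled-data words and vectors and the labeled-data words and vectors from the text preprocessing module. The semi-supervised dialogue topic module 13 mainly uses labeled data to guide unlabeled data in analyzing the topic meaning implicit in dialogue sentences (that is, the text content), so as to produce more meaningful clustering results. Specifically, it receives the unlabeled-data words and vectors and the labeled-data words and vectors recorded by the text preprocessing module 12 and uses them to build the semi-supervised dialogue topic model. Thereafter, the current dialogue data, once processed by the text preprocessing module 12, is fed into the semi-supervised dialogue topic module 13 to produce the dialogue topic distribution of the current dialogue data. In a specific implementation, the semi-supervised dialogue topic model may be a program model that predicts dialogue topic distributions on a server or in the cloud, with its input data coming from the unlabeled-data words and vectors and the labeled-data words and vectors of the text preprocessing module 12.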
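The intent recognition ensemble learning module 14 has a sample and feature selector, intent recognizers, and an intent recognition integrator. The sample and feature selector performs sample and feature selection on the training data to form multiple training data subsets, each training data subset is used to train a corresponding intent recognizer, and the intent recognition integrator combines the intent decisions of the multiple intent recognizers to produce a final intent classification decision. The intent recognition ensemble learning module 14 mainly applies ensemble learning, using bootstrap aggregating (Bagging), Boosting, or a mixture of the two, to reduce model variance and model error.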
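Specifically, the intent recognition ensemble learning module 14 uses the sample and feature selector to perform sample and feature selection on the training data, forming multiple training data subsets, and trains one group of intent recognizers on each batch of training data. The intent recognizers may use, without limitation, pattern matching, statistical methods (regression analysis), machine learning models (SVM, neural networks, decision trees, and so on), deep learning models (RNN, LSTM, DNN, CNN, and so on), individually or in combination. The intent recognition integrator then combines the intent decisions of the multiple intent recognizers to produce the final intent classification decision. The intent recognition integrator may use, without limitation, combination methods such as weighted averaging and voting.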
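The two combination methods named above, voting and weighted averaging, can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the tie-breaking rule, and the `{intent: score}` input format are assumptions introduced here.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the intent chosen by the most recognizers.

    `predictions` is a non-empty list with one intent label per recognizer.
    Ties go to the tied label that appears first in the list (an assumption).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return next(label for label in predictions if counts[label] == best)


def weighted_average(score_dicts, weights):
    """Combine per-recognizer intent scores by weighted average.

    `score_dicts` holds one {intent: score} dict per recognizer;
    `weights` holds one positive weight per recognizer.
    Returns the winning intent and the combined scores.
    """
    total = sum(weights)
    combined = defaultdict(float)
    for scores, weight in zip(score_dicts, weights):
        for intent, score in scores.items():
            combined[intent] += weight * score / total
    winner = max(combined, key=combined.get)
    return winner, dict(combined)
```

Voting needs only the label each recognizer outputs, whereas weighted averaging lets more reliable recognizers count for more but requires every recognizer to expose comparable scores.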
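In summary, the intent recognition ensemble learning module 14 receives the unlabeled-data words and vectors and the labeled-data words and vectors from the text preprocessing module 12, together with the dialogue topic distribution from the semi-supervised dialogue topic module 13, produces the intent of the current dialogue text, and sends the result to the knowledge base search module 15. In a specific implementation, the intent recognition ensemble learning module 14 may be a program on a server or in the cloud that improves dialogue intent prediction through ensemble learning.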
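The knowledge base search module 15 queries a database using the final intent classification decision to obtain the system reply content. It queries the knowledge base with the dialogue intent received from the intent recognition ensemble learning module 14 to determine the system reply content, and then sends that content to the system reply module 16. The types of system reply content include, without limitation, answers to user questions, continuation of the dialogue, and requests for user confirmation. In a specific implementation, the knowledge base search module 15 may be a database module on a server or in the cloud that stores the mapping between dialogue intents and response sentences and provides indexing, search, and fuzzy matching.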
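The system reply module 16 sends the system reply content to the user and receives reply data returned by the user indicating whether the system reply content is correct. After returning the system reply content produced by the knowledge base search module 15 to the user, the system reply module 16 judges the correctness of the reply to confirm whether the dialogue system answered correctly. It records the corresponding question sentence (Q), answer sentence (A), intent (Intent), and reply rating (Reply), and passes the incorrectly answered items (those with a poor reply rating) to the system labeling module 17 for data labeling.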
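In a specific implementation, the system reply module 16 may be an interface that returns text messages to the user. It may present a reply screen through a client-server web page or a mobile application (app), where the user at the client can read the dialogue response, click text hyperlinks, and play audio or video media. The user response interface provided by the system reply module 16 may be designed with "like" and "dislike" buttons or marks; the corresponding question sentence (Q), answer sentence (A), intent (Intent), and reply rating (Reply) are then recorded and fed back to the system labeling module 17.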
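The recorded (Q, A, Intent, Reply) tuple and the hand-off of poorly rated replies to the labeling stage can be sketched as below. The class name, field names, and sample sentences are hypothetical and chosen only for illustration; the source does not specify a storage format.

```python
from dataclasses import dataclass


@dataclass
class DialogueRecord:
    question: str  # Q: the user's question sentence
    answer: str    # A: the reply the system returned
    intent: str    # Intent: the intent category that was predicted
    liked: bool    # Reply: True for "like", False for "dislike"


def select_for_labeling(records):
    """Return the records whose reply was rated as wrong ("dislike")."""
    return [r for r in records if not r.liked]


# Hypothetical log entries, for illustration only.
log = [
    DialogueRecord("How do I enable roaming in Japan?", "Roaming plans ...", "roaming", True),
    DialogueRecord("Why is my bill so high?", "Roaming plans ...", "roaming", False),
]
to_label = select_for_labeling(log)
```

Only the second record would be forwarded for relabeling, because its reply was rated "dislike".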
When the reply data indicates that the system reply content is incorrect, the system labeling module 17 labels the text entered by the user with the correct intent category and feeds it to the text preprocessing module to generate new word vectors, whereby the semi-supervised dialogue topic model is updated to optimize the output topic probability distribution and the intent category predicted by the intent recognition ensemble learning module is readjusted. The main purpose of the system labeling module 17 is to update intent categories so as to improve intent recognition accuracy. It receives the user text inputs that the system reply module 16 reported as answered incorrectly and labels them with the correct intent categories. The newly labeled training data are fed to the text preprocessing module 12 to generate new word vectors, the semi-supervised dialogue topic module 13 is updated to output a new topic probability distribution, and the intent recognition ensemble learning module 14 is readjusted so that it predicts intent categories more accurately. More suitable reply content can then be retrieved from the knowledge base and returned to the user, improving the performance of the dialogue system.
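The labeling methods of the system labeling module 17 include, without limitation, manual labeling, automatic labeling predicted by the system, or a combination thereof. The system labeling module 17 may also use the semi-supervised dialogue topic module 13 to find similar sentences that share the topic meaning of the erroneous data, either to assist manual labeling or, using a nearest-neighbor algorithm, to find the k closest labeled data items in the dialogue topic space, exclude the intent category that the intent recognition ensemble learning module 14 originally predicted for the erroneous data, and automatically label the erroneous data with the intent category receiving the most votes.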
In a specific implementation, the system labeling module 17 may be an application system that performs data labeling on a server or in the cloud, offering both manual and automatic labeling. Manual labeling provides an administration interface for annotators. It may be presented as a web-based database view and may include, without limitation, display of data related to dialogue sentences, a view of the topic distribution of dialogue sentences, a label input interface, a graphical operation interface, trend charts, and access control. Automatic labeling is an algorithmic program. Within the semi-supervised dialogue topic module 13 it finds the top m topics closest to the dialogue sentence to be labeled, selects from those m topics the n labeled dialogue items (pairs of question sentence (Q) and intent (Intent)) closest to the sentence, and excludes the intent with which the sentence was previously misclassified. It then uses a natural-language semantic similarity algorithm to find the k most similar dialogue sentences and takes a majority vote to decide the automatically assigned intent. After manual or automatic labeling, the labeled data can be imported into the semi-supervised dialogue topic model of the semi-supervised dialogue topic module 13 and into the intent recognition ensemble learning module 14 to update the models, improving overall dialogue intent recognition.
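The automatic labeling step can be sketched as a k-nearest-neighbor vote in topic space. This is a simplified sketch under stated assumptions: cosine similarity between topic distributions stands in for the unspecified semantic similarity algorithm, the preliminary narrowing to m topics and n candidates is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def auto_label(topic_vec, wrong_intent, labeled, k=3):
    """Label a misclassified sentence by k-nearest-neighbor voting.

    `topic_vec` is the sentence's topic distribution, `wrong_intent` is the
    intent that was predicted and then rejected through user feedback, and
    `labeled` is a list of (topic distribution, intent) pairs.
    """
    # First exclude the intent already known to be wrong.
    candidates = [(vec, intent) for vec, intent in labeled if intent != wrong_intent]
    # Then keep the k candidates closest to the sentence in topic space.
    nearest = sorted(candidates, key=lambda c: cosine(topic_vec, c[0]), reverse=True)[:k]
    # Majority vote among the neighbors decides the new label.
    votes = Counter(intent for _, intent in nearest)
    return votes.most_common(1)[0][0] if votes else None
```

Excluding the rejected intent before the neighbor search guarantees that the vote cannot simply reproduce the earlier mistake.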
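Figure 3 is an execution flowchart of the text preprocessing module in the system of the present invention. Referring also to Figure 1, sentences received from the text input receiving module 11 are stored in the database of unlabeled data 121 in the text preprocessing module 12. More specifically, the text preprocessing module 12 has two main input sources: the text questions entered by users and received by the text input receiving module 11 (the source of the unlabeled data 121 database), and the dialogue text questions, together with their labeled intent categories, produced by the system labeling module 17 from earlier system feedback and user feedback (the source of the labeled data 122 database). Feeding the dialogue text questions from both sources into the text preprocessing module 12 yields their preprocessed word-vector representations.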
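The text preprocessing module 12 performs procedures including, without limitation, sentence normalization 123, word segmentation and stop-word removal 124, and word vectorization 125. Sentence normalization 123 filters out specific symbols or languages and performs encoding conversion. Word segmentation and stop-word removal 124 splits the user's text into words and removes those listed in a stop-word list (stopwords). Word vectorization 125 converts the segmented words into vector representations. Finally, the words and vectors are stored in the database of unlabeled-data words and vectors 126 and the database of labeled-data words and vectors 127.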
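The sentence normalization step can be sketched with a regular expression filter followed by a character mapping. This is only an illustrative sketch: the set of retained punctuation marks and the one-entry variant-to-traditional table are assumptions, and a real system would rely on a complete conversion table.

```python
import re

# Hypothetical one-entry mapping from a variant character to its traditional
# form; a production system would use a full conversion table.
VARIANT_MAP = str.maketrans({"囯": "國"})

# Everything except CJK ideographs, ASCII letters, and a few punctuation
# marks (the exact set is an assumption) is treated as noise.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z,.?!,。?!]")


def normalize(sentence):
    """Drop disallowed symbols, then map variant characters."""
    return _DISALLOWED.sub("", sentence).translate(VARIANT_MAP)


cleaned = normalize("我要查日本的囯際漫遊方案@#$@")
```

The trailing symbols are removed by the filter, and the variant character is replaced by the mapping table, leaving a sentence ready for word segmentation.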
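Figure 4 is an execution flowchart of the intent recognition ensemble learning module in the system of the present invention. Referring also to Figure 1, the intent recognition ensemble learning module 14 applies ensemble learning, using bootstrap aggregating (Bagging), Boosting, or a mixture of the two, to reduce model variance and model error. For training, the words and vectors produced by the text preprocessing module 12 are concatenated with the topic probability distribution obtained by feeding them into the semi-supervised dialogue topic module 13; the concatenation serves as the input features that make up the training data, which is fed into the intent recognition ensemble learning module 14 to produce the dialogue intent of the sentence.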
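The feature concatenation described above can be sketched in a few lines. How the per-word vectors are aggregated into one sentence vector is not stated in the source; summing one-hot vectors into a bag-of-words vector is an assumption made here, and `build_features` is a hypothetical name.

```python
def build_features(word_vectors, topic_distribution):
    """Concatenate a sentence vector with its topic distribution.

    `word_vectors` is a list of equal-length one-hot vectors, one per word.
    They are summed element-wise into a bag-of-words sentence vector
    (an assumption), and the topic distribution is appended to it.
    """
    sentence_vec = [sum(column) for column in zip(*word_vectors)]
    return sentence_vec + list(topic_distribution)
```

For two words over a three-word dictionary and a two-topic distribution, the resulting feature vector has five entries.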
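As shown, the intent recognition ensemble learning module 14 may include a sample and feature selector 141, intent recognizers 142, and an intent recognition integrator 143. Its input features consist of the dialogue text vectors and the dialogue topic distribution. The sample and feature selector 141 selects multiple training subsets and passes them to multiple intent recognizers 142 for training; the intent recognition integrator 143 then combines their outputs into the final intent recognition result and feeds misclassified samples back to the sample and feature selector 141 to adjust the weights, and training is iterated until the accuracy converges. More specifically, the sample and feature selector 141 performs sample and feature selection on the training data to form multiple training data subsets, and one group of intent recognizers 142 is trained on each batch of training data. Each intent recognizer 142 may use, without limitation, pattern matching, statistical methods (regression analysis), machine learning models (SVM, neural networks, decision trees, and so on), deep learning models (RNN, LSTM, DNN, CNN, and so on), individually or in combination. The intent recognition integrator 143 then combines the intent decisions of the multiple intent recognizers 142 to produce the final intent classification decision, using, without limitation, combination methods such as weighted averaging and voting. Finally, the intent classification predicted by the intent recognition integrator 143 is evaluated against the true intent category, and the misclassified samples are fed back to the sample and feature selector 141 to update the selection weights of the training samples. Training is then iterated until the accuracy reaches a satisfactory threshold. Once training is complete, the intent recognition ensemble learning module 14 can recognize the intent of a dialogue sentence.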
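The training loop described above (weighted subset selection, per-subset training, voting, and reweighting of misclassified samples) can be sketched as follows. This is a simplified sketch, not the patent's algorithm: feature selection is omitted, the `boost` factor and default parameters are assumptions, and `train` is a hypothetical caller-supplied function that takes samples and labels and returns a callable recognizer.

```python
import random
from collections import Counter


def update_weights(weights, y_true, y_pred, boost=2.0):
    """Multiply the weight of each misclassified sample by `boost`, then renormalize."""
    raw = [w * boost if t != p else w for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw]


def train_ensemble(samples, labels, train, n_recognizers=3,
                   threshold=0.9, max_rounds=10, seed=0):
    """Iterate subset sampling, training, and voting until accuracy >= threshold."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    weights = [1.0 / len(samples)] * len(samples)
    recognizers, accuracy = [], 0.0
    for _ in range(max_rounds):
        recognizers = []
        for _ in range(n_recognizers):
            # Draw a training subset with replacement, favoring high-weight samples.
            chosen = rng.choices(indices, weights=weights, k=len(indices))
            recognizers.append(train([samples[i] for i in chosen],
                                     [labels[i] for i in chosen]))
        # Integrate the recognizers' decisions by majority vote.
        predictions = [
            Counter(r(x) for r in recognizers).most_common(1)[0][0]
            for x in samples
        ]
        accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
        if accuracy >= threshold:
            break
        # Misclassified samples become more likely to be drawn next round.
        weights = update_weights(weights, labels, predictions)
    return recognizers, accuracy
```

Sampling with replacement gives the Bagging-style diversity between recognizers, while the weight update supplies the Boosting-style focus on hard samples.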
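Figure 5 is an execution flowchart of the semi-supervised dialogue topic module in the system of the present invention. Referring also to Figure 1, the semi-supervised dialogue topic module 13 mainly uses labeled data to guide unlabeled data in analyzing the topic meaning implicit in dialogue sentences. A semi-supervised approach is used because labeled data is limited and costly to obtain, whereas unlabeled data is relatively easy to obtain but, used alone, yields models that are less effective and hard to interpret. Modeling with both kinds of data lets labeled data of known categories guide unlabeled data of unknown categories, producing more meaningful clustering results.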
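As shown, the semi-supervised dialogue topic module 13 comprises a bag-of-words model 131, a term frequency-inverse document frequency (TF-IDF) model 132, and a semi-supervised Latent Dirichlet Allocation (LDA) dialogue topic model 133, and ultimately produces the dialogue topic distribution 134. The module trains the semi-supervised dialogue topic model on training data composed of the unlabeled data (the unlabeled-data words and vectors 126 of Figure 3) and the labeled data (the labeled-data words and vectors 127 of Figure 3) output earlier by the text preprocessing module 12. The model may be built using, without limitation, pattern matching, machine learning models (LDA, LSI, and so on), deep learning models (TopicRNN, LSTM+LDA, and so on), individually or in combination. Once training is complete, a sentence whose topic is to be analyzed is fed into the model to obtain the topic probability distribution of that text.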
Figure 6 is a flowchart of the steps of the dialogue method using intent detection ensemble learning of the present invention.
In step S61, text content entered by the user is received. In one embodiment, if the user's actual input is speech, the application front end can connect to a speech recognizer, and subsequent processing is performed after the speech has been converted to text.
In step S62, the text content entered by the user is defined as unlabeled data, and text preprocessing is performed on the unlabeled data and existing labeled data to convert words into vector representations. This step preprocesses the text content and outputs the preprocessing result. The text content may be the user's input received in step S61 (unlabeled data) or labeled data (produced in step S67, described later). Text preprocessing includes, without limitation, sentence normalization, word segmentation, and word vectorization: sentence normalization filters out specific symbols or languages and performs encoding conversion, word segmentation splits the user's text into words, and word vectorization converts the segmented words into vector representations. Finally, the preprocessed results can be divided into labeled and unlabeled content and stored in separate databases.
In step S63, the unlabeled data and the labeled data are fed into the semi-supervised dialogue topic model to output a dialogue topic distribution. As described above, a large amount of unlabeled dialogue text is used to build the topic model, and labeled data guides the unlabeled data in analyzing the topic meaning implicit in dialogue sentences, so as to produce more meaningful clustering results. The resulting dialogue topic distribution can then serve as the topic feature of the training data used later by the intent recognizers.
In step S64, the dialogue topic distribution and the dialogue text content are taken together as input, and multiple intent decisions are combined with a method that reinforces learning on misclassified samples to produce a final intent classification decision. In this step, sample and feature selection is performed on the training data to form multiple training data subsets, one group of intent recognizers is trained on each batch of training data, and the multiple intent decisions are then combined to produce the final intent classification decision. The final intent is then evaluated against the true intent, and misclassified training samples are fed back so that they are selected with greater weight next time; training is iterated until the accuracy reaches a satisfactory threshold. Once training is complete, the dialogue intent can be identified by inputting the current dialogue text content together with the dialogue topic distribution produced in step S63.
In step S65, the system reply content is determined according to the final intent classification. In this step, the reply may be obtained by querying a database. The types of system reply content include, without limitation, answers to user questions, continuation of the dialogue, and requests for user confirmation.
In step S66, the system reply content is returned to the user, and the user's feedback data is received to judge the correctness of the reply. That is, the system reply content generated by the system is returned to the user, and the correctness of the reply is then judged to confirm whether the dialogue system answered correctly.
In step S67, the user's text content and feedback data are labeled and updated accordingly, and the labeled data is then imported into the semi-supervised dialogue topic model for updating and learning. That is, the user's question and the user's feedback are labeled and updated, and the labeled data is imported into the semi-supervised dialogue topic model so that the model is updated and retrained to refine the system.
The operation of each component of the dialogue system using semi-supervised intent detection ensemble learning of the present invention is described below for one embodiment, using a concrete example and with reference to Figures 1 to 4.
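First, the text input receiving module 11 receives, through a text input interface, the text the user enters into the dialogue system; in this embodiment: 「我要查日本的囯際漫遊方案@#$@」 ("I want to look up international roaming plans for Japan", followed by stray symbols). This input is passed to the text preprocessing module 12. The text preprocessing module 12 first applies sentence normalization 123 to the user's text, keeping only Chinese characters, English letters, and a small set of punctuation marks, and converting the Chinese to traditional characters. The text in this embodiment becomes 「我要查日本的國際漫遊方案」.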
Next, word segmentation and stop-word removal 124 is performed: a word segmenter splits the sentence into individual words, and words listed in the stop-word list (stopwords) are removed. In this embodiment the sentence is first segmented into 「我 要查 日本 國際 漫遊 方案」. Assuming the words 「我」 ("I") and 「方案」 ("plan") appear in the stop-word list, the segmentation result after stop-word removal is 「要查 日本 國際 漫遊」 ("look up", "Japan", "international", "roaming"). Word vectorization 125 then converts each segmented word into its representative vector. This embodiment uses one-hot vectors: the vector length equals the dictionary size, each dimension represents one word in the dictionary, and a word's one-hot vector is 1 only in its own dimension and 0 in all others. For example, one one-hot vector for 「日本」 is [0,1,0,0,0,0,0]. Finally, the normalization and segmentation results and the word vectorization results are combined into the text preprocessing result and passed to the semi-supervised dialogue topic model in the semi-supervised dialogue topic module 13. Thus, when 「我要查日本的囯際漫遊方案@#$@」 is input to the text preprocessing module 12, the text preprocessing result contains the normalized, segmented, and stop-word-filtered text 「要查 日本 國際 漫遊」 and the word vectorization result 「要查=[1,0,0,0,0,0,0];日本=[0,1,0,0,0,0,0];國際=[0,0,1,0,0,0,0];漫遊=[0,0,0,1,0,0,0]」.
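The stop-word removal and one-hot vectorization in this example can be reproduced with a short sketch. The token list is written out by hand because the word segmenter itself is not specified, and the last three dictionary entries are hypothetical fillers added only so that the dictionary has the seven entries implied by the vector length.

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stopwords]


def one_hot(word, dictionary):
    """Return a vector that is 1 at the word's dictionary index and 0 elsewhere.

    Raises ValueError if the word is not in the dictionary.
    """
    vec = [0] * len(dictionary)
    vec[dictionary.index(word)] = 1
    return vec


# Segmentation result from the example, written out by hand.
tokens = ["我", "要查", "日本", "國際", "漫遊", "方案"]
stopwords = {"我", "方案"}
kept = remove_stopwords(tokens, stopwords)

# The last three entries are hypothetical fillers (not from the source).
dictionary = ["要查", "日本", "國際", "漫遊", "資費", "申請", "取消"]
vectors = {word: one_hot(word, dictionary) for word in kept}
```

One-hot vectors are simple but grow with the dictionary size and carry no notion of similarity between words, which is one reason the topic distribution is added as a further feature.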
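Next, the semi-supervised dialogue topic model is trained as a topic model on training data collected in advance, namely the unlabeled data words and vectors 126 and the labeled data words and vectors 127. This embodiment takes semi-supervised latent Dirichlet allocation (LDA) as an example. Because the LDA model requires a frequency-based representation, the training data must first be converted by the bag-of-words (BOW) model 131. To keep topics from being dominated by high-frequency words, the data may additionally be converted by the TF-IDF model 132.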
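The two conversions can be sketched as follows. This is a minimal illustration that assumes one common TF-IDF variant (term frequency times log(N/df)); the embodiment does not state which variant the TF-IDF model 132 uses, and the three toy documents are made up.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Map each tokenized document to a {word: count} counter."""
    return [Counter(doc) for doc in docs]

def tf_idf(bows):
    """Reweight counts by tf * log(N / df); a word found in every document scores 0."""
    n_docs = len(bows)
    doc_freq = Counter(word for bow in bows for word in bow)
    weighted = []
    for bow in bows:
        total = sum(bow.values())
        weighted.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in bow.items()}
        )
    return weighted

# Toy corpus: 「日本」 occurs in every document, so TF-IDF suppresses it.
docs = [["要查", "日本", "國際", "漫遊"], ["出國", "日本"], ["打電話", "日本"]]
weights = tf_idf(bag_of_words(docs))
```

In this toy corpus the weight of 「日本」 drops to zero while rarer words such as 「出國」 keep a positive weight, which is the effect the TF-IDF step is meant to have on high-frequency words.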
The generative process of the conventional LDA model is as follows: (1) sample the topic distribution θ_m of document m from the Dirichlet distribution with parameter α; (2) sample the topic Z_{m,n} of the n-th word of the document from the multinomial topic distribution θ_m; (3) sample the word distribution corresponding to topic Z_{m,n} from the Dirichlet distribution with parameter β; (4) sample the final word W_{m,n} from that multinomial word distribution.
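The four generative steps can be sketched with the Python standard library as shown below. The name `phi` for the per-topic word distributions and the concrete parameter values are assumptions for illustration; the sketch draws every topic's word distribution once up front so that repeated topics reuse the same distribution.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, beta, vocabulary, rng):
    """Generate (topic, word) pairs for one document following steps (1) to (4)."""
    theta_m = sample_dirichlet(alpha, rng)                      # step (1)
    phi = [sample_dirichlet(beta, rng) for _ in alpha]          # step (3), per topic
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta_m)[0]  # step (2)
        w = rng.choices(vocabulary, weights=phi[z])[0]          # step (4)
        words.append((z, w))
    return theta_m, words

vocabulary = ["日本", "出國", "國際", "漫遊"]
theta, words = generate_document(5, [0.5, 0.5], [0.5] * 4, vocabulary, random.Random(0))
```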
The procedure above is an unsupervised learning algorithm (it uses no labeled data). A simple way to turn this unsupervised model into a semi-supervised one by adding labeled data is to boost the probability that certain important words occur in specific topics. In this embodiment, suppose there is a topic Z_i, 「日本國際漫遊」 ("Japan international roaming"), and the word set {"日本", "出國"} ("Japan", "going abroad") is considered highly relevant to it. During data labeling, the keywords of topic Z_i are therefore labeled as W_key = {日本, 出國}, and the probability linking these keywords to the topic is set to P(Z_i | W_key) = 1. During LDA training, these labeled data are provided as input, and whenever a labeled topic and keyword pair is matched, the probability that the keyword belongs to that topic is boosted. In addition, a probability threshold may be set to control how strongly the labeled data influence the model: a random number between 0 and 1 is generated, and the boost is applied only when the random number is less than the threshold.
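One possible reading of this keyword boost is sketched below. The table layout `topic_given_word[word][topic]`, holding an estimate of P(topic | word), and the function name are hypothetical; how the boost is wired into an actual LDA sampler is not specified in this embodiment.

```python
import random

def boost_seed_words(topic_given_word, seed_words, topic, threshold, rng):
    """Pin each seed word to `topic` when a random draw falls below `threshold`.

    Setting the entry for the labeled topic to 1 and all others to 0 mirrors
    P(Z_i | W_key) = 1 in the text.
    """
    for word in seed_words:
        if word in topic_given_word and rng.random() < threshold:
            n_topics = len(topic_given_word[word])
            topic_given_word[word] = [1.0 if t == topic else 0.0 for t in range(n_topics)]
    return topic_given_word

# Two topics; index 0 plays the role of Z_i (「日本國際漫遊」).
table = {"日本": [0.2, 0.8], "出國": [0.5, 0.5], "打電話": [0.1, 0.9]}
boosted = boost_seed_words(table, ["日本", "出國"], 0, 1.0, random.Random(0))
```

A threshold of 1.0 always applies the boost and a threshold of 0.0 never does, so intermediate values control how often the labeled keywords override the unsupervised estimate.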
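Once the semi-supervised LDA dialogue topic model 133 has been trained, the weighted keywords that make up each topic are known. For example, topic Z_i consists of {0.5*日本 + 0.3*出國 + 0.1*國際 + 0.1*漫遊}, so the topic can reasonably be named 「日本國際漫遊」 ("Japan international roaming"). Suppose 100 topics are preset. At prediction time, the input sentence 「我要出國到日本」 ("I want to go abroad to Japan") is segmented by text preprocessing into 「出國 日本」. A score for the sentence is then computed against each of the topics Z_0 to Z_99, which yields the 100-dimensional predicted dialogue topic distribution 134 for the sentence, Topics_100 = [0.12, 0.3, 0.8, ...].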
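The scoring rule is not spelled out in this embodiment. A minimal sketch, assuming that a sentence's score for a topic is simply the sum of the topic's keyword weights over the sentence's words, is given below; the second topic, labeled `Z_j` here, is a made-up example, and the dictionary layout is hypothetical.

```python
# Keyword weights per topic; Z_i follows the example in the text.
TOPICS = {
    "Z_i": {"日本": 0.5, "出國": 0.3, "國際": 0.1, "漫遊": 0.1},
    "Z_j": {"打電話": 0.4, "國際": 0.4, "撥打": 0.1},
}

def topic_scores(tokens, topics):
    """Score a sentence against each topic as the sum of its tokens' keyword weights."""
    return {
        name: sum(weights.get(tok, 0.0) for tok in tokens)
        for name, weights in topics.items()
    }

scores = topic_scores(["出國", "日本"], TOPICS)
```

Under this assumption the sentence 「出國 日本」 scores about 0.8 for Z_i and 0 for Z_j.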
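Next, the intent recognition ensemble learning module 14 is trained beforehand on pre-built training data: the word vectors produced by the text preprocessing module 12 are concatenated with the topic probability distribution produced by the semi-supervised dialogue topic model, and the concatenation serves as the input feature of each training sample. The training data are fed into the intent recognition ensemble learning module 14, which outputs the dialogue intent of the sentence. In this embodiment, as shown in FIG. 4, the sample and feature selector 141 selects samples and features from the training data. The initial selection may be random sampling with replacement, drawing n subsets T_1 to T_n of the training data, each of size m. Each subset is used to train its own intent recognizer 142. This embodiment uses SVM as the classification method of the intent recognizer 142, so n SVM models SVM_1 to SVM_n are trained separately. After training, for the same input sentence, each model SVM_i outputs its own predicted intent Intent_i.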
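The feature construction and the initial sampling step can be sketched as follows. Summing a sentence's one-hot vectors into one fixed-length vector before appending the topic distribution is an assumption of this sketch (the embodiment only states that the two are concatenated), and the SVM training itself is omitted; the function names are hypothetical.

```python
import random

def build_features(word_vectors, topic_distribution):
    """Sum the one-hot word vectors of a sentence and append its topic distribution."""
    summed = [sum(column) for column in zip(*word_vectors)]
    return summed + list(topic_distribution)

def bootstrap_subsets(samples, n, m, rng, weights=None):
    """Draw n subsets T_1..T_n of size m by sampling with replacement.

    `weights` lets misclassified samples be drawn more often in later rounds.
    """
    return [rng.choices(samples, weights=weights, k=m) for _ in range(n)]

features = build_features([[1, 0, 0], [0, 1, 0]], [0.2, 0.8])
subsets = bootstrap_subsets(list(range(10)), n=3, m=5, rng=random.Random(0))
```

Each subset in `subsets` would then be used to train one intent recognizer.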
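An intent recognition aggregator 143 then combines the Intent_i results of the SVM_i models to produce the final intent classification decision. The intent recognition aggregator 143 may use majority voting to decide the final intent: it collects all intent decisions Intent_1 to Intent_n and takes the class with the most votes as the final intent. Finally, the final intent is compared with the true intent of the sentence, and misclassified training samples are fed back to the sample and feature selector 141 to raise their weight of being selected next time. In this embodiment, the weights may be updated using the AdaBoost weight update rule.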
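A minimal sketch of the voting step and of a weight update is given below. It uses the textbook AdaBoost update (weights multiplied by exp(+α) for misclassified samples and exp(-α) for correct ones, with α = 0.5·ln((1-ε)/ε), followed by renormalization), which is an assumption and may differ in detail from the update rule intended by this embodiment. The input weights are assumed to sum to 1.

```python
import math
from collections import Counter

def majority_vote(intents):
    """Return the intent predicted by the most recognizers (ties: first seen wins)."""
    return Counter(intents).most_common(1)[0][0]

def adaboost_reweight(weights, is_correct):
    """Raise the weights of misclassified samples and renormalize to sum to 1."""
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return [w / total for w in updated]

final_intent = majority_vote(["日本國際漫遊", "撥打國際電話", "日本國際漫遊"])
new_weights = adaboost_reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

In the second call the single misclassified sample ends up with half of the total weight, so it is more likely to be drawn by the sample and feature selector in the next round.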
The model is then trained iteratively until its accuracy reaches a satisfactory threshold, for example 0.95. Once training is complete, the intent recognition ensemble learning module 14 can recognize the intent of a dialogue sentence.
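Next, the knowledge base search module 15 queries the knowledge base with the dialogue intent produced by the intent recognition ensemble learning module 14 and then determines the system reply. Types of system reply include, but are not limited to, answering the user's question, continuing the dialogue, and asking the user for confirmation. For example, when the recognized intent is 「日本國際漫遊」 ("Japan international roaming"), the knowledge base lookup may return to the user: 「請使用APP登入選取辦理國際漫遊選項->日本國際漫遊,選擇所需的使用日期,送出即可」 ("Log in with the app, choose the international roaming option -> Japan international roaming, select the dates you need, and submit").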
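The main function of the system reply module 16 is to receive the system reply generated by the knowledge base search module 15 and return it to the user. It then judges whether the reply was correct, and it records the user text inputs that were answered incorrectly and passes them to the system labeling module 17 for labeling. In this embodiment, the answer sentence 「請使用APP登入選取辦理國際漫遊選項,選擇所需的使用日期,送出即可」 ("Log in with the app, choose the international roaming option, select the dates you need, and submit") is sent to the user, and the user's rating is received. The system may provide a rating interface on the reply message so that the user can select 「喜歡」 ("like") or 「不喜歡」 ("dislike") on the received screen. The system reply module 16 records the question sentence (Q), answer sentence (A), intent (Intent), and reply rating (Reply) together, for example: {Q:「我要出國到日本」, A:「請使用APP登入選取辦理國際漫遊選項->日本國際漫遊,選擇所需的使用日期,送出即可」, Intent:「日本國際漫遊」, Reply: 喜歡}.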
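The record kept by the system reply module 16 might be represented as in the sketch below. The class name `ReplyRecord`, its field names, and the helper `records_to_relabel` are hypothetical; only the four recorded items (Q, A, Intent, Reply) come from the text.

```python
from dataclasses import dataclass

@dataclass
class ReplyRecord:
    question: str  # Q: the user's question sentence
    answer: str    # A: the answer sentence sent to the user
    intent: str    # Intent: the intent predicted for the question
    liked: bool    # Reply: True for 「喜歡」, False for 「不喜歡」

def records_to_relabel(records):
    """Return the records whose replies were rated 「不喜歡」, for the labeling module."""
    return [record for record in records if not record.liked]

ANSWER = "請使用APP登入選取辦理國際漫遊選項->日本國際漫遊,選擇所需的使用日期,送出即可"
log = [
    ReplyRecord("我要出國到日本", ANSWER, "日本國際漫遊", liked=True),
    ReplyRecord("我要打電話到日本", ANSWER, "日本國際漫遊", liked=False),
]
pending = records_to_relabel(log)
```

Only the disliked record is forwarded, which matches the description that incorrectly answered inputs are passed on for labeling.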
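The system labeling module 17 is a refinement module that updates intent categories to improve intent recognition accuracy. It receives the user text inputs that the system reply module 16 reported as answered incorrectly and labels them with the correct intent category. The newly labeled training data can then be fed into the text preprocessing module 12 to generate new word vectors, the semi-supervised dialogue topic model is updated to output new topic distributions, and the intent recognition ensemble learning module 14 is readjusted to predict intent categories more accurately. More suitable replies can then be retrieved from the knowledge base and returned to the user, which refines the performance of the dialogue system.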
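The labeling methods of the system labeling module 17 include, but are not limited to, manual labeling, automatic labeling predicted by the system, or a combination of the two. The semi-supervised dialogue topic model may also be used to find similar sentences with the same topic meaning as the erroneous data to assist manual labeling. Alternatively, a nearest-neighbor algorithm may be used to find the k closest labeled data in the dialogue topic space, excluding the intent category originally predicted for the erroneous data by the intent recognition ensemble learning module 14, so that the erroneous data is automatically labeled with the intent category that receives the most votes. In this embodiment, suppose the system reply module 16 returns a data pair rated as incorrect (disliked by the user): {Q:「我要打電話到日本」 ("I want to call Japan"), A:「使用APP登入選取辦理國際漫遊選項->日本國際漫遊,選擇所需的使用日期,送出即可」, Intent:「日本國際漫遊」, Reply: 不喜歡}. With manual labeling, the system labeling module updates the record to {Q:「我要打電話到日本」, Intent:「撥打國際電話」 ("make an international call")}. The segmentation result of Q, 「打電話 日本」, is passed through the semi-supervised dialogue topic model, which finds the two closest topics to be Z_i {0.5*日本+0.3*出國+0.1*國際+0.1*漫遊} (score 0.5) and Z_j {0.4*打電話+0.4*國際+0.1*撥打} (score 0.4), and a sentence from each topic is listed (Z_i: 我要辦理日本國際漫遊, "I want to sign up for Japan international roaming"; Z_j: 我要打國際電話, "I want to make an international call") for the human labeler's reference. With automatic labeling, the system labeling module 17 finds the labeled samples in Z_i and Z_j. Suppose there are five: S_1 {Q: 我要辦理日本國際漫遊, Intent:「日本國際漫遊」}, S_2 {Q: 我要去日本需要漫遊, Intent:「日本國際漫遊」}, S_3 {Q: 我要打國際電話, Intent:「撥打國際電話」}, S_4 {Q: 我要撥打電話到韓國, Intent:「撥打國際電話」}, and S_5 {Q: 我要撥打電話到台北, Intent:「撥打市話」 ("make a local call")}. After S_1 and S_2, which carry the originally wrong label 「日本國際漫遊」, are excluded, the three samples S_3, S_4, and S_5 remain, and the nearest-neighbor algorithm yields 「撥打國際電話」 as the automatically labeled intent.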
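The automatic labeling step can be sketched as a k-nearest-neighbor vote that first drops the rejected intent. The two-dimensional topic-space coordinates assigned to S_1 to S_5 below are made up for illustration, Euclidean distance is an assumed choice of metric, and the function name `auto_label` is hypothetical.

```python
import math
from collections import Counter

def auto_label(query_topics, labeled, wrong_intent, k):
    """Vote among the k nearest labeled samples after dropping the rejected intent."""
    candidates = [(vec, intent) for vec, intent in labeled if intent != wrong_intent]
    nearest = sorted(candidates, key=lambda item: math.dist(query_topics, item[0]))[:k]
    return Counter(intent for _, intent in nearest).most_common(1)[0][0]

# (Z_i score, Z_j score) per labeled sample; the coordinates are illustrative only.
labeled = [
    ((0.9, 0.1), "日本國際漫遊"),  # S_1
    ((0.8, 0.1), "日本國際漫遊"),  # S_2
    ((0.1, 0.9), "撥打國際電話"),  # S_3
    ((0.0, 0.5), "撥打國際電話"),  # S_4
    ((0.0, 0.4), "撥打市話"),      # S_5
]
# Query 「打電話 日本」 with topic scores (0.5, 0.4), as in the example.
label = auto_label((0.5, 0.4), labeled, wrong_intent="日本國際漫遊", k=3)
```

With k = 3 the vote is taken over S_3, S_4, and S_5, giving two votes for 「撥打國際電話」 against one for 「撥打市話」.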
In summary, the dialogue system and method using intent detection ensemble learning proposed by the present invention apply a semi-supervised dialogue topic model and intent detection ensemble learning to improve the accuracy of the dialogue system. Compared with the prior art, the invention has the following features and effects. First, it makes extensive use of unlabeled dialogue text to build a topic model that produces topic features for the training data, which increases the diversity of the intent recognizer's input features, provides accurate intent recognition, and reduces the number of dialogue turns. Second, it provides a fast feedback and refinement mechanism: based on user feedback and system feedback, the semi-supervised dialogue topic model assists AI trainers in labeling data in the labeling system, and the model can be given additional training based on the difference between the user's answer and the robot's recognition result, so that system performance improves. Third, it incorporates the update scheme of ensemble learning to reduce the error and variance of the dialogue intent topic model, so as to train a model that is effective and updated in real time.
The above embodiments merely illustrate the principles and effects of the present invention by way of example and are not intended to limit the present invention. Anyone skilled in the art may modify and alter the above embodiments without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as set forth in the appended claims.
1‧‧‧Dialogue system using intent detection ensemble learning
11‧‧‧Text input receiving module
12‧‧‧Text preprocessing module
13‧‧‧Semi-supervised dialogue topic module
14‧‧‧Intent recognition ensemble learning module
15‧‧‧Knowledge base search module
16‧‧‧System reply module
17‧‧‧System labeling module
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW108108454A TWI734085B (en) | 2019-03-13 | 2019-03-13 | Dialogue system using intention detection ensemble learning and method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202034207A TW202034207A (en) | 2020-09-16 |
TWI734085B true TWI734085B (en) | 2021-07-21 |
Family
ID=73643827
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI734085B (en) |