TW202234285A - Dialogue data processing system and method thereof and computer readable medium - Google Patents

Info

Publication number
TW202234285A
Authority
TW
Taiwan
Prior art keywords
sentence
intent
dialogue
data
intention
Prior art date
Application number
TW110106716A
Other languages
Chinese (zh)
Other versions
TWI761090B (en
Inventor
楊宗憲
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW110106716A
Application granted
Publication of TWI761090B
Publication of TW202234285A

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention provides a dialogue data processing system and method. The system collects user dialogue records, filters out intent sentences suitable for labeling according to filter indicators, selects a representative sentence from the labeled original data in each intent category, computes the average semantic distance between each intent sentence and the representative sentences, and then labels the intent sentence with the intent category of the corresponding representative sentence or with a new intent category. In addition, the system checks the number of samples within each intent category and the balance of sample counts across categories, and augments the categories that have too few samples. The invention can thus label user dialogue records automatically and check and improve the balance of the intent categories. The present invention further provides a computer-readable medium for performing the dialogue data processing method.

Description

Dialogue data processing system, method thereof, and computer-readable medium

The present invention relates to data processing technology, and more particularly to a dialogue data processing system, a method thereof, and a computer-readable medium.

Conventional systems for labeling user dialogue data can annotate intent category labels only according to whether the user's feedback was satisfied, and only after receiving a large volume of user dialogue records. For example, if the user's feedback on a given piece of dialogue data is satisfied, its intent category label need not be changed. Conversely, if the user's feedback is dissatisfied, an artificial intelligence trainer (AI trainer) must relabel that dialogue data with another suitable intent category, or create a new intent category and label the dialogue data with it.

However, most users do not volunteer feedback on whether they are satisfied, so the amount of dialogue data that can actually be collected is relatively small. Users do usually reveal their satisfaction or dissatisfaction in the sentences of the dialogue itself, so user satisfaction can be inferred by capturing the dialogue records and analyzing their sentences. Dialogue records, however, are often vast and messy; if all dialogue data were passed indiscriminately to AI trainers for labeling, the volume to be labeled would be too large to handle in practice. In addition, AI trainers often do not check the balance of data volume across the labeled groups, so the groups may differ greatly in size, which also affects subsequent training results.

In summary, a data processing technique that can effectively classify and label dialogue records and check the balance of data volume across the classification groups would benefit subsequent model training, and finding such a solution is a goal that those skilled in the art are eager to achieve.

In view of the above problems, the present invention proposes a dialogue data processing system comprising: a dialogue data capture module for collecting dialogue records containing sentences and filtering the dialogue records according to filter indicators to obtain intent sentences; and a dialogue data grouping and labeling module for selecting a representative sentence from the labeled original data in each of a plurality of intent categories, and comparing the intent sentences with each representative sentence to find, among the intent sentences, the one with the largest average semantic distance to the representative sentences. When that largest average semantic distance is less than a predetermined threshold, the intent sentences and the original data, with its labels hidden, are grouped and labeled around the representative sentences, so that the intent sentences and original data in the same group share the same label. When that largest average semantic distance exceeds the predetermined threshold, the intent sentence having it becomes the new representative sentence of a new intent category, and the intent sentences and the original data, with its labels hidden, are grouped and labeled around all the representative sentences and the new representative sentence, so that the intent sentences and original data in the same group share the same label.

In one embodiment, the dialogue data grouping and labeling module computes a cluster center from the semantic vectors corresponding to the sentences in each of the intent categories, and takes the sentence whose semantic vector is closest to that cluster center as the representative sentence.

In another embodiment, the average semantic distance is calculated by first computing the distance between the intent sentence and each representative sentence and then averaging those distances.

In another embodiment, the dialogue data processing system further includes a dialogue data enhancement module for data augmentation, which augments the categories with fewer samples when an intent category has too few samples or after comparing the differences in sample counts among all the intent categories.

In another embodiment, the data augmentation uses synonym replacement, random insertion, random swap, random deletion, data augmentation methods based on machine learning or deep learning, or any combination thereof.

In another embodiment, the filter indicators include a satisfaction feedback value, the positive or negative sentiment polarity of a sentence, the intent confidence of the dialogue text, or whether a transfer request was made.

In another embodiment, the dialogue data capture module further includes an emotion recognition unit, which uses an emotion recognition model to identify the emotional polarity of the intent sentence and thereby produce the sentence sentiment polarity.

In yet another embodiment, the dialogue data capture module further includes an intent recognition unit, which uses an intent recognition model to calculate the intent recognition confidence of the intent sentence and thereby produce the dialogue text intent confidence.

The present invention further proposes a dialogue data processing method comprising: collecting dialogue records containing sentences; filtering the dialogue records according to filter indicators to obtain intent sentences; selecting a representative sentence from the labeled original data in each of a plurality of intent categories; and comparing the intent sentences with each representative sentence to find, among the intent sentences, the one with the largest average semantic distance to the representative sentences. When that largest average semantic distance is less than a predetermined threshold, the intent sentences and the original data, with its labels hidden, are grouped and labeled around the representative sentences, so that the intent sentences and original data in the same group share the same label. When that largest average semantic distance exceeds the predetermined threshold, the intent sentence having it becomes the new representative sentence of a new intent category, and the intent sentences and the original data, with its labels hidden, are grouped and labeled around all the representative sentences and the new representative sentence, so that the intent sentences and original data in the same group share the same label.

In another embodiment, the step of selecting the representative sentence computes a cluster center from the semantic vectors corresponding to the sentences in each of the intent categories, and takes the sentence whose semantic vector is closest to that cluster center as the representative sentence.

In another embodiment, the average semantic distance is calculated by first computing the distance between the intent sentence and each representative sentence and then averaging those distances.

In another embodiment, the step of filtering the dialogue records according to the filter indicators to obtain the intent sentences further includes augmenting the intent categories with fewer samples when an intent category has too few samples or after comparing the differences in sample counts among all the intent categories.

In another embodiment, the data augmentation uses synonym replacement, random insertion, random swap, random deletion, data augmentation methods based on machine learning or deep learning, or any combination thereof.

In another embodiment, the filter indicators include a satisfaction feedback value, the positive or negative sentiment polarity of a sentence, the intent confidence of the dialogue text, or whether a transfer request was made.

In another embodiment, the sentence sentiment polarity is produced by using an emotion recognition model to identify the emotional polarity of the intent sentence.

In yet another embodiment, the dialogue text intent confidence is produced by using an intent recognition model to calculate the intent recognition confidence of the intent sentence.

The present invention further provides a computer-readable medium, for use in a computing device or computer, that stores instructions for executing the above dialogue data processing method.

In summary, after collecting users' dialogue records, the dialogue data processing system, method, and computer-readable medium of the present invention filter out the intent sentences suitable for labeling according to the configured filter indicators, select representative sentences from the existing intent categories, and analyze the semantics of the intent sentences against the representative sentences to decide whether an intent sentence should be labeled with the same intent category as a representative sentence or with a new intent category. The invention can therefore classify and label users' dialogue records automatically. In addition, once the intent sentences have been classified, the invention can check the number of samples in each intent category and the differences in sample counts among all intent categories, and augment the categories with too few samples, thereby avoiding the imbalance caused by differences in data volume among intent categories.

10, 10': dialogue data processing system

11: dialogue data capture module

111: emotion recognition unit

112: intent recognition unit

12: dialogue data grouping and labeling module

13: dialogue data enhancement module

S401~S405: steps

601~608: process flow

FIG. 1 is a schematic architecture diagram of the dialogue data processing system of the present invention.

FIG. 2 is a schematic structural diagram of the dialogue data capture module of the present invention.

FIG. 3 is a schematic architecture diagram of another embodiment of the dialogue data processing system of the present invention.

FIG. 4 is a flowchart of the steps of the dialogue data processing method of the present invention.

FIG. 5 is a flowchart of the steps of another embodiment of the dialogue data processing method of the present invention.

FIG. 6 is a flowchart of the dialogue data processing method of the present invention.

The technical content of the present invention is described below through specific embodiments, and those skilled in the art can readily understand its advantages and effects from the content disclosed in this specification. The present invention can also be implemented or applied through other, different embodiments.

FIG. 1 is a schematic architecture diagram of the dialogue data processing system of the present invention. As shown, the dialogue data processing system 10 includes a dialogue data capture module 11 and a dialogue data grouping and labeling module 12. The dialogue data capture module 11 obtains and filters the user's dialogue records, and the dialogue data grouping and labeling module 12 then performs analysis and calculation to determine the category of each sentence in the filtered dialogue records and label it. The invention is described in detail below.

The dialogue data capture module 11 receives or collects dialogue records containing sentences. A dialogue record is the dialogue text formed from the sentences that a user states on the system (for example, explanations, narrations, or questions), as captured by the dialogue data capture module 11. A dialogue record may include a single sentence stated by the user, or multiple sentences stated by the user during one event or over a period of time. In addition, the dialogue data processing system 10 may further be provided with a database for storing data, and the dialogue data capture module 11 can store the dialogue records in the database after collecting them.

After collecting the user's dialogue records, the dialogue data capture module 11 filters them according to preset filter indicators to obtain intent sentences. Specifically, the filter indicators include a satisfaction feedback value, the positive or negative sentiment polarity of a sentence, the intent confidence of the dialogue text, or whether a transfer request was made. That is, the dialogue data capture module 11 uses indicators such as the user's satisfaction feedback value, the sentiment polarity of the user's dialogue text, the model's confidence in its prediction of the dialogue text's intent, and whether the user was transferred to a human agent to filter out the sentences in the dialogue records that are suitable for labeling, which become the intent sentences.
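The filtering described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the `Utterance` fields, the numeric scales, the default thresholds, and the rule that any single indicator is enough to select a sentence are all hypothetical, since the text names the indicators but not how they are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    text: str
    satisfaction: Optional[int]  # feedback score, None if the user gave none (assumed 1-5 scale)
    polarity: float              # sentiment polarity, assumed range [-1, 1]
    intent_confidence: float     # model confidence, assumed range [0, 1]
    transfer_requested: bool     # user asked to be transferred to a human agent


def needs_labeling(u: Utterance,
                   min_satisfaction: int = 3,
                   min_polarity: float = 0.0,
                   min_confidence: float = 0.6) -> bool:
    """Return True if any indicator suggests the sentence is worth labeling."""
    low_satisfaction = u.satisfaction is not None and u.satisfaction < min_satisfaction
    return (low_satisfaction
            or u.polarity < min_polarity
            or u.intent_confidence < min_confidence
            or u.transfer_requested)


def filter_intent_sentences(records: List[Utterance]) -> List[Utterance]:
    """Keep only the utterances that should go on to the labeling stage."""
    return [u for u in records if needs_labeling(u)]
```

A stricter variant could require two or more indicators to agree before a sentence is selected, which would reduce the labeling volume further at the cost of missing some misclassified sentences.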

The dialogue data grouping and labeling module 12 selects a representative sentence from the labeled original data in each of the intent categories. Specifically, each intent category may include multiple sentences, and each sentence has a corresponding semantic vector. To select the representative sentence, the module averages all the semantic vectors in an intent category to obtain the cluster center, calculates the distance between each semantic vector and the cluster center, and takes the sentence whose semantic vector has the smallest distance as the representative sentence. Using the selected representative sentences, the module then calculates the distance between each intent sentence obtained by the dialogue data capture module 11 and each representative sentence, and finds the intent sentence with the largest average semantic distance to the representative sentences. To obtain the average semantic distance, the module first calculates the distance between the intent sentence and each representative sentence and then averages those distances.
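A minimal Python sketch of the representative-sentence selection and the average semantic distance follows. It assumes the semantic vectors are already available as lists of floats and uses Euclidean distance; the text does not name a distance metric or an embedding model, so both are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of the semantic vectors in one intent category."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumed metric; cosine distance would also fit)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_representative(vectors: List[Vector]) -> int:
    """Index of the sentence whose vector lies closest to the cluster center."""
    center = centroid(vectors)
    return min(range(len(vectors)), key=lambda i: euclidean(vectors[i], center))


def average_semantic_distance(sentence: Vector, representatives: List[Vector]) -> float:
    """Mean distance from one intent sentence to every representative sentence."""
    return sum(euclidean(sentence, r) for r in representatives) / len(representatives)
```

Choosing an actual sentence rather than the cluster center itself as the representative means the representative always corresponds to real user text, which is convenient when a human trainer later reviews the groups.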

Furthermore, the dialogue data processing system 10 can preset a threshold for the average semantic distance. After the largest average semantic distance is obtained, it is compared with the predetermined threshold to judge the semantic similarity between the intent sentences and the representative sentences, and the intent sentences are classified accordingly. That is, when the largest average semantic distance among the intent sentences is less than or equal to the predetermined threshold, the intent sentences and the original data, with its labels hidden, are grouped and labeled around the representative sentences; in other words, each intent sentence is classified into, and labeled with, the intent category of the representative sentence semantically closest to it. When the largest average semantic distance exceeds or equals the predetermined threshold, the intent sentence having it becomes the representative sentence of a new intent category, and the intent sentences and the original data, with its labels hidden, are grouped and labeled around all the representative sentences so that the intent sentences and original data in the same group share the same label; in other words, that intent sentence is classified into a new intent category and labeled. Finally, the labeled intent sentences and original data become model training data for training the intent recognition model. The invention can thus filter a large volume of collected user dialogue records down to those suitable for labeling, reducing the number of records that must be labeled, and then label the filtered records automatically. This avoids the poor labeling quality that can result when AI trainers are faced with a large volume of dialogue records to label.
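The threshold decision and the grouping can be sketched as follows. Several points are assumptions made for the sketch: Euclidean distance is used, grouping "around" the representatives is read as assigning each sentence to its nearest representative, and a distance exactly equal to the threshold is treated as within it (the text includes equality in both branches). The function name and the `new_label` default are hypothetical. Original data with hidden labels could be passed in the same list as the intent sentences.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dist(a: Vector, b: Vector) -> float:
    """Euclidean distance between two semantic vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_intent_sentences(
    sentences: List[Vector],
    representatives: Dict[str, Vector],
    threshold: float,
    new_label: str = "new_intent",
) -> Tuple[List[str], Dict[str, Vector]]:
    """Label every sentence with the category of its nearest representative.

    If the sentence with the largest average distance to the representatives
    exceeds the threshold, it first becomes the representative of a new category.
    """
    centers = dict(representatives)  # copy so the caller's mapping is not modified
    averages = [
        sum(dist(s, r) for r in centers.values()) / len(centers) for s in sentences
    ]
    farthest = max(range(len(sentences)), key=lambda i: averages[i])
    if averages[farthest] > threshold:
        centers[new_label] = sentences[farthest]
    labels = [min(centers, key=lambda name: dist(s, centers[name])) for s in sentences]
    return labels, centers
```

Because only the single farthest sentence can seed a new category per pass, repeated passes would be needed to discover several new intents in one batch.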

FIG. 2 is a schematic structural diagram of the dialogue data capture module of the present invention. As shown, the dialogue data capture module 11 includes an emotion recognition unit 111 and an intent recognition unit 112. The emotion recognition unit 111 uses an emotion recognition model to identify the emotional polarity of the intent sentence and thereby produce the sentence sentiment polarity; for example, it can train the emotion recognition model with a deep learning long short-term memory (LSTM) network to identify whether a sentence's sentiment is positive or negative. The intent recognition unit 112 uses an intent recognition model to calculate the intent recognition confidence of the intent sentence and thereby produce the dialogue text intent confidence; for example, it can use keyword matching scores or the classification probabilities of a machine learning or deep learning intent model to calculate and predict the intent confidence of a sentence.

FIG. 3 is a schematic architecture diagram of another embodiment of the dialogue data processing system of the present invention. As shown, the dialogue data processing system 10' of this embodiment is substantially the same as the dialogue data processing system 10 of the first embodiment, except that the dialogue data processing system 10' further includes a dialogue data enhancement module 13. The dialogue data capture module 11 and the dialogue data grouping and labeling module 12 are the same as described above and are not described again.

The dialogue data enhancement module 13 augments the categories with fewer samples when an intent category has too few samples or after comparing the differences in sample counts among all the intent categories. That is, besides checking whether the number of samples within a single intent category is sufficient, it also compares the sample counts of the sentences among the original intent categories, or between the original intent categories and a new intent category, and augments the categories with too few samples. The dialogue data enhancement module 13 can also set a sample-count threshold, so that when the difference in sample counts exceeds the threshold, the intent category with less sentence data is augmented. The module can use synonym replacement, random insertion, random swap, random deletion, data augmentation methods based on machine learning or deep learning, or a combination of these to augment the sentence data, balancing the sample counts of the groups and thereby providing better training data.
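Three of the augmentation operations named above can be sketched on tokenized sentences as follows. This is an illustrative sketch only: the function names, the per-token deletion probability `p`, and the synonym table are hypothetical, and tokenization (in particular word segmentation for Chinese text) is assumed to have been done beforehand.

```python
import random
from typing import Dict, List


def random_swap(tokens: List[str], rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions (no-op for fewer than two tokens)."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def synonym_replace(tokens: List[str], synonyms: Dict[str, List[str]],
                    rng: random.Random) -> List[str]:
    """Replace one randomly chosen token that has an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(synonyms[out[i]])
    return out
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the augmentation reproducible when a seed is fixed.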

In other words, once the sentences have been labeled, the dialogue data enhancement module 13 augments data according to the number of training samples in each intent category, so that the ratio of sample counts between intent categories does not differ by more than a threshold. If the ratio of sample counts between categories differs by more than the threshold, the sentence data in the category with fewer samples is augmented to maintain an appropriate ratio. The invention thus keeps the sample counts of the intent categories balanced and expands the sample counts automatically, so that the AI trainer needs to label only a small amount of useful data to obtain a large volume of high-quality labeled data, and the final model training data yields better training results.
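The balance check can be sketched as a function that reports how many additional samples each category needs. The text states only that the ratio between categories should stay within a threshold; expressing this as "the largest category may be at most `max_ratio` times any other category", together with an optional per-category minimum `min_samples`, is an assumption of this sketch, as are the function and parameter names.

```python
import math
from typing import Dict


def augmentation_plan(counts: Dict[str, int], max_ratio: float = 2.0,
                      min_samples: int = 0) -> Dict[str, int]:
    """Return the number of extra samples each under-represented category needs.

    A category is topped up until it has at least `min_samples` samples and the
    largest category is at most `max_ratio` times its size.
    """
    if not counts:
        return {}
    largest = max(counts.values())
    floor = max(min_samples, math.ceil(largest / max_ratio))
    return {name: floor - n for name, n in counts.items() if n < floor}
```

For example, with counts of 100, 30, and 5 and `max_ratio=2.0`, every category is brought up to at least 50 samples, so the two smaller categories need 20 and 45 additional sentences respectively.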

FIG. 4 is a flowchart of the steps of the dialogue data processing method of the present invention.

In step S401, dialogue records containing sentences are collected. The invention can provide a user interface through which users converse with the system, so as to collect the users' dialogue records, which include the sentences stated by the users. Specifically, the dialogue records are collected by gathering users' sentences from their online conversations with the system. Collection can be configured to run over a time interval (for example, every hour or every few hours, daily, weekly, or monthly), accumulating all user sentences within that interval to form the dialogue records.

In step S402, the dialogue records are filtered according to the filter indicators to obtain intent sentences. The invention filters the collected dialogue records using filter indicators such as sentence sentiment polarity, dialogue text intent confidence, the satisfaction feedback value, and whether the user made a transfer request, or a combined indicator formed from several of these, so as to select from all the user sentences in the dialogue records the intent sentences that can be used for labeling.

Among the filter indicators, the sentence sentiment polarity is obtained by training an emotion recognition model with a deep learning long short-term memory (LSTM) network to identify the emotional polarity of the collected sentences. Specifically, the emotion recognition model is built using approaches such as assigning scores to emotion words and matching them, or machine learning or deep learning emotion models such as an LSTM network, to calculate the sentiment polarity of every collected sentence and thereby identify the emotional polarity of the sentences in the dialogue records.
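Of the approaches mentioned, scoring and matching emotion words is the simplest to illustrate. The sketch below is a lexicon-based stand-in, not the LSTM model the text describes: the lexicon entries, their scores, and the averaging rule are all hypothetical.

```python
from typing import Dict, List

# Hypothetical emotion lexicon: positive scores for positive words, negative for negative ones.
EMOTION_LEXICON: Dict[str, float] = {
    "great": 1.0,
    "thanks": 0.5,
    "useless": -1.0,
    "slow": -0.5,
}


def polarity(tokens: List[str], lexicon: Dict[str, float] = EMOTION_LEXICON) -> float:
    """Average score of the tokens found in the lexicon; 0.0 if none match."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

A sentence whose polarity falls below zero would count as negative under this scheme; a trained model would replace the lookup with a learned score while keeping the same interface.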

Among the filter indicators, the dialogue text intent confidence is calculated from keyword matching scores or from the classification probabilities of a machine learning or deep learning intent model. That is, the invention uses methods such as keyword matching scores or intent model classification probabilities, or a combination of them, to calculate the intent recognition confidence of the user sentences, so that an intent recognition confidence is obtained for every collected sentence.
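When the confidence comes from a classifier's probability output, one common way to obtain it is to apply a softmax to the model's raw scores and take the top probability. The sketch below assumes the intent model exposes one raw score (logit) per intent category; that interface and the function name are assumptions, not details given in the text.

```python
import math
from typing import Dict, Tuple


def intent_confidence(logits: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest-scoring intent and its softmax probability."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(score - peak) for name, score in logits.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total
```

A sentence whose top probability is low is one the current model is unsure about, which makes it a natural candidate for relabeling.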

In step S403, a representative sentence is selected from the labeled original data in each of the intent categories. For each original intent category, the semantic vectors of its sentences are averaged to obtain the centroid of the category, and the sentence whose semantic vector is closest to that centroid is chosen as the representative sentence.

In step S404, each intent sentence is compared with each representative sentence, and the intent sentence with the largest average semantic distance to the representative sentences is identified. When that largest average semantic distance is below a predetermined threshold, the intent sentences and the original data, with the labels of the original data hidden, are clustered around the representative sentences and then labeled, so that the intent sentences and the original data in the same cluster carry the same label. When the largest average semantic distance exceeds the predetermined threshold, the intent sentence with that distance becomes the representative sentence of a new intent category, and the intent sentences and the label-hidden original data are clustered around all representative sentences, including the new one, and labeled in the same way.

The average semantic distance of an intent sentence is computed by first calculating its distance to each representative sentence and then averaging those distances. In short, when the largest average semantic distance is below the threshold, each intent sentence is assigned to the original intent category whose representative sentence is semantically closest to it and is labeled accordingly; when it exceeds the threshold, the corresponding intent sentence is assigned to a new intent category and labeled in the subsequent steps. The labeled intent sentences and the original data are then used to train the intent recognition model. In one embodiment, when dialogue data is processed again later, the new intent category is treated at step S403 as one of the intent categories, and its representative sentence is selected by the same procedure.

For example, to determine the intent category of an intent sentence, one representative sentence R is first selected from each intent category in the previously labeled original data A. If there are N intent categories, N representative sentences $R_1$ to $R_N$ are selected, one per category. Next, each intent sentence in the selected unlabeled new data B is compared semantically with $R_1$ to $R_N$ to compute its average semantic distance; when the average semantic distance exceeds the semantic distance threshold, a new intent category is added for that sentence. If several intent sentences were selected, the average semantic distance of each is computed and the sentence O with the largest value is found; if the average semantic distance of O exceeds the preset threshold, O is chosen as the representative sentence $R_{N+1}$ of a new intent.

The labels of the original data A are then hidden, the intent sentences of the new data B are added, and unsupervised or semi-supervised clustering is performed around the representative sentences $R_1$ to $R_{N+1}$. All sentences from A and B are divided into N+1 clusters, where the first N clusters correspond to the original intent categories and cluster N+1 is the possible new intent category, so that semantically similar sentences fall into the same intent category. Finally, the original labels of data A are revealed, and every unlabeled intent sentence in B is labeled automatically using the known labels of nearby data in the same cluster. This yields N or N+1 intent categories and achieves automatic labeling of the sentences in the dialogue records.

FIG. 5 is a flowchart of the steps of another embodiment of the dialogue data processing method of the present invention. As shown in the figure, steps S401 to S404 are the same as in the previous embodiment; the difference is that this embodiment further includes step S405. In step S405, data augmentation is performed on intent categories with few samples, either when the number of samples in a category is insufficient or after comparing the differences in sample counts among all intent categories (the original categories plus any newly added one). Data augmentation methods include synonym replacement, random insertion, random swap, random deletion, augmentation methods based on machine learning or deep learning, or any combination of these.
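Two of the augmentation operations listed above, random swap and random deletion, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the seeded `random.Random` instance, and the rule that deletion always keeps at least one token are all hypothetical choices made for the example.

```python
import random

def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, repeated n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, n, rng):
    """Delete n randomly chosen tokens, always keeping at least one (assumption)."""
    out = list(tokens)
    n = max(min(n, len(out) - 1), 0)
    for _ in range(n):
        del out[rng.randrange(len(out))]
    return out

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = ["我", "的", "帳單", "有", "問題"]  # a segmented user sentence
swapped = random_swap(tokens, 1, rng)
shortened = random_deletion(tokens, 2, rng)
```

Both functions return new lists and leave the input untouched, so one original sentence can be used to generate several augmented variants.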

Accordingly, the present invention augments the sentence data in each intent category. It first checks whether each intent category has enough sample sentences, then compares the sample counts across categories and augments the categories with few samples, continuing until the sample counts are balanced, that is, until the difference in sample counts is smaller than a preset threshold α.

FIG. 6 is a flowchart of the dialogue data processing method of the present invention. The flow of a specific embodiment is described below.

In process 601, the dialogue data capture module of the dialogue data processing system collects dialogue records. The module collects user sentences whenever a user goes online to ask for information and accumulates all sentences over a period of time to form the dialogue records; it can also receive user dialogue data obtained from an external database or from other dialogue data capture devices. Specifically, an input interface, such as a web page or messaging software, can be provided at the front end so that users can enter natural-language sentences, and the system sends the dialogue records containing the user sentences to a back-end database for recording and storage.

For example, a user enters "There is a problem with my bill" (我的帳單有問題) in the front-end input interface; the system replies "What is the problem?"; the user says "The amount is wrong" (錢算錯); the bot replies "Sorry, I do not understand what you mean"; and the user says "This is terrible" (太爛了吧). From this dialogue, the sentences the user sent to the system are recorded as T = ["There is a problem with my bill"; "The amount is wrong"; "This is terrible"]. After word segmentation, each sentence is converted into vectors with a text-to-vector method (for example, a word2vec model). Taking 我的帳單有問題 as an example, segmentation yields five words: 我 ("I"), 的 (possessive particle), 帳單 ("bill"), 有 ("have"), and 問題 ("problem"). Each word is fed into the word2vec model to obtain its vector, so the sentence can be represented with two-dimensional vectors: 我: [0.123, 0.456], 的: [0.233, 0.536], 帳單: [0.322, 0.689], 有: [0.111, 0.422], 問題: [0.777, 0.543].
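The word-to-vector step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the lookup table `WORD_VECTORS` is a hypothetical stand-in for a trained word2vec model (its values are the ones quoted above), and averaging the word vectors into a single sentence vector is an assumption, since the text does not state how the word vectors are combined.

```python
# Hypothetical stand-in for a trained word2vec model: token -> 2-D vector.
WORD_VECTORS = {
    "我": [0.123, 0.456],
    "的": [0.233, 0.536],
    "帳單": [0.322, 0.689],
    "有": [0.111, 0.422],
    "問題": [0.777, 0.543],
}

def sentence_matrix(tokens):
    """Return one word vector per token; tokens missing from the table are skipped."""
    return [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]

def sentence_vector(tokens):
    """Mean-pool the word vectors into one sentence vector (an assumption)."""
    rows = sentence_matrix(tokens)
    if not rows:
        raise ValueError("no known tokens in sentence")
    dims = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dims)]

tokens = ["我", "的", "帳單", "有", "問題"]  # segmentation of 我的帳單有問題
matrix = sentence_matrix(tokens)
vec = sentence_vector(tokens)
```

Mean pooling is only one common way to obtain a fixed-length sentence vector; the later steps work with any method that places each sentence at a point in the semantic space.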

In process 602, the emotion recognition module identifies the emotional polarity of the dialogue text. The module can train an emotion recognition model with a deep-learning long short-term memory (LSTM) network and use it to score every collected user sentence, producing for each sentence a probability between 0 and 1. A value closer to 1 indicates a more positive emotion, and a value closer to 0 indicates a more negative one. Analyzing the sentences collected by the dialogue data capture module, T = ["There is a problem with my bill"; "The amount is wrong"; "This is terrible"], gives the emotion probabilities E = [0.4, 0.5, 0.1].

In process 603, the intent recognition module identifies the intent confidence of the dialogue text. The module trains an intent recognition model with a supervised machine-learning neural network and applies it to every user sentence collected by the dialogue data capture module. For each sentence the model produces a confidence between 0 and 1 for each intent category, and the probabilities over all intent categories sum to 1. Specifically, the softmax function can be used. For example, suppose the labeled data has three intent categories, ["billing", "mobile phone", "network"], and the model's output values are [1, 2, 3]. The softmax function is

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_i$ is the output value for category $i$ and $K$ is the number of categories.

This calculation yields the probability distribution [0.09, 0.245, 0.665]. The "network" category has the highest probability of the three, so the sentence is classified as "network", and that probability is the confidence of the classification. For the dialogue records T = ["There is a problem with my bill"; "The amount is wrong"; "This is terrible"], each sentence produces a three-dimensional vector whose components correspond to the three intent categories ["billing", "mobile phone", "network"] and sum to 1. Collecting them gives I = [[0.8, 0.1, 0.1], [0.4, 0.3, 0.3], [0.3, 0.4, 0.3]], so the intent confidence of each sentence is its largest class probability, α = [0.8, 0.4, 0.4].
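The softmax calculation and the per-sentence confidence can be checked with a short sketch. The function and variable names are illustrative, and subtracting the maximum before exponentiating is a standard numerical-stability step that is not mentioned in the text.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

CATEGORIES = ["billing", "mobile phone", "network"]

probs = softmax([1.0, 2.0, 3.0])
best = max(range(len(probs)), key=probs.__getitem__)
predicted, confidence = CATEGORIES[best], probs[best]

# Intent confidence of each sentence is its largest class probability.
I = [[0.8, 0.1, 0.1], [0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
alpha = [max(row) for row in I]
```

Rounded to three decimals, `probs` matches the distribution quoted above, and `alpha` reproduces the confidences [0.8, 0.4, 0.4].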

In process 604, the dialogue data capture module filters the dialogue records according to the filter indicators: user feedback, whether the user was transferred to a human agent, the positive or negative emotional polarity identified by the emotion recognition module, and the intent confidence identified by the intent recognition module. The module uses these indicators as the basis for selecting user sentences.

The selection criteria are illustrated as follows. Example 1: sentences for which the user gave unsatisfied feedback. For instance, the user says "My phone cannot receive calls" and the system replies "Your network speed test is normal"; the reply does not match the "mobile phone" question the user meant to ask, so the user gives unsatisfied feedback. Example 2: sentences the user says to a human agent after a transfer. If the user asks to be transferred to a human agent during the dialogue with the system, the sentences the user exchanges with the agent are selected. Example 3: sentences whose emotion probability from the emotion recognition module is below 0.3. For the dialogue records T = ["There is a problem with my bill"; "The amount is wrong"; "This is terrible"] with emotion probabilities E = [0.4, 0.5, 0.1], the sentence "This is terrible" has an emotion probability of 0.1, which is below the threshold of 0.3, so it is selected. Example 4: sentences whose intent confidence from the intent recognition module is below 0.7. For the same records T with intent confidences α = [0.8, 0.4, 0.4], the two sentences ["The amount is wrong"; "This is terrible"] are selected. The sentences selected by the dialogue data capture module according to these four criteria are used in the subsequent steps.
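The four selection criteria can be sketched as a single predicate. This is a minimal sketch under stated assumptions: the dictionary field names are hypothetical, and combining the criteria with a logical OR is an assumption, since the text only says that the indicators may be used individually or in combination. The thresholds 0.3 and 0.7 are the ones used in examples 3 and 4.

```python
EMOTION_THRESHOLD = 0.3      # example 3: emotion probability below 0.3
CONFIDENCE_THRESHOLD = 0.7   # example 4: intent confidence below 0.7

def should_select(utterance):
    """Return True if any of the four criteria marks the sentence for labeling."""
    return bool(
        utterance.get("negative_feedback", False)       # example 1
        or utterance.get("said_to_human_agent", False)  # example 2
        or utterance["emotion"] < EMOTION_THRESHOLD
        or utterance["intent_confidence"] < CONFIDENCE_THRESHOLD
    )

# The three user sentences with the E and alpha values from the text.
T = [
    {"text": "我的帳單有問題", "emotion": 0.4, "intent_confidence": 0.8},
    {"text": "錢算錯", "emotion": 0.5, "intent_confidence": 0.4},
    {"text": "太爛了吧", "emotion": 0.1, "intent_confidence": 0.4},
]
selected = [u["text"] for u in T if should_select(u)]
```

With these values the second and third sentences are selected, the second by the confidence criterion alone and the third by both the emotion and the confidence criteria.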

In process 605, the dialogue data clustering and labeling module determines the cluster to which each dialogue sentence belongs. First, one representative sentence R is selected from each intent category in the original labeled data A that the system labeled previously. To select it, the semantic vectors of the dialogue sentences in the category are averaged to obtain the centroid μ, and the distance between each dialogue sentence and μ is computed with the distance formula

$$d(x, \mu) = \sqrt{\sum_{i=1}^{n} (x_i - \mu_i)^2}$$

where $x$ is the semantic vector of a sentence and $n$ is its dimension. The dialogue sentence closest to the centroid in each intent category is then chosen as the representative sentence R.

Specifically, as shown in Table 1 below, take the "billing" intent category as an example and assume that a two-dimensional vector represents the position of each sentence in the semantic space. The vectors [0.1, 0.2], [0.3, 0.4], and [0.5, 0.6] are the semantic vectors of the sentences "The bill is wrong", "I want to check my billing information", and "I want to see the payment information" in the "billing" category. First, the mean of each dimension is computed to obtain the centroid, [(0.1 + 0.3 + 0.5)/3, (0.2 + 0.4 + 0.6)/3], so μ = [0.3, 0.4]. The distance between "The bill is wrong" and the centroid is then $d(r_1, \mu) = \sqrt{(0.1 - 0.3)^2 + (0.2 - 0.4)^2} = \sqrt{0.08}$, and likewise $d(r_2, \mu) = 0$ and $d(r_3, \mu) = \sqrt{0.08}$. The sentence closest to μ is chosen as the representative sentence, so R = "I want to check my billing information". The original labeled data A contains three intent categories (N = 3), so the representative sentences $R_1$ to $R_3$ are ["I want to check my billing information", "Is there a new mobile phone?", "The network is too slow"], with semantic vectors [[0.3, 0.4], [0.7, 0.7], [0.9, 0.9]].
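The representative-sentence selection for the "billing" category can be reproduced with a short sketch. The function names are illustrative; the vectors are the ones given in the worked example.

```python
import math

def centroid(vectors):
    """Mean of each dimension over all vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def representative(sentences):
    """sentences: dict mapping text -> semantic vector. Returns (text, centroid)."""
    mu = centroid(list(sentences.values()))
    best = min(sentences, key=lambda s: euclidean(sentences[s], mu))
    return best, mu

billing = {
    "帳單錯了": [0.1, 0.2],        # "The bill is wrong"
    "我想查帳單資訊": [0.3, 0.4],  # "I want to check my billing information"
    "要看繳費資訊": [0.5, 0.6],    # "I want to see the payment information"
}
rep, mu = representative(billing)
distances = {s: euclidean(v, mu) for s, v in billing.items()}
```

Here `mu` is approximately [0.3, 0.4], the first and third sentences are both about 0.283 (that is, √0.08) away from it, and the second sentence lies on the centroid and is returned as the representative.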

[Table 1: sentences and semantic vectors of each intent category (table image not available)]

Next, each sentence in the new data T = ["There is a problem with my bill"; "The amount is wrong"; "This is terrible"] selected by the dialogue data capture module is compared semantically with the representative sentences $R_1$ to $R_3$ = ["I want to check my billing information", "Is there a new mobile phone?", "The network is too slow"] to find the sentence O with the largest average semantic distance. The distance between a sentence $t_i$ in T and a representative sentence $r_j$ is

$$d(t_i, r_j) = \sqrt{\sum_{k=1}^{n} (t_{ik} - r_{jk})^2}$$

and the distances from the same sentence $t_i$ to all representative sentences are then combined into its average semantic distance, which in this example is evaluated as the square root of the summed squared distances divided by the number of representative sentences. For $t_1$ = "There is a problem with my bill", assume the sentence vector is [0.3, 0.3]. Its distances to $R_1$ to $R_3$ are $d(t_1, r_1) = \sqrt{0.01}$, $d(t_1, r_2) = \sqrt{0.32}$, and $d(t_1, r_3) = \sqrt{0.72}$, so its average semantic distance is $d(t_1, R) = \sqrt{0.01 + 0.32 + 0.72} / 3 \approx 0.34$. For $t_2$ = "The amount is wrong" with sentence vector [0.2, 0.2], $d(t_2, R) = \sqrt{0.05 + 0.50 + 0.98} / 3 \approx 0.41$, and for $t_3$ = "This is terrible" with sentence vector [0.1, 0.1], $d(t_3, R) = \sqrt{0.13 + 0.72 + 1.28} / 3 \approx 0.49$. The sentence in T farthest from R is therefore O = "This is terrible". With a preset threshold of 0.3, the average semantic distance of O exceeds the threshold, so O is chosen as the representative sentence $R_{N+1}$ of a new intent category, that is, $R_{N+1}$ = "This is terrible".
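The figures in this worked example can be reproduced with the sketch below. One caveat, stated as an observation rather than a correction: the quoted values 0.34, 0.41, and 0.49 correspond to the square root of the summed squared distances divided by the number of representative sentences, and a plain arithmetic mean of the three Euclidean distances would give larger values. The function `avg_semantic_distance` follows the worked example's arithmetic; its name and the dictionary layout are illustrative.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def avg_semantic_distance(vec, reps):
    """Score used in the worked example: sqrt(sum of squared distances) / N."""
    return math.sqrt(sum(sq_dist(vec, r) for r in reps)) / len(reps)

REPS = [[0.3, 0.4], [0.7, 0.7], [0.9, 0.9]]  # vectors of R1..R3
NEW = {
    "我的帳單有問題": [0.3, 0.3],  # "There is a problem with my bill"
    "錢算錯": [0.2, 0.2],          # "The amount is wrong"
    "太爛了吧": [0.1, 0.1],        # "This is terrible"
}
THRESHOLD = 0.3

scores = {s: avg_semantic_distance(v, REPS) for s, v in NEW.items()}
farthest = max(scores, key=scores.get)
# The farthest sentence becomes a new representative only above the threshold.
new_representative = farthest if scores[farthest] > THRESHOLD else None
```

Only the single farthest sentence is compared with the threshold, so at most one new intent category is proposed per pass.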

Next, the original labeled data A and the new data T are used together as training data, and clustering is performed with the representative sentences $R_1$ to $R_{N+1}$ = ["I want to check my billing information", "Is there a new mobile phone?", "The network is too slow", "This is terrible"] as the preset cluster centers. All data is divided into N+1 clusters, where N = 3 is the number of original intent categories and cluster N+1 is the newly added intent category. Here the k-means clustering method is used with k = 4, and the first-round centroids of k-means are replaced by the representative sentences $R_1$ to $R_{N+1}$, so that the algorithm groups semantically similar sentences into the same cluster. For example, the cluster centered on $R_1$ = "I want to check my billing information" is $U_1$ = ["I want to check my billing information", "The bill is wrong", "I want to see the payment information", "There is a problem with my bill"].
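K-means with preset first-round centroids can be sketched in plain Python as below. This is an illustration under stated assumptions: the function `seeded_kmeans` is a hypothetical helper implementing standard Lloyd iterations, and the nine 2-D points are made up for the example; they are not the vectors of Table 1. Only the four seed vectors follow the representative-sentence vectors used in the text.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def seeded_kmeans(points, seeds, max_iter=100):
    """Lloyd's k-means whose first-round centroids are the given seeds.

    Returns (assignments, centroids); assignments[i] is the cluster index of points[i].
    """
    centroids = [list(s) for s in seeds]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = mean(members)
    return assignments, centroids

# Hypothetical semantic vectors (made up for this sketch).
points = [[0.28, 0.42], [0.33, 0.38], [0.30, 0.35],
          [0.68, 0.72], [0.73, 0.70],
          [0.90, 0.92], [0.88, 0.87],
          [0.10, 0.10], [0.12, 0.15]]
# Seeds: vectors of R1..R3 plus the new representative sentence R4.
seeds = [[0.3, 0.4], [0.7, 0.7], [0.9, 0.9], [0.1, 0.1]]
assignments, centroids = seeded_kmeans(points, seeds)
```

Seeding the first round with the representative sentences keeps cluster index k aligned with intent category k, which is what allows the original labels to be mapped back onto clusters afterwards.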

In process 606, the dialogue data clustering and labeling module labels the data in each cluster with an intent category. Every unlabeled new sentence $T_k$ is labeled automatically using the closest sentence $A_k$ in the same cluster whose intent category is known. For example, for $T_1$ = "There is a problem with my bill", the nearest-neighbor method finds, within the same cluster $U_1$ = ["I want to check my billing information", "The bill is wrong", "I want to see the payment information", "There is a problem with my bill"], the nearest neighbor Nb = "I want to check my billing information", and $T_1$ is labeled with the same intent category as Nb, namely "billing". In this way every sentence in the new data T is labeled with its intent category, giving $I_{\text{billing}}$ = ["I want to check my billing information", "The bill is wrong", "I want to see the payment information", "There is a problem with my bill"], $I_{\text{mobile}}$ = ["I want to check new phone models", "Is there a new mobile phone?", "Are there any new phones?"], $I_{\text{network}}$ = ["The network is too slow", "The network is disconnected", "I cannot connect to the network"], and $I_{\text{other}}$ = ["The amount is wrong"; "This is terrible"].
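The nearest-neighbor labeling inside a cluster can be sketched as follows. Several things here are assumptions made for illustration: the cluster indices are taken as given from the clustering step, the tuple layout and function name are hypothetical, and the fallback label 其他 ("other") for a cluster that contains no labeled sentence mirrors the "other" category in the example above. The billing vectors are the ones used earlier in the worked example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def label_by_nearest_neighbor(new_items, labeled_items, fallback="其他"):
    """new_items: (text, vector, cluster); labeled_items: (text, vector, cluster, label).

    Each new sentence takes the label of its nearest labeled sentence in the same
    cluster, or the fallback label if the cluster has no labeled sentence.
    """
    result = {}
    for text, vec, cluster in new_items:
        peers = [item for item in labeled_items if item[2] == cluster]
        if not peers:
            result[text] = fallback
            continue
        nearest = min(peers, key=lambda item: euclidean(vec, item[1]))
        result[text] = nearest[3]
    return result

labeled = [
    ("帳單錯了", [0.1, 0.2], 0, "帳單"),
    ("我想查帳單資訊", [0.3, 0.4], 0, "帳單"),
    ("要看繳費資訊", [0.5, 0.6], 0, "帳單"),
]
new = [
    ("我的帳單有問題", [0.3, 0.3], 0),  # clustered with the billing sentences
    ("錢算錯", [0.2, 0.2], 3),          # in the new cluster, no labeled peers
    ("太爛了吧", [0.1, 0.1], 3),
]
labels = label_by_nearest_neighbor(new, labeled)
```

Restricting the search to the sentence's own cluster is what distinguishes this step from a global nearest-neighbor classifier.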

In processes 607 and 608, the dialogue data augmentation module augments the data of the intent categories. The module first checks whether each intent category has enough samples, that is, enough sentences, and then compares the sample counts across intent categories and augments the categories with few samples. Augmentation methods include synonym replacement (randomly pick N non-stop words and replace each with a synonym), random insertion (randomly pick one non-stop word and insert a synonym of it at an arbitrary position in the sentence, repeated N times), random swap (pick two words in the sentence and swap their positions, repeated N times), random deletion (randomly pick N words and delete them), and methods based on machine learning or deep learning. The sentence data is augmented until the sample counts of the intent categories are balanced, as judged against a preset threshold α.

In addition, a sample-count threshold can be set in the dialogue data augmentation module. After the number of sentences Q in each intent category is counted, the module checks whether each Q reaches the threshold (assumed here to be 2). With $Q(I_{\text{billing}}) = 4$, $Q(I_{\text{mobile}}) = 3$, $Q(I_{\text{network}}) = 3$, and $Q(I_{\text{other}}) = 2$, every category reaches the threshold, so the balance of the data is evaluated next by computing each category's ratio to the largest category, $E_i = x_i / \max(x)$. This gives $E(I_{\text{billing}}) = 1$, $E(I_{\text{mobile}}) = 0.75$, $E(I_{\text{network}}) = 0.75$, and $E(I_{\text{other}}) = 0.5$. With a threshold α = 0.6, the amount of data in $I_{\text{other}}$ is judged to be too low, so that category requires data augmentation.
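The two checks described above, the minimum sample count and the balance ratio E = x_i / max(x), can be sketched as follows. The function name and its return shape are illustrative; the counts, the minimum of 2, and α = 0.6 come from the example.

```python
def balance_report(counts, min_samples=2, alpha=0.6):
    """Return (ratios, categories needing augmentation).

    A category needs augmentation if it has fewer than min_samples sentences
    or if its ratio to the largest category is below alpha.
    """
    largest = max(counts.values())
    ratios = {c: n / largest for c, n in counts.items()}
    needs = [c for c, n in counts.items()
             if n < min_samples or ratios[c] < alpha]
    return ratios, needs

counts = {"帳單": 4, "手機": 3, "網路": 3, "其他": 2}  # billing, mobile, network, other
ratios, needs = balance_report(counts)
```

With these counts the ratios are 1.0, 0.75, 0.75, and 0.5, and only the "other" category falls below α.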

For data augmentation, synonym replacement, a common technique in natural language processing, can be used. With a predefined synonym list ["太爛", "太差", "太遜"] (all roughly "terrible"), the sentence "太爛了吧" ("This is terrible") in $I_{\text{other}}$ is augmented by synonym replacement to add "太差了吧". The ratio then becomes $E(I_{\text{other}}) = 0.75$, which exceeds the threshold α = 0.6, and augmentation stops. The process above yields a balanced training data set that lets the system's intent recognition model keep learning and be updated. The model can be updated by retraining, by fine-tuning, or by a combination of both: retraining adds the new data to the old data and trains the model again on both, whereas fine-tuning keeps the previous model weights and updates only the weights of the upper layers of the network.
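The synonym-replacement loop in this example can be sketched as below. The helper names are hypothetical, and plain substring replacement is an assumption standing in for replacement on segmented words; the synonym list, the category size of 4 for the largest category, and α = 0.6 are taken from the example.

```python
SYNONYMS = [["太爛", "太差", "太遜"]]  # predefined synonym list from the example

def synonym_variants(sentence, synonym_groups):
    """Return new sentences made by swapping a known word for each of its synonyms."""
    variants = []
    for group in synonym_groups:
        for word in group:
            if word not in sentence:
                continue
            for alt in group:
                candidate = sentence.replace(word, alt)
                if alt != word and candidate not in variants:
                    variants.append(candidate)
    return variants

def augment_until_balanced(samples, largest, alpha, synonym_groups):
    """Add synonym variants until len(samples) / largest reaches alpha."""
    out = list(samples)
    queue = [v for s in samples for v in synonym_variants(s, synonym_groups)]
    while len(out) / largest < alpha and queue:
        variant = queue.pop(0)
        if variant not in out:
            out.append(variant)
    return out

other = ["錢算錯", "太爛了吧"]
augmented = augment_until_balanced(other, largest=4, alpha=0.6,
                                   synonym_groups=SYNONYMS)
```

The loop stops after adding one variant, because 3/4 = 0.75 already reaches α, which matches the stopping point described above.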

Each module and unit in the present invention can be software, hardware, or firmware. If hardware, it can be a processing unit, processor, computer, or server with data processing and computing capability; if software or firmware, it can include instructions executable by a processing unit, processor, computer, or server.

In addition, the present invention discloses a computer-readable medium for use in a computing device or computer having a processor (for example, a CPU or GPU) and/or memory. The medium stores instructions, and the computing device or computer executes them through the processor and/or memory to perform the method and steps described above.

In summary, the present invention provides a dialogue data processing system, method, and computer-readable medium with automated data capture, clustering and labeling, and augmentation. The dialogue data capture module collects dialogue data for which users gave feedback and filters user sentences according to indicators such as the user's emotional polarity, the intent confidence of the dialogue text, and whether the user was transferred to a human agent. The dialogue data clustering and labeling module performs unsupervised clustering on the existing dialogue data and pre-labels it automatically, helping AI trainers classify the intent categories of dialogue data quickly and effectively. This addresses the problem that AI trainers conventionally face so many and such varied intent categories that it is hard to classify a new sentence intuitively and immediately. Furthermore, the invention adjusts the number of samples in each intent category and augments categories with too few samples, so that the sample counts in the training data set stay balanced.

Accordingly, the present invention achieves the following effects.

First, it remedies the drawback that AI trainers must label large volumes of dialogue records, which can lead to poorly labeled data.

Second, it combines data filtering and capture, aggregation of semantically similar sentences, and augmentation to compute and select the best training data automatically.

Third, dialogue data capture considers not only indicators such as whether the user's feedback indicates satisfaction and whether the user was transferred to a human agent, but also two further indicators: the positive or negative emotional polarity that the system detects in the user's dialogue text, and the model confidence with which the system predicts the intent of the question. Together these indicators filter out the dialogue records suitable for labeling by AI trainers.

Fourth, clustering and automatic labeling of dialogue data help AI trainers process large amounts of data to be labeled quickly, reducing the labor of data labeling.

Fifth, dialogue data augmentation increases the richness and diversity of the data in each intent category and, combined with a mechanism that checks the balance among categories, ensures that the prediction quality of the subsequent intent detection model is not biased by data imbalance.
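The balance check and augmentation named in the fifth effect can be illustrated with a short Python sketch. It is an assumption-laden example, not the patent's method: the `min_count` and `max_ratio` thresholds are invented for illustration, sentences are assumed to be whitespace-tokenized (Chinese text would need a word segmenter first), and only two of the listed operations, random swap and random deletion, are shown.

```python
import random
from collections import Counter
from typing import Dict, List


def find_underrepresented(
    labels: List[str], min_count: int = 3, max_ratio: float = 2.0
) -> Dict[str, int]:
    """Return categories that are too small on their own or relative to the largest."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    return {
        label: n
        for label, n in counts.items()
        if n < min_count or largest / n > max_ratio
    }


def random_swap(tokens: List[str], rng: random.Random) -> List[str]:
    """Swap two randomly chosen token positions."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_delete(tokens: List[str], rng: random.Random, p: float = 0.2) -> List[str]:
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def augment_category(sentences: List[str], target: int, rng: random.Random) -> List[str]:
    """Add perturbed copies of existing sentences until the category has `target` samples."""
    if not sentences:
        return []
    augmented = list(sentences)
    while len(augmented) < target:
        tokens = rng.choice(sentences).split()
        operation = rng.choice([random_swap, random_delete])
        augmented.append(" ".join(operation(tokens, rng)))
    return augmented
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which makes it easier to audit which synthetic sentences entered the training set.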

The embodiments described above are illustrative only and do not limit the scope of the present invention. Any equivalent modification or change made to them that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of this application.

10: Dialogue data processing system

11: Dialogue data capture module

12: Dialogue data grouping and labeling module

Claims (15)

1. A dialogue data processing system, comprising:
a dialogue data capture module for collecting dialogue records containing sentences and filtering the dialogue records according to filter indicators to obtain intent sentences; and
a dialogue data grouping and labeling module for selecting a representative sentence from the labeled original data in each of a plurality of intent categories, and comparing the intent sentences with each of the representative sentences to find, among the intent sentences, the one whose average semantic distance to the representative sentences is the largest, such that, when the largest average semantic distance is less than a predetermined threshold, the intent sentences and the original data with its labels hidden are grouped and labeled with the representative sentences as centers, so that the intent sentences and the original data in the same group have the same label, or, when the largest average semantic distance exceeds the predetermined threshold, the intent sentence having the largest average semantic distance is made a new representative sentence of a new intent category, and the intent sentences and the original data with its labels hidden are grouped and labeled with all of the representative sentences and the new representative sentence as centers, so that the intent sentences and the original data in the same group have the same label.

2. The dialogue data processing system of claim 1, wherein the dialogue data grouping and labeling module computes a cluster center of the plurality of semantic vectors corresponding to the plurality of sentences of each of the plurality of intent categories, and takes, from among the plurality of semantic vectors, the one closest to the cluster center as the representative sentence.

3. The dialogue data processing system of claim 1, further comprising a dialogue data augmentation module for data augmentation, which performs data augmentation on the categories with few samples when the number of samples of any of the intent categories is insufficient or after comparing the differences in sample counts among all of the intent categories.

4. The dialogue data processing system of claim 3, wherein the data augmentation uses synonym replacement, random insertion, random swap, random deletion, data augmentation methods based on machine learning and deep learning, or any combination thereof.

5. The dialogue data processing system of claim 1, wherein the filter indicators include a satisfaction feedback value, a positive or negative sentence emotion polarity, a dialogue text intent confidence, or whether there is a transfer request.

6. The dialogue data processing system of claim 5, wherein the dialogue data capture module further comprises an emotion recognition unit that uses an emotion recognition model to recognize the emotion polarity of the intent sentence, thereby producing the positive or negative sentence emotion polarity.

7. The dialogue data processing system of claim 5, wherein the dialogue data capture module further comprises an intent recognition unit that uses an intent recognition model to calculate the intent recognition confidence of the intent sentence, thereby producing the dialogue text intent confidence.

8. A dialogue data processing method, comprising:
collecting dialogue records containing sentences;
filtering the dialogue records according to filter indicators to obtain intent sentences;
selecting a representative sentence from the labeled original data in each of a plurality of intent categories; and
comparing the intent sentences with each of the representative sentences to find, among the intent sentences, the one whose average semantic distance to the representative sentences is the largest, such that, when the largest average semantic distance is less than a predetermined threshold, the intent sentences and the original data with its labels hidden are grouped and labeled with the representative sentences as centers, so that the intent sentences and the original data in the same group have the same label, or, when the largest average semantic distance exceeds the predetermined threshold, the intent sentence having the largest average semantic distance is made a new representative sentence of a new intent category, and the intent sentences and the original data with its labels hidden are grouped and labeled with all of the representative sentences and the new representative sentence as centers, so that the intent sentences and the original data in the same group have the same label.

9. The dialogue data processing method of claim 8, wherein the step of selecting the representative sentence computes a cluster center of the plurality of semantic vectors corresponding to the plurality of sentences of each of the plurality of intent categories, and takes, from among the plurality of semantic vectors, the one closest to the cluster center as the representative sentence.

10. The dialogue data processing method of claim 9, wherein the step of filtering the dialogue records according to the filter indicators to obtain the intent sentences further comprises performing data augmentation on the intent categories with few samples when the number of samples of any of the intent categories is insufficient or after comparing the differences in sample counts among all of the intent categories.

11. The dialogue data processing method of claim 10, wherein the data augmentation uses synonym replacement, random insertion, random swap, random deletion, data augmentation methods based on machine learning and deep learning, or any combination thereof.

12. The dialogue data processing method of claim 8, wherein the filter indicators include a satisfaction feedback value, a positive or negative sentence emotion polarity, a dialogue text intent confidence, or whether there is a transfer request.

13. The dialogue data processing method of claim 12, wherein the positive or negative sentence emotion polarity is produced by using an emotion recognition model to recognize the emotion polarity of the intent sentence.

14. The dialogue data processing method of claim 12, wherein the dialogue text intent confidence is produced by using an intent recognition model to calculate the intent recognition confidence of the intent sentence.

15. A computer-readable medium for use in a computing device or computer, storing instructions for executing the dialogue data processing method of any one of claims 8 to 14.
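As an illustration of the representative-sentence selection and threshold test recited in claims 1, 2, 8, and 9, the following Python sketch may help. It rests on several assumptions that the claims do not fix: semantic distance is taken to be Euclidean distance between sentence vectors, grouping is done by assigning each sentence to its nearest center, a distance exactly equal to the threshold is treated as not exceeding it, and the name `new_intent` is a placeholder. For brevity it labels only the intent sentences, whereas the claims also regroup the original data with its labels hidden.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance, used here as a stand-in for semantic distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def representative(vectors: List[Vector]) -> int:
    """Index of the vector closest to the centroid of one intent category."""
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return min(range(len(vectors)), key=lambda i: euclidean(vectors[i], centroid))


def farthest_intent_sentence(
    intent_vectors: List[Vector], rep_vectors: List[Vector]
) -> Tuple[int, float]:
    """Intent sentence with the largest average distance to all representatives."""
    averages = [
        sum(euclidean(v, r) for r in rep_vectors) / len(rep_vectors)
        for v in intent_vectors
    ]
    index = max(range(len(averages)), key=averages.__getitem__)
    return index, averages[index]


def assign_labels(
    intent_vectors: List[Vector],
    reps: Dict[str, Vector],
    threshold: float,
    new_label: str = "new_intent",  # placeholder name for the new category
) -> List[str]:
    """Label each intent sentence by nearest center, adding a new center if needed."""
    centers = dict(reps)
    index, distance = farthest_intent_sentence(intent_vectors, list(reps.values()))
    if distance > threshold:
        # The outlier becomes the representative sentence of a new intent category.
        centers[new_label] = intent_vectors[index]
    return [
        min(centers, key=lambda name: euclidean(v, centers[name]))
        for v in intent_vectors
    ]
```

With two representatives at `(0, 0)` and `(10, 0)`, an intent sentence at `(6, 20)` has an average distance of about 20.6 to them, so a threshold of 10 opens a new category around it while a threshold of 30 folds it into the nearer existing category.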
TW110106716A 2021-02-25 2021-02-25 Dialogue data processing system and method thereof and computer readable medium TWI761090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW110106716A TWI761090B (en) 2021-02-25 2021-02-25 Dialogue data processing system and method thereof and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW110106716A TWI761090B (en) 2021-02-25 2021-02-25 Dialogue data processing system and method thereof and computer readable medium

Publications (2)

Publication Number Publication Date
TWI761090B TWI761090B (en) 2022-04-11
TW202234285A true TW202234285A (en) 2022-09-01

Family

ID=82199148

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110106716A TWI761090B (en) 2021-02-25 2021-02-25 Dialogue data processing system and method thereof and computer readable medium

Country Status (1)

Country Link
TW (1) TWI761090B (en)

