TWI802459B

TWI802459B - A system and method for recommendation q&a based on data-enhanced

Info

Publication number: TWI802459B
Application number: TW111124715A
Authority: TW
Inventors: 王振愷; 郭敏楓
Original assignee: 中華電信股份有限公司
Priority date: 2022-07-01
Filing date: 2022-07-01
Publication date: 2023-05-11
Also published as: TW202403608A

Abstract

A system and method for recommendation Q&A based on data enhancement are provided. The system includes a transceiver, a storage medium, and a processor. The storage medium stores a plurality of modules and a database. The processor is coupled to the storage medium and the transceiver, and accesses and executes a pre-training model selection module, a data enhancement module, an intent recognition module, and a reply retrieval module. The pre-training model selection module generates a domain-specific training subset and a general-domain training subset. The data enhancement module generates an enhanced training data set. The intent recognition module receives data input by a user and outputs a category label and a keyword combination. The reply retrieval module retrieves the best recommendation from the database based on the category label and the keyword combination and recommends it to the user.

Description

System and method for recommendation question answering based on data enhancement

The present invention relates to a system and method for recommendation question answering, and in particular to a system and method for recommendation question answering based on data enhancement.

Supervised learning relies on a large amount of labeled data for training. When a classification model is trained on too little data, overfitting may occur: the limited range of the data reduces the regularization ability of the model, so that advanced algorithms cannot be used to full effect. Training a model with good regularization ability depends heavily on data of sufficient quantity and quality, but collecting such data is often very expensive.

The present invention provides a system and method for recommendation question answering based on data enhancement, which expands the training data set by generating a high-quality enhanced data set so as to avoid overfitting in machine learning.

A system for recommendation question answering based on data enhancement according to the present invention includes a transceiver, a storage medium, and a processor. The storage medium stores a plurality of modules and a database. The processor is coupled to the storage medium and the transceiver, and accesses and executes the modules, which include a pre-training model selection module, a data enhancement module, an intent recognition module, and a reply retrieval module. The pre-training model selection module classifies training data into domain-specific data and general-domain data, selects a domain-specific pre-trained model and a general-domain pre-trained model, and generates from the domain-specific data and the general-domain data a corresponding domain-specific training subset and general-domain training subset, respectively. The data enhancement module is electrically connected to the pre-training model selection module; it extracts candidate keywords from the domain-specific training subset and the general-domain training subset, respectively, generates a category keyword list, and performs data enhancement using the category keyword list together with the domain-specific pre-trained model and the general-domain pre-trained model to generate an enhanced training data set. The intent recognition module is electrically connected to the data enhancement module; it receives data input by a user, identifies and outputs a category label of the data using a classification model trained on the enhanced training data set, and extracts and outputs a keyword combination of the data based on the category keyword list and the data. The reply retrieval module is electrically connected to the intent recognition module; it retrieves the best recommendation data from the database based on the category label and the keyword combination and recommends it to the user.

In an embodiment of the present invention, the pre-training model selection module of the above system further classifies the training data into the domain-specific data and the general-domain data according to a probability threshold.

In an embodiment of the present invention, the domain-specific training subset consists of training data related to a specific domain, and the general-domain training subset consists of training data not related to the specific domain.

In an embodiment of the present invention, the data enhancement module further computes a category vector for each category label in the domain-specific training subset and the general-domain training subset, computes the similarity between the candidate keywords and the category vector of each category label to generate the category keyword list, generates similar words based on the category keyword list, the domain-specific pre-trained model, and the general-domain pre-trained model, and uses the similar words to expand the training data set and generate the enhanced training data set.

In an embodiment of the present invention, the data enhancement module further filters out the stop words in the domain-specific training subset and the general-domain training subset, performs word segmentation, outputs word segmentation result information, and extracts the candidate keywords from the word segmentation result information using a statistical method.

In an embodiment of the present invention, the intent recognition module further trains the classification model based on the enhanced training data set, a validation data set, and a neural network architecture, and predicts the category label of the data that is input after the user's dialogue is converted from speech to text, so as to obtain the category label of the data.

In an embodiment of the present invention, the reply retrieval module includes a conversational semantic parsing module and a dialogue state tracking module.

In an embodiment of the present invention, the reply retrieval module further passes the category label and the keyword combination of the data through the conversational semantic parsing module, stores the user's dialogue in the dialogue state tracking module, computes the similarity between the data and the recommendation data stored in the database, and recommends the best recommendation data to the user via the conversational semantic parsing module.

In an embodiment of the present invention, the reply retrieval module further ranks the recommendation data stored in the database according to semantic strength, dialogue time, and number of clicks, based on a Top-N recommendation algorithm and via the conversational semantic parsing module, so as to recommend the best recommendation data to the user.

A method for recommendation question answering based on data enhancement according to the present invention includes: classifying training data into domain-specific data and general-domain data, selecting a domain-specific pre-trained model and a general-domain pre-trained model, and generating from the domain-specific data and the general-domain data a corresponding domain-specific training subset and general-domain training subset, respectively; extracting keywords from the domain-specific training subset and the general-domain training subset, respectively, generating a category keyword list, and performing data enhancement using the category keyword list together with the domain-specific pre-trained model and the general-domain pre-trained model to generate an enhanced training data set; receiving data input by a user, identifying and outputting a category label of the data using a classification model trained on the enhanced training data set, and extracting and outputting a keyword combination of the data based on the category keyword list and the data; and retrieving the best recommendation data from a database based on the category label and the keyword combination and recommending it to the user.

Based on the above, the present invention provides a system and method for recommendation question answering based on data enhancement. The sentences in the training data set are divided by domain relevance into a domain-specific training subset and a general-domain training subset. Category keywords are extracted from the training subsets, and similar words are generated from those keywords using the pre-trained model of the corresponding domain. The generated similar words preserve the key information of the category label, so the training data set is expanded with a high-quality enhanced data set and overfitting in machine learning is avoided.

To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. Where the same reference numerals appear in different drawings, they denote the same or similar elements. These embodiments are only part of the present invention and do not disclose all of its possible implementations. Rather, they are merely examples of the method, electronic device, and computer-readable storage medium within the scope of the claims of the present invention.

FIG. 1 is a schematic diagram of a system for recommendation question answering based on data enhancement according to an embodiment of the present invention.

Referring to FIG. 1, the system 10 for recommendation question answering based on data enhancement includes a transceiver 110, a storage medium 120, and a processor 130.

The transceiver 110 transmits and receives signals wirelessly or over a wired connection. The transceiver 110 may also perform operations such as low-noise amplification, impedance matching, frequency mixing, up- or down-conversion, filtering, amplification, and the like.

The storage medium 120 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk drive (HDD), solid state drive (SSD), a similar component, or a combination of these components. The storage medium 120 records a plurality of instructions executable by the processor 130 and also stores a plurality of modules and various application programs executable by the processor 130.

The processor 130 is, for example, a central processing unit (CPU), or another programmable general-purpose or special-purpose micro control unit (MCU), microprocessor, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), graphics processing unit (GPU), arithmetic logic unit (ALU), complex programmable logic device (CPLD), field programmable gate array (FPGA), a similar component, or a combination of these components. The processor 130 is coupled to the storage medium 120 and the transceiver 110, and accesses and executes the modules and application programs stored in the storage medium 120 to control the overall operation of the system 10.

In one embodiment, the storage medium 120 stores a plurality of modules, including the pre-training model selection module 1201, the data enhancement module 1202, the intent recognition module 1203, and the reply retrieval module 1204, as well as the database 1205. The database 1205 stores a plurality of items of recommendation data. The reply retrieval module 1204 includes the conversational semantic parsing module 113 and the dialogue state tracking module 114, whose functions are described below.

FIG. 2 is a schematic diagram of classification using the pre-training model selection module according to an embodiment of the present invention.

Referring to FIG. 2, the pre-training model selection module 1201 uses the domain classifier 102 to classify the original training data set 101 into domain-specific data and general-domain data according to the training data and a probability threshold, and selects a domain-specific pre-trained model and a general-domain pre-trained model to generate the domain-specific training subset 104 and the general-domain training subset 103 from the domain-specific data and the general-domain data, respectively. The domain-specific training subset 104 contains the training data that is more relevant to the specific domain, and the general-domain training subset 103 contains the training data that is less relevant to it; the original training data set 101 comprises the data of both subsets. In this embodiment, the domain classifier 102 is a pre-trained binary classification model. It is typically trained either on previously collected in-domain and out-of-domain data, or on a large amount of in-domain data (content matching a previously collected domain keyword list) and out-of-domain data (content not matching the list) gathered with a web crawler.

FIG. 3 is a schematic diagram of the use of the data enhancement module according to an embodiment of the present invention.

Referring to FIG. 3, the data enhancement module 1202 is electrically connected to the pre-training model selection module 1201 and applies text preprocessing to the domain-specific training subset 104 and the general-domain training subset 103. Specifically, the data enhancement module 1202 filters out the stop words in the two subsets, performs word segmentation, and outputs word segmentation result information. It then extracts candidate keywords from the word segmentation result information per category using a statistical method, computes the category vector of the category label corresponding to each item of training data in the two subsets, and computes the similarity between the candidate keywords and the category vector of each category label to generate the category keyword list 111. Similar words 105 are generated based on the category keyword list 111, the domain-specific pre-trained model, and the general-domain pre-trained model. In this embodiment, the similar words 105 are derived from the keywords that are most representative of a category label, so they preserve the key information of that label. The similar words 105 are used to expand the original training data set 101 and generate the enhanced training data set 106.
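As a non-limiting illustration, the final expansion step described above may be sketched as follows. The sketch assumes that similar words have already been generated and are supplied as a lookup table; the function name `augment` and the hand-written table are hypothetical stand-ins for the output of the pre-trained models, not part of the disclosed system.

```python
def augment(samples, similar_words):
    """Expand labeled, segmented sentences by swapping in similar words.

    samples:       list of (tokens, label) pairs, tokens already segmented.
    similar_words: label -> {keyword: [similar word, ...]} (hypothetical table;
                   in the described system these come from pre-trained models).
    Returns the original samples followed by the generated variants.
    """
    augmented = list(samples)
    for tokens, label in samples:
        table = similar_words.get(label, {})
        for pos, token in enumerate(tokens):
            # One new sentence per (keyword position, similar word) pair;
            # the label is kept, so the category information is preserved.
            for alternative in table.get(token, []):
                variant = tokens[:pos] + [alternative] + tokens[pos + 1:]
                augmented.append((variant, label))
    return augmented


samples = [(["labor", "pension", "amount"], "pension")]
table = {"pension": {"pension": ["retirement fund", "annuity"]}}
enhanced = augment(samples, table)
```

Only one keyword is replaced per generated sentence in this sketch; replacing several keywords at once would be an equally plausible design.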

FIG. 4 is a schematic diagram of the use of the intent recognition module according to an embodiment of the present invention.

Referring to FIG. 4, the intent recognition module 1203 is electrically connected to the data enhancement module 1202. After the user's dialogue is converted from speech to text, the intent recognition module 1203 receives the resulting data 110 input by the user and predicts its category label. The intent recognition module 1203 trains the classification model 109 based on the enhanced training data set 106, the validation data set 107, and the neural network architecture 108, and, based on the category keyword list 111 and the keywords extracted from the data 110, outputs the category label and keyword combination 112 of the data 110.
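One way to picture the keyword-combination step is the minimal sketch below. It assumes that the category label has already been predicted by the classification model and that the keyword combination is simply the input tokens that also appear in the category keyword list for that label; the source does not specify the matching rule, and the function name `keyword_combination` is hypothetical.

```python
def keyword_combination(tokens, label, category_keywords):
    """Return the input tokens found in the keyword list of the given label.

    tokens:            segmented user input.
    label:             category label predicted for the input (assumed given).
    category_keywords: label -> list of category keywords (hypothetical data).
    Order of first appearance is kept and duplicates are dropped.
    """
    allowed = set(category_keywords.get(label, []))
    combination = []
    for token in tokens:
        if token in allowed and token not in combination:
            combination.append(token)
    return combination


keywords = {"old-age benefits": ["age", "benefits", "calculation"]}
query = ["labor insurance", "benefits", "age", "calculation", "age"]
combo = keyword_combination(query, "old-age benefits", keywords)
```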

FIG. 5 is a schematic diagram of the use of the reply retrieval module according to an embodiment of the present invention.

Referring to FIG. 5, the reply retrieval module 1204 is electrically connected to the intent recognition module 1203. The reply retrieval module 1204 passes the category label and keyword combination 112 of the data 110 through the conversational semantic parsing module 113, stores the user's dialogue corresponding to the data 110 in the dialogue state tracking module 114, and computes the similarity between the data 110 and the recommendation data stored in the database 1205. Based on a Top-N recommendation algorithm, the recommendation data stored in the database 1205 is ranked, via the conversational semantic parsing module 113, according to semantic strength, dialogue time, and number of clicks, and the best recommendation data is recommended to the user.
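The ranking step can be illustrated with the following sketch. The source names the three ranking signals but not how they are combined, so the weighted sum, the weights, the normalization of each signal, and all identifiers (`Candidate`, `top_n`) are assumptions made purely for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    semantic_strength: float    # similarity to the query, assumed in [0, 1]
    dialogue_time_score: float  # dialogue-time signal, assumed scaled to [0, 1]
    clicks: int                 # raw click count


def top_n(candidates, n=3, weights=(0.6, 0.2, 0.2)):
    """Return the n best candidates under a weighted sum (assumed scoring)."""
    w_sem, w_time, w_click = weights
    # Normalize clicks by the largest count so all three terms share a scale.
    max_clicks = max((c.clicks for c in candidates), default=0) or 1

    def score(c):
        return (w_sem * c.semantic_strength
                + w_time * c.dialogue_time_score
                + w_click * c.clicks / max_clicks)

    return heapq.nlargest(n, candidates, key=score)


pool = [
    Candidate("A", 0.9, 0.5, 10),
    Candidate("B", 0.4, 0.9, 100),
    Candidate("C", 0.2, 0.1, 5),
]
best = top_n(pool, n=2)
```

With these illustrative weights, semantic strength dominates, so candidate "A" outranks the more frequently clicked "B"; a different weighting would change the order.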

FIG. 6 is a flowchart of a method for recommendation question answering based on data enhancement according to an embodiment of the present invention.

Referring to FIG. 6, in step S601, the pre-training model selection module 1201 classifies the training data into domain-specific data and general-domain data, selects a domain-specific pre-trained model and a general-domain pre-trained model, and generates from the domain-specific data and the general-domain data a corresponding domain-specific training subset and general-domain training subset, respectively.

In step S602, the data enhancement module 1202 extracts candidate keywords from the domain-specific training subset and the general-domain training subset, respectively, generates the category keyword list 111, and performs data enhancement using the category keyword list 111 together with the domain-specific pre-trained model and the general-domain pre-trained model to generate the enhanced training data set.

In step S603, the intent recognition module 1203 receives the data input by the user, identifies and outputs the category label of the data using the classification model trained on the enhanced training data set, and extracts and outputs the keyword combination of the data based on the category keyword list 111 and the data.

In step S604, the reply retrieval module 1204 retrieves the best recommendation data from the database 1205 based on the category label and the keyword combination and recommends it to the user.

In one embodiment, a scenario in which a user operates a smart speaker is used for illustration. The user interacts with the system 10 through a terminal device (for example, a smart speaker or a mobile phone), and question-and-answer data related to labor insurance is used as the example data set. The system 10 can provide a question-and-answer service in the field of labor insurance, covering topics such as how to obtain a replacement labor pension payment slip or how to claim farmers' insurance maternity benefits.

With reference to FIG. 1, in this embodiment the user asks the question "Explanation of how age is calculated for labor insurance old-age benefits". After receiving the question, the speaker converts it from speech to text and sends the result over the network to the system 10. The system 10 then passes the question through the intent recognition module 1203 and the reply retrieval module 1204 in sequence, retrieves the best recommendation data (that is, the answer) from the database 1205, converts the answer from text to speech, and sends it to the user's speaker. The user therefore hears the answer: "The age for labor insurance old-age benefits is based on the household registration record and is counted in full years from the date of birth." Pre-training model selection and data enhancement are completed in the pre-training stage and take no part in this question-answering flow. How the system 10 handles question answering is described in detail below.

First, the original training data set 101 is passed through the pre-trained domain classifier 102 and, based on a probability threshold, divided into the domain-specific training subset 104 and the general-domain training subset 103. For example, suppose the original training data set 101 contains the two training sentences "Earnings amount and calculation method for an individual labor pension account" and "Why didn't I receive my electronic bill", and the probability threshold of the domain classifier 102 is set to 0.7. The domain classifier 102 predicts a probability of 0.73 for the first sentence and 0.42 for the second. The first sentence is therefore assigned to the domain-specific training subset 104, and the second to the general-domain training subset 103. The corresponding domain-specific pre-trained model and general-domain pre-trained model are then selected for the two subsets and, after training, the results are passed to the data enhancement module 1202. The domain-specific pre-trained model is typically a deep neural network trained on a domain-specific text corpus containing hundreds of millions of words, and the general-domain pre-trained model is typically a deep neural network trained on a general-domain text corpus containing hundreds of millions of words.
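The threshold-based split in this example can be sketched as follows. The domain classifier is replaced here by a stub that returns the two probabilities quoted above, and the comparison `>=` is an assumption, since the source does not say how a probability exactly equal to the threshold is handled; `split_by_domain` and `stub_classifier` are hypothetical names.

```python
def split_by_domain(sentences, predict_proba, threshold=0.7):
    """Split sentences into (domain_specific, general_domain) lists.

    predict_proba: callable returning the in-domain probability of a sentence
                   (stands in for the pre-trained binary domain classifier).
    """
    domain_specific, general_domain = [], []
    for sentence in sentences:
        if predict_proba(sentence) >= threshold:  # tie handling is assumed
            domain_specific.append(sentence)
        else:
            general_domain.append(sentence)
    return domain_specific, general_domain


# Probabilities taken from the worked example in the text.
EXAMPLE_PROBABILITIES = {
    "Earnings amount and calculation method for an individual labor pension account": 0.73,
    "Why didn't I receive my electronic bill": 0.42,
}


def stub_classifier(sentence):
    return EXAMPLE_PROBABILITIES[sentence]


specific, general = split_by_domain(list(EXAMPLE_PROBABILITIES), stub_classifier)
```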

After the data enhancement module 1202 receives the domain-specific training subset 104, the general-domain training subset 103, the domain-specific pre-trained model, and the general-domain pre-trained model from the pre-training model selection module 1201, it begins data enhancement. First, basic text preprocessing is applied to the domain-specific training subset 104 and the general-domain training subset 103. Specifically, the stop words in each sentence are removed and the sentence is segmented into words. For example, the sentence "Earnings amount and calculation method for an individual labor pension account" in the original training data set 101 becomes ["individual", "labor", "pension", "account", "earnings", "amount", "calculation", "method"] after text preprocessing. Candidate keywords are then extracted from the word segmentation results of the two training subsets (the domain-specific training subset 104 and the general-domain training subset 103) using the statistical method term frequency-inverse document frequency (TF-IDF), as shown in FIG. 7.
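The preprocessing and TF-IDF extraction described above may be sketched as follows. The sketch makes several assumptions not stated in the source: each category's tokens are treated as one document, a smoothed inverse document frequency is used, the stop-word list is a tiny English placeholder (the described system works on Chinese text and would need a proper word segmenter), and the name `candidate_keywords` is hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "for", "an"}  # placeholder list (assumption)


def candidate_keywords(tokens_by_label, top_k=3):
    """Return the top_k TF-IDF terms per category label.

    tokens_by_label: label -> list of already segmented tokens.
    """
    # Stop-word filtering.
    filtered = {
        label: [t for t in tokens if t not in STOP_WORDS]
        for label, tokens in tokens_by_label.items()
    }
    n_docs = len(filtered)

    # Document frequency: in how many categories each term occurs.
    doc_freq = Counter()
    for tokens in filtered.values():
        doc_freq.update(set(tokens))

    result = {}
    for label, tokens in filtered.items():
        term_freq = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)  # smoothed IDF
            for term, count in term_freq.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        result[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
    return result


corpus = {
    "pension": ["labor", "pension", "of", "account", "pension", "amount", "special"],
    "billing": ["electronic", "bill", "amount", "bill"],
}
keywords = candidate_keywords(corpus)
```

Terms that occur in every category, such as "amount" here, receive a lower weight and tend to drop out of the candidate list, which is the behavior TF-IDF is chosen for.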

FIG. 7 is a schematic diagram of candidate keywords according to an embodiment of the present invention. For each of the two training subsets (the domain-specific training subset 104 and the general-domain training subset 103), the correspondingly selected pre-trained model (domain-specific or general-domain) is used to compute the category vector of each category label in the subset according to Formula 1.

Formula 1:

$$\mathbf{v}_c = \frac{1}{N_c}\sum_{i=1}^{N_c}\left(e_{i,1}, e_{i,2}, \dots, e_{i,d}\right), \qquad c = 1, \dots, K$$

where $K$ is the number of category labels in each training subset, $c$ is a category label, $N_c$ is the number of data items contained in category label $c$, $d$ is the dimension of the vector, and $e_{i,j}$ is the $j$-th dimension of the word embedding vector of the $i$-th sentence. The similarity between the candidate keywords and the category vector of each category label is then computed by cosine similarity (Formula 2) to obtain the category keyword list 111. Formula 2 is as follows.

Formula 2:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

where $A$ and $B$ are the vectors of the two items being compared (here, a candidate keyword and a category label). The similarity ranges over $[-1, 1]$: a value of $-1$ means that the two vectors $A$ and $B$ point in exactly opposite directions, that is, they are completely dissimilar, and a value of $1$ means that the two vectors point in exactly the same direction, that is, they are completely similar.

After the category keyword list 111 is obtained, similar words are generated for the keywords of each category label in FIG. 7 using the corresponding domain-specific pre-training model and general domain pre-training model; the generated similar words preserve the key information of the category label, as shown in FIG. 8. FIG. 8 is a schematic diagram of category keywords according to an embodiment of the present invention. After the similar words are generated, the original training data set 101 is expanded with them to obtain an enhanced training data set 106. This step is repeated until data enhancement has been completed for all domain-specific training subsets 104 and general domain training subsets 103, and all of the expanded enhanced training data sets 106 are merged into the final enhanced training data set, as shown in FIG. 9. FIG. 9 is a schematic diagram of an enhanced training data set according to an embodiment of the present invention.
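The category vector, cosine similarity, and similar-word expansion steps can be illustrated with a short sketch. It assumes that Formula 1 is the per-dimension mean of the sentence embeddings of a category and that Formula 2 is the standard cosine similarity. The function names, the similarity `threshold`, and the `similar_words` dictionary are hypothetical; in the embodiment the embeddings and similar words come from the selected pre-training models, which are not reproduced here.

```python
import math


def class_vectors(embeddings_by_label):
    """Per-label mean of sentence embedding vectors (assumed form of Formula 1)."""
    vectors = {}
    for label, sentence_vecs in embeddings_by_label.items():
        n = len(sentence_vecs)
        dim = len(sentence_vecs[0])
        vectors[label] = [sum(v[d] for v in sentence_vecs) / n for d in range(dim)]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1] (Formula 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def category_keywords(candidate_vecs, label_vec, threshold=0.5):
    """Keep candidates close to the category vector.

    The threshold is a hypothetical selection rule; the source only says
    that the list is obtained from the similarity.
    """
    scored = [(w, cosine_similarity(v, label_vec)) for w, v in candidate_vecs.items()]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda ws: -ws[1])


def augment(tokens, similar_words):
    """Create one new sentence for every (keyword, similar word) substitution."""
    augmented = []
    for i, tok in enumerate(tokens):
        for alt in similar_words.get(tok, []):
            augmented.append(tokens[:i] + [alt] + tokens[i + 1:])
    return augmented
```

For example, `augment(["勞保", "給付"], {"勞保": ["勞工保險"]})` produces one additional training sentence in which the keyword is replaced by its similar word while the rest of the sentence is unchanged.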

The classification model 109 is trained on the enhanced training data set 106 using the neural network architecture 108 with softmax as the activation function, and the current classification model is evaluated on the verification data set 107 and corrected accordingly. The output of softmax in Formula 3 represents the relative probabilities of the different category labels.

Formula 3:

$$P(y = j \mid x) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $j$ denotes a category label and $P(y = j \mid x)$ denotes the probability that data $x$ has category label $j$, with $z_j$ the network output for label $j$ before the activation and $K$ the number of category labels. In addition, to prevent overfitting, regularization is added to model training: a term describing the complexity of the model is added to the loss function, as in Formula 4.

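Formula 4:

$$\tilde{L}(w) = L(w) + \lambda\,\Omega(w)$$

where $w$ denotes the weights of the model and $\tilde{L}(w)$ is the loss function after regularization. Here $L(w)$ is the unregularized loss and $\Omega(w)$, scaled by the coefficient $\lambda$, is the regularization term describing model complexity; the general form is written because the specific norm cannot be recovered from the text.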
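A small numerical sketch may help relate the softmax output to the regularized loss. It assumes a cross-entropy base loss and an L2 penalty on the weights; both are illustrative assumptions, since the text does not preserve which loss or which norm the embodiment uses. The function names and the coefficient `lam` are likewise hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_loss(logits, true_index, weights, lam=0.01):
    """Cross-entropy loss plus an L2 penalty (assumed norm) on the weights."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[true_index])
    l2_penalty = sum(w * w for w in weights)
    return cross_entropy + lam * l2_penalty


probs = softmax([2.0, 1.0, 0.1])
loss = regularized_loss([2.0, 1.0, 0.1], true_index=0, weights=[0.5, -0.3])
```

Larger weights increase the penalty term, so minimizing the regularized loss trades training accuracy against model complexity, which is the stated purpose of the regularization.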

After the user's voice input is converted to text, yielding the question “勞保老年給付年齡的計算說明” ("explanation of how the age for labor insurance old-age benefits is calculated"), the question is input into the classification model 109 to predict its category label, and the predicted category label is “勞工保險” ("labor insurance"). In addition, keyword extraction is performed on the question using the aforementioned category keyword list 111, yielding the keywords “勞保” ("labor insurance", abbreviated) and “老年給付” ("old-age benefit"). The category label and keyword combination 112 is then submitted to the reply retrieval module 1204.
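The keyword extraction step in this example can be sketched as a simple lookup against the category keyword list. The substring matching, the dictionary layout of the keyword list, and the function names are assumptions made for illustration; the predicted label is passed in rather than computed, because the classification model itself is not reproduced here.

```python
def extract_keywords(question, keywords):
    """Return the listed keywords found in the question, in order of appearance."""
    hits = [(question.find(k), k) for k in keywords if k in question]
    return [k for _, k in sorted(hits)]


def build_query(question, predicted_label, category_keyword_list):
    """Combine the predicted category label with the extracted keywords."""
    keywords = extract_keywords(question, category_keyword_list.get(predicted_label, []))
    return {"label": predicted_label, "keywords": keywords}


# Hypothetical category keyword list for one category label.
category_keyword_list = {"勞工保險": ["老年給付", "勞保", "退休金"]}
query = build_query("勞保老年給付年齡的計算說明", "勞工保險", category_keyword_list)
```

The resulting dictionary plays the role of the category label and keyword combination that is handed to the retrieval stage.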

The conversational semantic analysis module 113 uses the category label and keyword combination 112 to find closely matching results in the database 1205 (in this embodiment, the question-and-answer knowledge base 115), based on a similarity calculation using minimum edit distance together with other indicators. The minimum edit distance follows Levenshtein's definition, under which one string is edited into another using three operations: insertion, deletion, and substitution of a single character.

Starting from the initial state, the minimum edit distance is computed in order for prefixes of increasing length of the two strings.
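The prefix-by-prefix computation of the Levenshtein distance can be sketched with the usual dynamic-programming recurrence. Unit cost for each of the three operations is assumed here, and the function name is chosen for the example.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn source into target (unit cost per operation)."""
    # prev[j] holds the distance between the current prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))  # initial state: empty source prefix
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # turning i characters into an empty string takes i deletions
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


distance = levenshtein("勞保", "勞工保險")  # two insertions
```

A smaller distance means that the extracted keywords are closer to an entry in the knowledge base, so the distance can serve as one input to the similarity score.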

Finally, multiple indicators, such as the keywords in the dialogue state tracking module 114, the dialogue time, and the number of clicks, are combined to rank the results and thereby improve the accuracy of the recommendation, and the best recommended data is recommended to the user through the Top-N algorithm.
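One way to combine several indicators into a Top-N ranking is a weighted score, sketched below. The weighted sum, the weights, the candidate field names, and the assumption that each indicator is already normalized to the range 0 to 1 are all hypothetical; the text does not state how the indicators are combined.

```python
import heapq


def top_n(candidates, n=3, weights=(0.6, 0.2, 0.2)):
    """Return the n candidates with the highest combined score.

    Each candidate is a dict with hypothetical fields 'similarity',
    'recency', and 'clicks', each assumed to be normalized to [0, 1].
    """
    w_sim, w_time, w_clicks = weights

    def score(c):
        return w_sim * c["similarity"] + w_time * c["recency"] + w_clicks * c["clicks"]

    return heapq.nlargest(n, candidates, key=score)


candidates = [
    {"id": "a", "similarity": 0.9, "recency": 0.5, "clicks": 0.5},
    {"id": "b", "similarity": 0.4, "recency": 1.0, "clicks": 1.0},
    {"id": "c", "similarity": 0.2, "recency": 0.1, "clicks": 0.1},
]
best = top_n(candidates, n=2)
```

With these example weights, candidate "a" ranks first because textual similarity dominates the score, while recency and click count break closer calls.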

Although the present disclosure has been described above by way of embodiments, they are not intended to limit the present disclosure. Anyone with ordinary knowledge in the technical field may make minor changes and modifications without departing from the spirit and scope of the present disclosure. The scope of protection of the present disclosure is therefore defined by the appended claims.

10: System for recommending questions and answers based on data enhancement

110: Transceiver

120: Storage medium

130: Processor

1201: Pre-training model selection module

1202: Data enhancement module

1203: Intent recognition module

1204: Reply retrieval module

1205: Database

101: Original training data set

102: Domain classifier

103: General domain training subset

104: Domain-specific training subset

105: Similar words

106: Enhanced training data set

107: Verification data set

108: Neural network architecture

109: Classification model

110: Data

111: Category keyword list

112: Category label and keyword combination

113: Conversational semantic analysis module

114: Dialogue state tracking module

115: Question-and-answer knowledge base

S601, S602, S603, S604: Steps

FIG. 1 is a schematic diagram of a system for recommending questions and answers based on data enhancement according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of classification using the pre-training model selection module according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of using the data enhancement module according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of using the intent recognition module according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of using the reply retrieval module according to an embodiment of the present invention.

FIG. 6 is a flow chart of a method for recommending questions and answers based on data enhancement according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of candidate keywords according to an embodiment of the present invention.

FIG. 8 is a schematic diagram of category keywords according to an embodiment of the present invention.

FIG. 9 is a schematic diagram of an enhanced training data set according to an embodiment of the present invention.


Claims

A system for recommending questions and answers based on data enhancement, comprising: a transceiver; a storage medium storing a plurality of modules and a database; and a processor coupled to the storage medium and the transceiver, and accessing and executing the modules, wherein the modules comprise: a pre-training model selection module, which classifies training data into domain-specific data and general domain data, and selects a domain-specific pre-training model and a general domain pre-training model, so as to generate a domain-specific training subset and a general domain training subset corresponding to the domain-specific data and the general domain data, respectively; a data enhancement module, electrically connected to the pre-training model selection module, which extracts candidate keywords from the domain-specific training subset and the general domain training subset respectively, generates a category keyword list, and performs data enhancement on the category keyword list with the domain-specific pre-training model and the general domain pre-training model to generate an enhanced training data set; an intent recognition module, electrically connected to the data enhancement module, which receives data input by a user, recognizes and outputs a category label of the data based on a classification model trained on the enhanced training data set, and extracts and outputs a keyword combination of the data based on the category keyword list and the data; and a reply retrieval module, electrically connected to the intent recognition module, which retrieves best recommended data from the database based on the category label and the keyword combination and recommends it to the user.

The system for recommending questions and answers based on data enhancement as described in claim 1, wherein the pre-training model selection module is further used to classify the training data into the domain-specific data and the general domain data according to a probability threshold.

The system for recommending questions and answers based on data enhancement as described in claim 1, wherein the domain-specific training subset consists of training data related to a specific domain, and the general domain training subset consists of training data not related to a specific domain.

The system for recommending questions and answers based on data enhancement as described in claim 1, wherein the data enhancement module is further used to calculate a category vector for each category label in the domain-specific training subset and the general domain training subset, calculate the similarity between the candidate keywords and the category vector of each category label to generate the category keyword list, generate similar words based on the category keyword list, the domain-specific pre-training model, and the general domain pre-training model, and expand the training data set with the similar words to generate the enhanced training data set.

The system for recommending questions and answers based on data enhancement as described in claim 1, wherein the data enhancement module is further used to filter stop words from the domain-specific training subset and the general domain training subset, perform word segmentation and output word segmentation result information, and extract the candidate keywords from the word segmentation result information based on a statistical method.

The system for recommending questions and answers based on data enhancement as described in claim 1, wherein the intent recognition module is further used to train the classification model based on the enhanced training data set, a verification data set, and a neural network architecture, so as to perform category label prediction on the data obtained by converting the user's dialogue input from voice to text, thereby obtaining the category label of the data.

The system for recommending questions and answers based on data enhancement as described in claim 6, wherein the reply retrieval module includes a conversational semantic analysis module and a dialogue state tracking module.

The system for recommending questions and answers based on data enhancement as described in claim 7, wherein the reply retrieval module is further used to store the user's dialogue in the dialogue state tracking module through the conversational semantic analysis module in combination with the category label of the data and the keyword combination, calculate the similarity between the data and the recommended data stored in the database, and recommend the best recommended data to the user through the conversational semantic analysis module.

The system for recommending questions and answers based on data enhancement as described in claim 8, wherein the reply retrieval module is further used to sort, through the conversational semantic analysis module, the recommended data stored in the database according to semantic strength, dialogue time, and number of clicks using a Top-N recommendation algorithm, so as to recommend the best recommended data to the user.

A method for recommending questions and answers based on data enhancement, comprising: classifying training data into domain-specific data and general domain data, and selecting a domain-specific pre-training model and a general domain pre-training model, so as to generate a domain-specific training subset and a general domain training subset corresponding to the domain-specific data and the general domain data, respectively; extracting candidate keywords from the domain-specific training subset and the general domain training subset respectively, generating a category keyword list, and performing data enhancement on the category keyword list with the domain-specific pre-training model and the general domain pre-training model to generate an enhanced training data set; receiving data input by a user, recognizing and outputting a category label of the data based on a classification model trained on the enhanced training data set, and extracting and outputting a keyword combination of the data based on the category keyword list and the data; and retrieving best recommended data from a database based on the category label and the keyword combination and recommending it to the user.