TW202242848A

TW202242848A - Knowledge entity identification method and knowledge entity identification device

Info

Publication number: TW202242848A
Application number: TW110113786A
Authority: TW
Inventors: 曾俋穎; 邱德旺
Original assignee: 台達電子工業股份有限公司
Priority date: 2021-04-16
Filing date: 2021-04-16
Publication date: 2022-11-01
Also published as: TWI777496B

Abstract

A knowledge entity identification method includes the steps of: receiving an object text to be analyzed and metadata, wherein the object text includes a candidate word; comparing the candidate word with words in a knowledge database to obtain a plurality of entity names associated with the candidate word, wherein each entity name has corresponding entity description data; comparing the entity description data in the knowledge database with the metadata to obtain a comparison result; and, according to the comparison result, setting the entity name associated with the candidate word as an output classification of the candidate word in the object text.

Description

Knowledge entity recognition method and knowledge entity recognition device

The present disclosure relates to an electronic device and a method thereof, and in particular to a knowledge entity recognition device and method.

In traditional knowledge management, experts manually label every document one by one. As technology has advanced, current data labeling methods can analyze grammar and semantics through natural language techniques. However, such corpus analysis does not enable a machine to understand new words, which still have to be labeled by experts. The existing process of training with labeled data is lengthy and inflexible, and it is difficult to train a knowledge management system for a new field from the data of an already established one. Building knowledge management systems for different fields therefore incurs a fairly high training cost.

In view of this, a knowledge management system is a tool with considerable management power, but an efficient way of building one is still lacking, and the accuracy of such systems still has room for improvement. Accordingly, how to provide an efficient system construction method and highly accurate knowledge management is a technical problem that those of ordinary skill in the art seek to solve.

According to one embodiment of the present disclosure, a knowledge entity recognition method is disclosed, which includes the following steps: receiving a target text to be parsed and metadata, wherein the target text includes a candidate word; comparing the candidate word against a knowledge base to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each of the entity names has corresponding entity description data; comparing the entity description data in the knowledge base with the metadata to obtain a comparison result; and, according to the comparison result, setting the entity name associated with the candidate word in the knowledge base as the output classification of the candidate word in the target text.

According to another embodiment, a knowledge entity recognition device is disclosed, which includes a knowledge entity candidate generation module, a knowledge entity verification and enhancement module, and a knowledge entity classification module. The knowledge entity candidate generation module is configured to receive a target text to be parsed and metadata, and to compare a candidate word of the target text against a knowledge base to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each of the entity names has corresponding entity description data. The knowledge entity verification and enhancement module is coupled to the knowledge entity candidate generation module and is configured to compare the entity description data in the knowledge base with the metadata to obtain a comparison result. The knowledge entity classification module is coupled to the knowledge entity verification and enhancement module and is configured to set, according to the comparison result, the entity name associated with the candidate word in the knowledge base as the output classification of the candidate word in the target text.

The following disclosure provides many different embodiments for implementing different features of the present disclosure. Embodiments of components and arrangements are described below to simplify the disclosure. These embodiments are exemplary only and are not intended to be limiting. For example, terms such as "first" and "second" are used in this disclosure only to distinguish identical or similar components or operations. They are not intended to limit the technical components of the disclosure, nor to limit the order or sequence of operations.

Please refer to FIG. 1, which shows a block diagram of a knowledge entity recognition device 100 according to an embodiment of the present disclosure. The knowledge entity recognition device 100 identifies target objects in input data 102 and provides output data 104 after recognition. For example, the knowledge entity recognition device 100 parses input text, sentences, paragraphs, or similar data to perform named entity recognition (NER). In one embodiment, the input data 102 received by the knowledge entity recognition device 100 includes a target text and metadata (domain metadata). The target text is the data to be parsed. The metadata is data used to assist in parsing the target text, and may be categories and their keywords designed by the user in advance.

In one embodiment, the knowledge entity recognition device 100 is coupled to a knowledge base 500. The knowledge base 500 is coupled to an external general knowledge base 600. The external general knowledge base 600 is, for example, a database with its own format and domain content, such as Wikipedia, a specialized dictionary, or domain expert knowledge. The knowledge base 500 may store internally defined knowledge and/or knowledge data built from the data of the external general knowledge base 600. For example, the knowledge base 500 is provided with a parsing and storage module 502. The parsing and storage module 502 can read data from the external general knowledge base 600 and convert the external data into a data structure of a specific format, for example by normalizing the external data and the domain expert knowledge data, so that the data stored in the knowledge base 500 can be used by the knowledge entity recognition device 100 when recognizing the target text.

In one embodiment, the knowledge entity recognition device 100 includes a knowledge entity candidate generation module 112, a knowledge entity verification and enhancement module 114, and a knowledge entity classification module 116. The knowledge entity candidate generation module 112 is electrically coupled to the knowledge entity verification and enhancement module 114. The knowledge entity verification and enhancement module 114 is electrically coupled to the knowledge entity classification module 116. For ease of understanding, please refer to FIG. 1 and FIG. 2 together for the following description. FIG. 2 shows a flowchart of a knowledge entity recognition method 200 according to an embodiment of the present disclosure. The knowledge entity recognition method 200 can be executed by the knowledge entity recognition device 100 of FIG. 1.

In step S210, the knowledge entity candidate generation module 112 receives the target text to be parsed and the metadata.

In one embodiment, the target text to be parsed is the text data to be analyzed, in the form of one or more sentences or paragraphs. The metadata includes a plurality of categories (keys), and each category includes a plurality of keywords (values). The user can define all categories of the metadata and the keywords of each category in advance, and input them into the knowledge entity recognition device 100 together with the target text. For ease of explanation, the following description uses the sentence "An apple a day keeps the doctor away." as the target text and the content of Table 1 as the metadata. It should be noted that the present disclosure is not limited to this example.

Table 1: Metadata
Category (key) | Keywords (value)
FRUIT | fruit, juicy, tree, …
MEAT | animal, hunt, …
DESSERT | sugar, sweet, …

In one embodiment, the knowledge entity candidate generation module 112 applies natural language processing to extract the nouns or noun phrases of the target text. The extracted nouns or noun phrases serve as the candidate words of the target text. Continuing the example target text "An apple a day keeps the doctor away.", the candidate words extracted from the target text are "apple", "day", and "doctor". The number of candidate words varies with the content of the target text. In one embodiment, the target text includes one or more candidate words. In this example, there are three candidate words.
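The candidate extraction described above can be sketched as follows. The disclosure does not name a specific tagger, so this minimal sketch assumes a tiny hand-labeled part-of-speech table (`TOY_POS_TAGS`, hypothetical) in place of a real natural language processing library; the function name `extract_candidate_words` is likewise an illustrative assumption.

```python
import re

# Hypothetical hand-labeled tags standing in for a real part-of-speech tagger.
# A practical system would obtain these tags from an NLP library instead.
TOY_POS_TAGS = {
    "an": "DET", "apple": "NOUN", "a": "DET", "day": "NOUN",
    "keeps": "VERB", "the": "DET", "doctor": "NOUN", "away": "ADV",
}


def extract_candidate_words(text):
    """Return the nouns found in `text`, in order of first appearance."""
    candidates = []
    for token in re.findall(r"[a-z]+", text.lower()):
        # Keep only tokens tagged as nouns, skipping repeats.
        if TOY_POS_TAGS.get(token) == "NOUN" and token not in candidates:
            candidates.append(token)
    return candidates


candidates = extract_candidate_words("An apple a day keeps the doctor away.")
print(candidates)
```

Only single-word nouns are handled here; noun phrases would need chunking on top of the tags.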

In step S220, the knowledge entity candidate generation module 112 compares the candidate words of the target text against the knowledge base 500 to obtain, from the knowledge base 500, a plurality of entity names associated with the candidate words.

In one embodiment, the knowledge entity verification and enhancement module 114 compares the candidate words against the knowledge base 500 one by one. The knowledge base 500 records a plurality of entity data. The data structure of each entity data includes, but is not limited to, a number, an entity name, an entity description, and an entity type, as shown in Table 2.

Table 2: Knowledge base
No. | Entity name | Entity description | Entity type
0 | Apple Inc. | Apple Inc. is an American multinational technology company headquartered in Cupertino, California. | Mobile phone manufacturers; Companies in the NASDAQ-100 Index.
1 | Apple | An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. | Fruits; Malus; Plants.
2 | Pineapple | The pineapple (Ananas comosus) is a tropical plant with an edible fruit and the most economically significant plant in the family Bromeliaceae. | Fruits; Ananas; Plants.
3 | Apple, Oklahoma | Apple, Oklahoma is an unincorporated community located near Hugo Lake and State Highway 93 in Choctaw County, Oklahoma. | Oklahoma geography stubs.
… | | |
N | Orange | The orange is the fruit of various citrus species in the family Rutaceae (see list of plants known as orange) | Fruits; Tropical agriculture.

Continuing the example, when the candidate word is "apple", the knowledge entity verification and enhancement module 114 compares the candidate word "apple" against the knowledge base 500 of Table 2 and obtains a plurality of entity names associated with "apple", namely "Apple Inc." (number 0), "Apple" (number 1), "Pineapple" (number 2), and "Apple, Oklahoma" (number 3). In one embodiment, the entity names numbered 0 to 3 are recorded in a candidate list for the candidate word "apple". On the other hand, because the entity name "Orange" (number N) is neither identical nor similar to the candidate word "apple", it is not recorded in the candidate list for "apple".
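The lookup that produces the candidate list can be sketched as below. This is a minimal illustration that assumes a case-insensitive substring test as the "identical or similar" criterion, which the disclosure does not prescribe; `KB_NAMES` and `find_candidate_entities` are hypothetical names, and `None` stands in for the unspecified number N.

```python
# Entity names from Table 2. None stands in for the unspecified number N.
KB_NAMES = [
    (0, "Apple Inc."),
    (1, "Apple"),
    (2, "Pineapple"),
    (3, "Apple, Oklahoma"),
    (None, "Orange"),
]


def find_candidate_entities(candidate_word, kb_names):
    """Return (number, name) pairs whose name contains the candidate word.

    Assumption: a case-insensitive substring test approximates the
    "identical or similar" comparison described in the text.
    """
    needle = candidate_word.lower()
    return [(num, name) for num, name in kb_names if needle in name.lower()]


candidate_list = find_candidate_entities("apple", KB_NAMES)
for number, name in candidate_list:
    print(number, name)
```

With this criterion "Pineapple" is kept because it contains "apple", while "Orange" is excluded, which agrees with the candidate list in the example.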

In one embodiment, the information retrieval method used to look up and compare the candidate words in the knowledge base 500 may be the term frequency-inverse document frequency (tf-idf) method or another data mining or term frequency statistics method. The present disclosure is not limited in this respect.

In step S230, the knowledge entity verification and enhancement module 114 compares the entity description data in the knowledge base 500 with the metadata to obtain a comparison result.

In one embodiment, words are searched for in the entity description data to obtain additional descriptive information, which serves as enhancement information for the candidate words and is subsequently used by the knowledge entity classification module 116.

Continuing the example, the candidate list of the candidate word "apple" records four entity names: "Apple Inc.", "Apple", "Pineapple", and "Apple, Oklahoma". The entity description data corresponding to each entity name in the candidate list is then searched, one by one, using the metadata received in step S210. Take the metadata category "FRUIT" and its keywords "fruit, juicy, tree" as an example (see Table 1). The knowledge entity verification and enhancement module 114 searches for the keyword "fruit" in the entity description data corresponding to the entity name "Apple", "An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus.", determines whether any word matches "fruit", and increments a count by one for each matching word. In this embodiment, the category "FRUIT" has three keywords, so the same search and matching is performed for each of the three keywords, and the matches are summed to give the total for the category.
For example, the keyword "fruit" of the category "FRUIT" yields 1 match in the entity description data corresponding to the entity name "Apple"; the keyword "juicy" yields 0 matches; and the keyword "tree" yields 2 matches. Therefore, the total number of matches of the category "FRUIT" for the entity name "Apple" is 3.

Likewise, the keywords "animal, hunt" of the category "MEAT" yield a total of 0 matches in the entity description data corresponding to the entity name "Apple", and the keywords "sugar, sweet" of the category "DESSERT" also yield a total of 0 matches. Thus, among the three categories of the metadata input in step S210, the category "FRUIT" has the largest total number of matches. The metadata category "FRUIT" is therefore the comparison result for the target text. At the same time, the entity name "Apple", which is most strongly associated with the category "FRUIT", is set as the most relevant entity name.
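The keyword counting in step S230 can be reproduced with a short sketch. It assumes that a "match" is a case-insensitive substring occurrence, so that "trees" counts as a match for "tree"; the disclosure does not fix the matching rule, and the names `METADATA`, `APPLE_DESCRIPTION`, and `category_match_totals` are illustrative.

```python
# Metadata from Table 1 (the elided "…" keywords are left out).
METADATA = {
    "FRUIT": ["fruit", "juicy", "tree"],
    "MEAT": ["animal", "hunt"],
    "DESSERT": ["sugar", "sweet"],
}

# Entity description of "Apple" (number 1 in Table 2).
APPLE_DESCRIPTION = (
    "An apple is an edible fruit produced by an apple tree "
    "(Malus domestica). Apple trees are cultivated worldwide and are the "
    "most widely grown species in the genus Malus."
)


def category_match_totals(description, metadata):
    """Sum, per category, the occurrences of each keyword in the description.

    Assumption: matching is a case-insensitive substring count.
    """
    text = description.lower()
    return {
        category: sum(text.count(keyword.lower()) for keyword in keywords)
        for category, keywords in metadata.items()
    }


totals = category_match_totals(APPLE_DESCRIPTION, METADATA)
best_category = max(totals, key=totals.get)
print(totals, best_category)
```

Under this matching rule the totals are 3 for FRUIT and 0 for MEAT and DESSERT, the same counts as in the worked example.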

In one embodiment, the comparison between the metadata and each entity description data may use a similarity measure such as cosine similarity. The metadata is used to search the entity description data in the knowledge base 500, and the similarity measure is used to select the entity name closest to the metadata.
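One way to realize the cosine similarity variant is sketched below, assuming plain bag-of-words count vectors over lowercase alphabetic tokens. The vectorization is an assumption (the disclosure only names the similarity measure), and `bag_of_words`, `cosine_similarity`, and `DESCRIPTIONS` are hypothetical names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens (an assumed, simple vectorization)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Entity descriptions of the four candidates from Table 2.
DESCRIPTIONS = {
    "Apple Inc.": "Apple Inc. is an American multinational technology "
                  "company headquartered in Cupertino, California.",
    "Apple": "An apple is an edible fruit produced by an apple tree "
             "(Malus domestica). Apple trees are cultivated worldwide and "
             "are the most widely grown species in the genus Malus.",
    "Pineapple": "The pineapple (Ananas comosus) is a tropical plant with "
                 "an edible fruit and the most economically significant "
                 "plant in the family Bromeliaceae.",
    "Apple, Oklahoma": "Apple, Oklahoma is an unincorporated community "
                       "located near Hugo Lake and State Highway 93 in "
                       "Choctaw County, Oklahoma.",
}

# Keywords of the category "FRUIT" from Table 1, used as the query vector.
query = bag_of_words("fruit juicy tree")
scores = {
    name: cosine_similarity(query, bag_of_words(description))
    for name, description in DESCRIPTIONS.items()
}
closest_entity = max(scores, key=scores.get)
print(closest_entity)
```

Because tokens are compared exactly, "trees" does not match "tree" here; a stemmer or tf-idf weighting could be substituted without changing the overall structure.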

In step S240, the knowledge entity classification module 116 sets, according to the comparison result, the entity name associated with the candidate word in the knowledge base 500 as the output classification of the candidate word in the target text.

Continuing the example, the comparison result for the category of the candidate word in the target text is "FRUIT". The knowledge entity classification module 116 then compares this comparison result "FRUIT" with the entity type ("Fruits; Malus; Plants") corresponding to the most relevant entity name ("Apple") in the knowledge base 500. Because the word "Fruits", which matches the comparison result "FRUIT", can be found in the entity type, the comparison result "FRUIT" is verified as the output classification of the candidate word in the target text.
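The verification against the entity type can be sketched as follows. The text does not say how "FRUIT" and "Fruits" are judged to match, so this sketch assumes lowercasing plus removal of a trailing "s"; `normalize` and `verify_output_category` are hypothetical helpers.

```python
def normalize(label):
    """Lowercase a label and drop a trailing plural "s" (a crude heuristic)."""
    word = label.strip().strip(".").lower()
    return word[:-1] if word.endswith("s") else word


def verify_output_category(category, entity_types):
    """Check whether the category matches any ";"-separated entity type."""
    normalized_types = [normalize(t) for t in entity_types.split(";")]
    return normalize(category) in normalized_types


# Entity type of "Apple" (number 1) and of "Apple Inc." (number 0) in Table 2.
verified = verify_output_category("FRUIT", "Fruits; Malus; Plants.")
rejected = verify_output_category(
    "FRUIT", "Mobile phone manufacturers; Companies in the NASDAQ-100 Index."
)
print(verified, rejected)
```

A real system would likely use lemmatization or a synonym table rather than this plural-stripping rule.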

In one embodiment, the data enhancement result of the candidate word obtained in step S230, together with the categories and keywords defined by the user in advance in the metadata, can be input into a text classification model (not shown in FIG. 1) for classification. This determines whether the candidate word is a knowledge entity of the target text and classifies the knowledge entity into the corresponding category, yielding the final knowledge entity and its category.

Please refer to FIG. 1 and FIG. 3 together. FIG. 3 shows a flowchart of a knowledge entity recognition method 300 according to an embodiment of the present disclosure. The knowledge entity recognition method 300 can be executed by the knowledge entity recognition device 100 of FIG. 1.

In step S310, the knowledge entity verification and enhancement module 114 compares the candidate words of the target text against the knowledge base 500 and obtains a plurality of entity names sorted by degree of similarity.

In one embodiment, the candidate words are extracted from the target text by the knowledge entity candidate generation module 112 using natural language processing. Continuing the example target text "An apple a day keeps the doctor away.", the candidate word "apple" is compared with all entity names in the knowledge base 500 of Table 2. The entity name in the knowledge base 500 that is most similar to the candidate word "apple" is ranked highest. Sorting the entity names from high to low by similarity gives the sorted entity names shown in Table 3: the entity name numbered 1 is ranked first, the entity name numbered 0 is ranked second, and so on. After the similarity comparison, four sorted entity names are selected from the knowledge base 500, namely those that are identical or similar to the candidate word.
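The similarity ranking of step S310 can be sketched as below. The disclosure does not specify the similarity measure; this sketch assumes Jaccard similarity over lowercase word tokens, with ties kept in knowledge base order by the stable sort. `tokens`, `jaccard`, and `CANDIDATE_ENTITIES` are illustrative names.

```python
import re


def tokens(text):
    """Set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# The four candidate entity names for "apple", in knowledge base order.
CANDIDATE_ENTITIES = [
    (0, "Apple Inc."),
    (1, "Apple"),
    (2, "Pineapple"),
    (3, "Apple, Oklahoma"),
]

# sorted() is stable, so entities with equal scores keep their original order.
ranked = sorted(
    CANDIDATE_ENTITIES,
    key=lambda entity: jaccard("apple", entity[1]),
    reverse=True,
)
for number, name in ranked:
    print(number, name)
```

With this particular measure the resulting order (1, 0, 3, 2) happens to agree with Table 3; other measures, such as character-level edit similarity, may order the lower ranks differently.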

Table 3: Sorted entity names
No. | Entity name | Entity description data | Entity type
1 | Apple | An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. | Fruits; Malus; Plants.
0 | Apple Inc. | Apple Inc. is an American multinational technology company headquartered in Cupertino, California. | Mobile phone manufacturers; Companies in the NASDAQ-100 Index.
3 | Apple, Oklahoma | Apple, Oklahoma is an unincorporated community located near Hugo Lake and State Highway 93 in Choctaw County, Oklahoma. | Oklahoma geography stubs.
2 | Pineapple | The pineapple (Ananas comosus) is a tropical plant with an edible fruit and the most economically significant plant in the family Bromeliaceae. | Fruits; Ananas; Plants.

In step S320, the knowledge entity verification and enhancement module 114 compares the keywords of each category of the metadata with the words in the entity description data corresponding to the sorted entity names to obtain a comparison result. In some embodiments, this comparison counts the words in the entity description data that are identical or similar to the keywords of each category, so that each category has a corresponding number of word matches.

In one embodiment, the metadata includes a plurality of categories, and each category includes a plurality of keywords. For example, the metadata includes a first category "FRUIT", a second category "MEAT", and a third category "DESSERT". The first category "FRUIT" includes the keywords "fruit", "juicy", and "tree". The second category "MEAT" includes the keywords "animal" and "hunt". The third category "DESSERT" includes the keywords "sugar" and "sweet".

In one embodiment, the keyword "fruit" is compared against the first sorted entity description data, "An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus.", and 1 matching word is obtained. Similarly, the keywords "juicy" and "tree" are each compared against the first entity description data, and 0 and 2 matching words are obtained, respectively. In other words, the total number of matching words of the first category "FRUIT" for the first entity name is 3. Likewise, the total number of matching words of the second category "MEAT" for the first entity name is 0. The totals of matching words between the keywords of each category and the entity description data of the first entity name "Apple" are shown in Table 4.

Table 4
Metadata category | Number of word matches in the first entity description data
First category "FRUIT" | 3
Second category "MEAT" | 0
Third category "DESSERT" | 0

In step S330, the knowledge entity verification and enhancement module 114 sets the category with the largest number of word matches as the output classification of the candidate word in the target text.

Continuing the example, the first category has the largest number of word matches (namely 3), so the first category "FRUIT" is set as the output classification of the candidate word in the target text.

It is worth noting that steps S320 and S330 are likewise performed for the other sorted entity names: the first, second, and third categories of the metadata are used to calculate the total number of word matches for the second, the third, and the fourth sorted entity names. In other words, every category of the metadata is matched against each sorted entity name to obtain, for each entity name, the total number of word matches of every category. For brevity, the matching steps are not repeated here.
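Applying the same counting to every sorted entity name can be sketched as below. As before, a match is assumed to be a case-insensitive substring occurrence, which the disclosure does not mandate; `RANKED_DESCRIPTIONS` and `category_totals` are hypothetical names, and the descriptions are taken from Table 3.

```python
# Metadata from Table 1 (the elided "…" keywords are left out).
METADATA = {
    "FRUIT": ["fruit", "juicy", "tree"],
    "MEAT": ["animal", "hunt"],
    "DESSERT": ["sugar", "sweet"],
}

# Entity names and descriptions in the sorted order of Table 3.
RANKED_DESCRIPTIONS = [
    ("Apple", "An apple is an edible fruit produced by an apple tree "
              "(Malus domestica). Apple trees are cultivated worldwide and "
              "are the most widely grown species in the genus Malus."),
    ("Apple Inc.", "Apple Inc. is an American multinational technology "
                   "company headquartered in Cupertino, California."),
    ("Apple, Oklahoma", "Apple, Oklahoma is an unincorporated community "
                        "located near Hugo Lake and State Highway 93 in "
                        "Choctaw County, Oklahoma."),
    ("Pineapple", "The pineapple (Ananas comosus) is a tropical plant with "
                  "an edible fruit and the most economically significant "
                  "plant in the family Bromeliaceae."),
]


def category_totals(description, metadata):
    """Per-category sum of case-insensitive substring matches (assumed rule)."""
    text = description.lower()
    return {
        category: sum(text.count(keyword) for keyword in keywords)
        for category, keywords in metadata.items()
    }


totals_by_entity = {
    name: category_totals(description, METADATA)
    for name, description in RANKED_DESCRIPTIONS
}
for name, totals in totals_by_entity.items():
    print(name, totals)
```

In this sketch only "Apple" and "Pineapple" obtain non-zero totals, both under FRUIT, and "Apple" has the larger one.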

In step S340, the knowledge entity verification and enhancement module 114 compares the output classification with the entity type corresponding to the sorted entity name in the knowledge base 500 to verify whether the output classification of the candidate word in the target text is correct.

Continuing the example, the first category "FRUIT" has the largest number of word matches, so the output classification of the candidate word in the target text is set to "FRUIT". In step S340, to verify whether this output classification is correct, the output classification "FRUIT" is further compared with the first entity type. As shown in Table 3, the first entity type includes "Fruits", "Malus", and "Plants". Because "Fruits" in the first entity type matches the output classification "FRUIT", the output classification is verified as correct.

In one embodiment, the knowledge entity recognition device 100 can be implemented as, but is not limited to, a portable electronic device, a mobile phone, a tablet computer, a personal digital assistant (PDA), a wearable device, or a notebook computer.

In one embodiment, the knowledge entity recognition device 100 includes at least a processor (not shown in FIG. 1), a storage medium (not shown in FIG. 1), and an input/output interface (not shown in FIG. 1). The processor operates and controls the knowledge entity candidate generation module 112, the knowledge entity verification and enhancement module 114, and the knowledge entity classification module 116. The storage medium stores a plurality of program instructions and the temporary data produced while the instructions are executed. The input/output interface is coupled to the processor and is used to receive the input data 102 and to send the output data 104.

The processor may be implemented as, but is not limited to, a central processing unit (CPU), a system on chip (SoC), an application processor, an audio processor, a digital signal processor (DSP), or a processing chip or controller for a specific function.

The storage medium may be implemented as, but is not limited to, random access memory (RAM) or non-volatile memory, such as flash memory, read-only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), or optical storage.

In one embodiment, the text classification model may be an artificial intelligence model built on multiple sub-algorithms, including an artificial neural network (ANN) and supervised learning in machine learning, where supervised learning includes algorithms such as support vector machines (SVM), regression analysis, and statistical classification.

In one embodiment, the present application provides a non-transitory computer-readable recording medium capable of storing a plurality of program codes. After the program codes are loaded into the processor of the knowledge entity recognition device 100 shown in FIG. 1, the processor executes the program codes and performs the steps shown in FIG. 2 and FIG. 3.

Compared with the prior art, the knowledge entity recognition method and device of the present application can identify more knowledge entities given the same number of knowledge entities to be analyzed, achieving a high recall rate. In addition, given the same number of identified knowledge entities, the present application obtains more correct knowledge entities, achieving high precision.
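For reference, recall and precision as used above can be computed as in the following minimal Python sketch. It uses the standard definitions (recall is the share of entities to be analyzed that are correctly identified, and precision is the share of identified entities that are correct). The function name and the sample (word, classification) pairs are hypothetical and are not results from the present application.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for sets of (word, classification) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # entities identified with the right classification
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ground truth: four knowledge entities to be analyzed.
gold = {
    ("apple", "FRUIT"),
    ("banana", "FRUIT"),
    ("acme", "COMPANY"),
    ("taipei", "CITY"),
}
# Hypothetical system output: three entities identified, two of them correct.
predicted = {
    ("apple", "FRUIT"),
    ("banana", "FRUIT"),
    ("taipei", "COMPANY"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

In this made-up case two of the three identified entities are correct (precision of about 0.667) and two of the four entities to be analyzed are recovered (recall of 0.5).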

In summary, the present application inputs metadata together with the target text to be annotated and, after an entity name is found in the knowledge base, further retrieves the entity description data of that entity name for verification. This improves the precision of classifying the knowledge entities identified in the target text. In addition, the knowledge entity recognition device and method of the present application can be applied to annotating a large number of documents. When the documents to be annotated belong to a different field, switching to the corresponding knowledge base is enough to switch fields. For expansion, the knowledge base is updated simply by adding new vocabulary to it. Such a method also reduces the cost of manual annotation and the burden on experts, eliminates a large amount of manual annotation work, and supports a variety of subsequent applications (for example, automatically annotating the categories and keywords of an input article).

The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present application. Those skilled in the art should appreciate that, without departing from the spirit and scope of the present application, they may readily use the foregoing as a basis for designing or modifying other variations that carry out the same purposes and/or achieve the same advantages as the embodiments introduced herein. The foregoing should be understood as examples of the present application, and the scope of protection is defined by the claims.

100: knowledge entity recognition device
102: input data
104: output data
112: knowledge entity candidate generation module
114: knowledge entity verification and enhancement module
116: knowledge entity classification module
200, 300: knowledge entity recognition method
500: knowledge base
502: parsing and storage module
600: external general knowledge base
S210~S240, S310~S340: steps

Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying drawings. It should be noted that, in accordance with standard practice, the features in the drawings are not necessarily drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion. FIG. 1 is a block diagram of a knowledge entity recognition device according to an embodiment of the present application. FIG. 2 is a flowchart of a knowledge entity recognition method according to an embodiment of the present application. FIG. 3 is a flowchart of a knowledge entity recognition method according to an embodiment of the present application.

Domestic deposit information (please note in order of depository institution, date, and number): None. Foreign deposit information (please note in order of depository country, institution, date, and number): None.


Claims

A knowledge entity recognition method, comprising: receiving a target text to be parsed and metadata, wherein the target text includes a candidate word; performing a comparison in a knowledge base using the candidate word to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each of the entity names has corresponding entity description data; comparing the entity description data in the knowledge base with the metadata to obtain a comparison result; and setting, according to the comparison result, the entity name associated with the candidate word in the knowledge base as an output classification of the candidate word in the target text.

The knowledge entity recognition method of claim 1, wherein the metadata includes a plurality of categories, and each category includes a plurality of keywords, wherein the knowledge entity recognition method comprises: comparing the keywords of each category of the metadata with the words in the entity description data corresponding to the entity name, so as to obtain the comparison result.

The knowledge entity recognition method of claim 2, further comprising: performing a comparison in the knowledge base using the candidate word of the target text, and obtaining sorted entity names based on the similarity of the comparison; comparing the keywords of each category of the metadata with the words in the entity description data corresponding to the sorted entity names, so as to count, for each category, the words in the entity description data that are the same as or similar to the keywords of that category, such that each category has a corresponding word match count; and setting the category with the largest word match count as the output classification of the candidate word in the target text.

The knowledge entity recognition method of claim 3, further comprising: comparing the output classification with an entity type corresponding to the entity name in the knowledge base to verify whether the output classification of the candidate word in the target text is correct.

The knowledge entity recognition method of claim 2, further comprising: performing a comparison in the knowledge base using the candidate word of the target text, and obtaining sorted entity names based on the similarity; comparing each keyword in the categories with the words in the entity description data of the sorted entity names to obtain a respective matching number, wherein the sum of the matching numbers of each category is the word match count of the corresponding category; and taking the category with the largest word match count as the output classification.

A knowledge entity recognition device, comprising: a knowledge entity candidate generation module configured to receive a target text to be parsed and metadata, and to perform a comparison in a knowledge base using a candidate word of the target text to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each of the entity names has corresponding entity description data; a knowledge entity verification and enhancement module coupled to the knowledge entity candidate generation module, wherein the knowledge entity verification and enhancement module is configured to compare the entity description data in the knowledge base with the metadata to obtain a comparison result; and a knowledge entity classification module coupled to the knowledge entity verification and enhancement module, wherein the knowledge entity classification module is configured to set, according to the comparison result, the entity name associated with the candidate word in the knowledge base as an output classification of the candidate word in the target text.

The knowledge entity recognition device of claim 6, wherein the metadata includes a plurality of categories, each of which includes a plurality of keywords, wherein the knowledge entity candidate generation module is further configured to perform a comparison in the knowledge base using the candidate word of the target text to obtain the entity name, and to compare the keywords of each category of the metadata with the words in the entity description data corresponding to the entity name, so as to obtain the comparison result.

The knowledge entity recognition device of claim 7, wherein the knowledge entity verification and enhancement module performs a comparison in the knowledge base using the candidate word of the target text, obtains sorted entity names based on the similarity of the comparison, and compares the keywords of each category of the metadata with the words in the entity description data corresponding to the sorted entity names, so as to count, for each category, the words in the entity description data that are the same as or similar to the keywords of that category, such that each category has a corresponding word match count, and the category with the largest word match count is set as the output classification of the candidate word in the target text.

The knowledge entity recognition device of claim 8, wherein the knowledge entity classification module is further configured to compare the output classification with an entity type corresponding to the entity name in the knowledge base to verify whether the output classification of the candidate word in the target text is correct.

The knowledge entity recognition device of claim 7, wherein the knowledge entity verification and enhancement module is further configured to: perform a comparison in the knowledge base using the candidate word of the target text, and obtain sorted entity names based on the similarity; compare each keyword in the categories with the words in the entity description data of the sorted entity names to obtain a respective matching number, wherein the sum of the matching numbers of each category is the word match count of the corresponding category; and take the category with the largest word match count as the output classification.