TW202242848A - Knowledge entity identification method and knowledge entity identification device - Google Patents
Knowledge entity identification method and knowledge entity identification device
- Publication number
- TW202242848A (application number TW110113786A)
- Authority
- TW
- Taiwan
- Prior art keywords
- entity
- knowledge
- category
- words
- candidate
- Prior art date
Landscapes
- Devices For Executing Special Programs (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Supply And Distribution Of Alternating Current (AREA)
- Machine Translation (AREA)
Abstract
Description
The present disclosure relates to an electronic device and a method thereof, and in particular to a knowledge entity recognition device and method.
In traditional knowledge management, experts manually label every document one by one. As technology has advanced, current data labeling methods can analyze syntax and semantics through natural language techniques. However, such corpus analysis does not enable a machine to understand new words, which still have to be labeled by experts. Existing training processes that rely on labeled data are lengthy and inflexible. Even with an established knowledge management system, it is difficult to train a knowledge management system for a new domain from the existing data, so building knowledge management systems for different domains incurs a considerable training cost.
In view of this, a knowledge management system is a powerful management tool, but an efficient way to build one is still lacking, and the accuracy of such systems still has room for improvement. Accordingly, how to provide an efficient system construction method and highly accurate knowledge management is a technical problem that those of ordinary skill in the art seek to solve.
According to an embodiment of the present disclosure, a knowledge entity recognition method is disclosed that includes the following steps: receiving a target text to be parsed and metadata, wherein the target text includes a candidate word; comparing the candidate word against a knowledge base to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each entity name has corresponding entity description data; comparing the entity description data in the knowledge base with the metadata to obtain a comparison result; and, according to the comparison result, setting an entity name associated with the candidate word in the knowledge base as the output category of the candidate word in the target text.
According to another embodiment, a knowledge entity recognition device is disclosed that includes a knowledge entity candidate generation module, a knowledge entity verification and enhancement module, and a knowledge entity classification module. The knowledge entity candidate generation module is configured to receive a target text to be parsed and metadata, and to compare a candidate word of the target text against a knowledge base to obtain, from the knowledge base, a plurality of entity names associated with the candidate word, wherein each entity name has corresponding entity description data. The knowledge entity verification and enhancement module is coupled to the knowledge entity candidate generation module and is configured to compare the entity description data in the knowledge base with the metadata to obtain a comparison result. The knowledge entity classification module is coupled to the knowledge entity verification and enhancement module and is configured to set, according to the comparison result, an entity name associated with the candidate word in the knowledge base as the output category of the candidate word in the target text.
The following disclosure provides many different embodiments for implementing different features of the present disclosure. Embodiments of components and arrangements are described below to simplify the disclosure. These embodiments are, of course, merely examples and are not intended to be limiting. For example, terms such as "first" and "second" are used herein only to distinguish identical or similar components or operations. They are not intended to limit the technical components of the present disclosure, nor the order or sequence of operations.
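Please refer to FIG. 1, which shows a block diagram of a knowledge entity recognition device 100 according to an embodiment of the present disclosure. The knowledge entity recognition device 100 identifies target objects in input data 102 and provides the identified output data 104. For example, the knowledge entity recognition device 100 parses input text, sentences, paragraphs, and similar data to perform named entity recognition (Named Entity Recognition). In one embodiment, the input data 102 received by the knowledge entity recognition device 100 includes a target text and metadata (Domain Metadata). The target text is the data to be parsed. The metadata is data that assists the parsing of the target text, and may be categories and their keywords designed in advance by a user.
In one embodiment, the knowledge entity recognition device 100 is coupled to a knowledge base 500. The knowledge base 500 is coupled to an external general knowledge base 600. The external general knowledge base 600 is, for example, a database such as Wikipedia, a specialized dictionary, or domain expert knowledge, with differing formats and domain content. The knowledge base 500 may store internally defined knowledge and/or knowledge data built from the data of the external general knowledge base 600. For example, the knowledge base 500 is provided with a parsing and storage module 502. The parsing and storage module 502 can read data from the external general knowledge base 600 and convert the external data into a data structure of a specific format, for example by normalizing the external data and the domain expert knowledge data, so that the data stored in the knowledge base 500 can be used by the knowledge entity recognition device 100 when recognizing the target text.
In one embodiment, the knowledge entity recognition device 100 includes a knowledge entity candidate generation module 112, a knowledge entity verification and enhancement module 114, and a knowledge entity classification module 116. The knowledge entity candidate generation module 112 is electrically coupled to the knowledge entity verification and enhancement module 114. The knowledge entity verification and enhancement module 114 is electrically coupled to the knowledge entity classification module 116. For ease of understanding, the following description refers to FIG. 1 and FIG. 2 together. FIG. 2 shows a flowchart of a knowledge entity recognition method 200 according to an embodiment of the present disclosure. The knowledge entity recognition method 200 can be performed by the knowledge entity recognition device 100 of FIG. 1.
In step S210, the knowledge entity candidate generation module 112 receives the target text to be parsed and the metadata.
In one embodiment, the target text to be parsed is the text data to be analyzed and takes the form of one or more sentences or paragraphs. The metadata includes a plurality of categories (keys), and each category includes a plurality of keywords (values). The user can define all categories of the metadata and the keywords of each category in advance, and input them into the knowledge entity recognition device 100 together with the target text. For ease of explanation, the target text referred to below is the sentence “An apple a day keeps the doctor away.” and the metadata is the content shown in Table 1. It should be noted that the present disclosure is not limited to this example.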
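Table 1: Metadata
Category (key) | Keywords (value) |
---|---|
FRUIT | fruit, juicy, tree |
MEAT | animal, hunt |
DESSERT | sugar, sweet |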
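As a minimal sketch, the example metadata can be held as a mapping from each category (key) to its list of keywords (value). The name `METADATA` and the choice of a Python dictionary are illustrative assumptions rather than part of the disclosure; the categories and keywords are the ones used in the example.

```python
# Hypothetical in-memory form of the example metadata: category -> keywords.
METADATA = {
    "FRUIT": ["fruit", "juicy", "tree"],
    "MEAT": ["animal", "hunt"],
    "DESSERT": ["sugar", "sweet"],
}

# Each category (key) carries its own keyword list (value).
for category, keywords in METADATA.items():
    print(category, keywords)
```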
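In one embodiment, the knowledge entity candidate generation module 112 applies natural language processing to extract the nouns or noun phrases of the target text. The extracted nouns or noun phrases serve as the candidate words of the target text. Continuing the example target text “An apple a day keeps the doctor away.”, the candidate words extracted from the target text are “apple”, “day”, and “doctor”. The number of candidate words varies with the content of the target text. In one embodiment, the target text includes one or more candidate words. In this example, there are three candidate words.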
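The noun extraction described above could be sketched as follows. The disclosure does not name a specific tagger, so the hand-written table `TOY_POS_TAGS` is a hypothetical stand-in for a real part-of-speech tagger and covers only the words of the example sentence. The function name `extract_candidates` is likewise an assumption.

```python
import re

# Hypothetical stand-in for a part-of-speech tagger: a hand-written tag table
# that covers only the words of the example sentence.
TOY_POS_TAGS = {
    "an": "DET", "a": "DET", "the": "DET",
    "apple": "NOUN", "day": "NOUN", "doctor": "NOUN",
    "keeps": "VERB", "away": "ADV",
}

def extract_candidates(text):
    """Return the nouns of `text` in order of first appearance, without duplicates."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = []
    for token in tokens:
        if TOY_POS_TAGS.get(token) == "NOUN" and token not in candidates:
            candidates.append(token)
    return candidates

print(extract_candidates("An apple a day keeps the doctor away."))
```

In practice the lookup table would be replaced by the tagger of whichever natural language processing toolkit is in use.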
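In step S220, the knowledge entity candidate generation module 112 compares the candidate words of the target text against the knowledge base 500 to obtain, from the knowledge base 500, a plurality of entity names associated with each candidate word.
In one embodiment, the knowledge entity verification and enhancement module 114 compares the candidate words against the knowledge base 500 one by one. The knowledge base 500 records a plurality of entity records. The data structure of each entity record includes, but is not limited to, a number, an entity name, an entity description, and an entity type, as shown in Table 2.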
Table 2: Knowledge base
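Continuing the example, when the candidate word is “apple”, the knowledge entity verification and enhancement module 114 compares the candidate word “apple” against the knowledge base 500 of Table 2 and obtains a plurality of entity names associated with “apple”, for example “Apple Inc.” (number 0), “Apple” (number 1), “Pineapple” (number 2), and “Apple, Oklahoma” (number 3). In one embodiment, the entity names numbered 0 to 3 can be recorded in a candidate list for the candidate word “apple”. The entity name “Orange” (number N), on the other hand, is neither identical nor similar to the candidate word “apple”, so it is not recorded in the candidate list for “apple”.
In one embodiment, the information retrieval method used to look up and compare a candidate word in the knowledge base 500 may be term frequency–inverse document frequency (tf-idf) or another data mining or term frequency statistics method. The present disclosure is not limited in this respect.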
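The candidate list lookup can be illustrated with a deliberately simple stand-in. The sketch below uses case-insensitive substring matching instead of tf-idf, which is enough to reproduce the example's candidate list. The dictionary `ENTITY_NAMES`, the function `candidate_list`, and the use of number 4 for the entry the text calls "N" are assumptions for illustration.

```python
# Hypothetical excerpt of the knowledge base: entry number -> entity name.
ENTITY_NAMES = {
    0: "Apple Inc.",
    1: "Apple",
    2: "Pineapple",
    3: "Apple, Oklahoma",
    4: "Orange",  # stands in for entry "N" of the example
}

def candidate_list(word, entity_names):
    """Return (number, name) pairs whose name contains `word`, ignoring case."""
    needle = word.lower()
    return [(number, name) for number, name in entity_names.items()
            if needle in name.lower()]

print(candidate_list("apple", ENTITY_NAMES))
```

Substring matching is a simplification. A tf-idf or other statistical retrieval method, as named in the text, would score partial and fuzzy matches rather than accept or reject them outright.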
In step S230, the knowledge entity verification and enhancement module 114 compares the entity description data in the knowledge base 500 with the metadata to obtain a comparison result.
In one embodiment, words are searched for in the entity description data to obtain more descriptive information, which serves as enhancement information for the candidate word and is used later by the knowledge entity classification module 116.
Continuing the example, the candidate list for the candidate word “apple” records four entity names: “Apple Inc.”, “Apple”, “Pineapple”, and “Apple, Oklahoma”. The entity description data corresponding to each entity name in the candidate list is then searched, one entity at a time, using the metadata received in step S210. Take the metadata category “FRUIT” and its keywords “fruit, juicy, tree” as an example (see Table 1 above). The knowledge entity verification and enhancement module 114 searches for the keyword “fruit” in the entity description data corresponding to the entity name “Apple”, namely “An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus.”, determines whether any word matches “fruit”, and adds 1 to the count for each matching word. In this embodiment the category “FRUIT” has three keywords, so the same search and matching is performed for each of the three keywords, and the match counts are summed for the category. For example, the keyword “fruit” of the category “FRUIT” obtains 1 match in the entity description data corresponding to the entity name “Apple”, the keyword “juicy” obtains 0 matches, and the keyword “tree” obtains 2 matches. The total number of matches of the category “FRUIT” for the entity name “Apple” is therefore 3.
Similarly, the keywords “animal, hunt” of the category “MEAT” have a total of 0 matches in the entity description data corresponding to the entity name “Apple”, and the keywords “sugar, sweet” of the category “DESSERT” also have a total of 0 matches. Among the three categories of the metadata input in step S210, the category “FRUIT” therefore has the largest total number of matches, so the metadata category “FRUIT” is the comparison result for the target text. At the same time, the entity name “Apple”, which is most strongly associated with the category “FRUIT”, is set as the most relevant entity name.
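The tally in this example can be sketched as follows. The matching rule is an assumption: a token counts as a match when it equals the keyword or the keyword followed by "s", which is one simple rule that reproduces the 2 matches reported for “tree” (“tree” and “trees”). The names `count_matches` and `category_totals` are illustrative.

```python
import re

METADATA = {
    "FRUIT": ["fruit", "juicy", "tree"],
    "MEAT": ["animal", "hunt"],
    "DESSERT": ["sugar", "sweet"],
}

APPLE_DESCRIPTION = (
    "An apple is an edible fruit produced by an apple tree (Malus domestica). "
    "Apple trees are cultivated worldwide and are the most widely grown "
    "species in the genus Malus."
)

def count_matches(keyword, description):
    """Count tokens equal to `keyword` or to its plain plural (keyword + 's')."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return sum(1 for token in tokens if token in (keyword, keyword + "s"))

def category_totals(metadata, description):
    """Sum the keyword match counts of every category for one description."""
    return {
        category: sum(count_matches(keyword, description) for keyword in keywords)
        for category, keywords in metadata.items()
    }

totals = category_totals(METADATA, APPLE_DESCRIPTION)
best_category = max(totals, key=totals.get)
print(totals, best_category)
```

For the “Apple” description this yields totals of 3, 0, and 0 for FRUIT, MEAT, and DESSERT, so FRUIT is selected, in line with the text.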
In one embodiment, the comparison between the metadata and each entity description may use a similarity measure such as cosine similarity. The metadata is used to search the entity description data in the knowledge base 500, and the similarity measure selects the entity name closest to the metadata.
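A minimal sketch of a cosine similarity comparison is given below, assuming plain bag-of-words vectors over lowercase tokens. The vectorization, the helper names `bag_of_words` and `cosine_similarity`, and the choice to score each category separately are assumptions; the disclosure only names cosine similarity as one possible measure.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts of `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

METADATA = {
    "FRUIT": ["fruit", "juicy", "tree"],
    "MEAT": ["animal", "hunt"],
    "DESSERT": ["sugar", "sweet"],
}
APPLE_DESCRIPTION = (
    "An apple is an edible fruit produced by an apple tree (Malus domestica). "
    "Apple trees are cultivated worldwide and are the most widely grown "
    "species in the genus Malus."
)

description_vector = bag_of_words(APPLE_DESCRIPTION)
scores = {
    category: cosine_similarity(bag_of_words(" ".join(keywords)), description_vector)
    for category, keywords in METADATA.items()
}
print(scores)
```

With exact-token vectors, “trees” does not count toward “tree”. A production system would typically add stemming or weighting such as tf-idf before computing the cosine.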
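In step S240, the knowledge entity classification module 116 sets, according to the comparison result, the entity name associated with the candidate word in the knowledge base 500 as the output category of that candidate word in the target text.
Continuing the example, the comparison result for the category of the candidate word in the target text is “FRUIT”. The knowledge entity classification module 116 then compares this result “FRUIT” with the entity type (namely “Fruits; Malus; Plants”) corresponding to the most relevant entity name (namely “Apple”) in the knowledge base 500. Because the word “Fruits” in the entity type matches the comparison result “FRUIT”, the comparison result “FRUIT” is verified as the output category of the candidate word in the target text.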
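The verification against the entity type can be sketched as below. The normalization rule (ignore case and a trailing "s") and the function name `verify_category` are assumptions chosen so that “FRUIT” matches “Fruits”. The semicolon-separated entity type string follows the example.

```python
def verify_category(category, entity_types):
    """Return True if an entry of the ';'-separated `entity_types` matches
    `category`, ignoring case and a trailing plural 's'."""
    def normalize(label):
        label = label.strip().lower()
        return label[:-1] if label.endswith("s") else label

    normalized_types = {normalize(entry) for entry in entity_types.split(";")}
    return normalize(category) in normalized_types

print(verify_category("FRUIT", "Fruits; Malus; Plants"))
print(verify_category("MEAT", "Fruits; Malus; Plants"))
```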
In one embodiment, the data enhancement result for the candidate word obtained in step S230, together with the categories and keywords that the user defined in advance in the metadata, can be input to a text classification model (not shown in FIG. 1) for classification. This confirms that the candidate word is a knowledge entity of the target text and assigns the knowledge entity to the corresponding category, yielding the final knowledge entity and its category.
Please refer to FIG. 1 and FIG. 3 together. FIG. 3 shows a flowchart of a knowledge entity recognition method 300 according to an embodiment of the present disclosure. The knowledge entity recognition method 300 can be performed by the knowledge entity recognition device 100 of FIG. 1.
In step S310, the knowledge entity verification and enhancement module 114 compares the candidate word of the target text against the knowledge base 500 and obtains a plurality of entity names sorted by similarity.
In one embodiment, the candidate words can be extracted from the target text by the knowledge entity candidate generation module 112 using natural language processing. Continuing the example target text “An apple a day keeps the doctor away.”, the candidate word “apple” is compared against all entity names in the knowledge base 500 of Table 2. The entity name in the knowledge base 500 most similar to the candidate word “apple” has the highest rank. Sorting the entity names by similarity from high to low yields the sorted entity names shown in Table 3: the entity name numbered 1 ranks first, the entity name numbered 0 ranks second, and so on. After the similarity comparison, four sorted entity names are selected from the knowledge base 500 as the data identical or similar to the candidate word.
Table 3
In step S320, the knowledge entity verification and enhancement module 114 compares the keywords of each category of the metadata with the words in the entity description data corresponding to the sorted entity names to obtain a comparison result. In some embodiments, this comparison counts the words in the entity description data that are identical or similar to the keywords of each category, so that each category has a corresponding word match count.
In one embodiment, the metadata includes a plurality of categories, and each category includes a plurality of keywords. For example, the metadata includes a first category “FRUIT”, a second category “MEAT”, and a third category “DESSERT”. The first category “FRUIT” includes the keywords “fruit”, “juicy”, and “tree”. The second category “MEAT” includes the keywords “animal” and “hunt”. The third category “DESSERT” includes the keywords “sugar” and “sweet”.
In one embodiment, the keyword “fruit” is compared against the sorted first entity description data “An apple is an edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus.”, which yields 1 matching word. Similarly, the keywords “juicy” and “tree” are each compared against the first entity description data, yielding 0 and 2 matching words respectively. In other words, the sum of matching words of the first category “FRUIT” for the first entity name is 3. Likewise, the sum of matching words of the second category “MEAT” for the first entity name is 0. The sums of matching words between the keywords of each category and the entity description data of the first entity name “Apple” are shown in Table 4.
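Table 4:
Category | Keywords | Sum of matching words for “Apple” |
---|---|---|
FRUIT | fruit, juicy, tree | 3 |
MEAT | animal, hunt | 0 |
DESSERT | sugar, sweet | 0 |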
In step S330, the knowledge entity verification and enhancement module 114 sets the category with the largest word match count as the output category of the candidate word in the target text.
Continuing the example, the first category has the largest word match count (namely 3), so the first category “FRUIT” is set as the output category of the candidate word in the target text.
It is worth noting that steps S320 and S330 likewise compute, for the sorted second entity name, the sum of word matches of the first, second, and third categories of the metadata, and do the same for the sorted third entity name and the sorted fourth entity name. In other words, every category of the metadata is matched against every sorted entity name to obtain, for each entity name, the sum of word matches of every category. For brevity, the matching steps are not repeated here.
In step S340, the knowledge entity verification and enhancement module 114 compares the output category with the entity type corresponding to the sorted entity name in the knowledge base 500 to verify whether the output category of the candidate word in the target text is correct.
Continuing the example, the first category “FRUIT” has the largest word match count, so the output category of the candidate word in the target text is set to “FRUIT”. In step S340, to verify whether this output category is correct, the output category “FRUIT” is further compared with the first entity type. As shown in Table 3, the first entity type includes “Fruits”, “Malus”, and “Plants”. Because “Fruits” in the first entity type matches the output category “FRUIT”, the output category is verified as correct.
In one embodiment, the knowledge entity recognition device 100 can be implemented as, but is not limited to, a portable electronic device, a mobile phone, a tablet computer, a personal digital assistant (PDA), a wearable device, or a notebook computer.
In one embodiment, the knowledge entity recognition device 100 includes at least a processor (not shown in FIG. 1), a storage medium (not shown in FIG. 1), and an input/output interface (not shown in FIG. 1). The processor operates and controls the knowledge entity candidate generation module 112, the knowledge entity verification and enhancement module 114, and the knowledge entity classification module 116. The storage medium stores a plurality of program instructions and the temporary data produced while the instructions are executed. The input/output interface is coupled to the processor and is used to receive the input data 102 and to send the output data 104.
The processor can be implemented as, but is not limited to, a central processing unit (CPU), a system on chip (SoC), an application processor, an audio processor, a digital signal processor (DSP), or a processing chip or controller with a specific function.
The storage medium can be implemented as, but is not limited to, random access memory (RAM) or non-volatile memory (for example flash memory, read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), or optical storage).
In one embodiment, the text classification model may be an artificial intelligence model built from multiple sub-algorithms, including artificial neural networks (ANN) and supervised learning methods from machine learning, where supervised learning includes algorithms such as support vector machines (SVM), regression analysis, and statistical classification.
In one embodiment, the present disclosure provides a non-transitory computer-readable recording medium that can store a plurality of program codes. After the program codes are loaded into the processor of the knowledge entity recognition device 100 of FIG. 1, the processor executes the program codes and performs the steps of FIG. 2 and FIG. 3.
Compared with the prior art, the knowledge entity recognition method and device of the present disclosure can, for the same number of knowledge entities to be analyzed, identify more knowledge entities and thus achieve a high recall rate. Likewise, for the same number of identified knowledge entities, the present disclosure obtains more correct knowledge entities and thus achieves high precision.
In summary, the present disclosure inputs the metadata together with the target text to be labeled and, after finding an entity name in the knowledge base, further retrieves the entity description data of that entity name as verification. This improves the precision with which knowledge entities in the target text are classified. In addition, the knowledge entity recognition device and method of the present disclosure can be applied to labeling large volumes of documents. When the documents to be labeled belong to a different domain, switching to the corresponding knowledge base is enough to switch domains. For expansion, adding new vocabulary to the knowledge base is enough to update the system. Such a method also reduces the cost of manual labeling and the burden on experts, saves a large amount of manual labeling work, and supports a variety of downstream applications (an input article can be labeled automatically with its categories and keywords).
The above outlines the features of several embodiments so that those skilled in the art can better understand aspects of the present disclosure. Those skilled in the art should appreciate that, without departing from the spirit and scope of the present disclosure, the above can readily be used as a basis for designing or modifying other variations that carry out the same purposes and/or achieve the same advantages as the embodiments described herein. The above should be understood as examples of the present disclosure, and the scope of protection is defined by the claims.
100: Knowledge entity recognition device
102: Input data
104: Output data
112: Knowledge entity candidate generation module
114: Knowledge entity verification and enhancement module
116: Knowledge entity classification module
200, 300: Knowledge entity recognition method
500: Knowledge base
502: Parsing and storage module
600: External general knowledge base
S210~S240, S310~S340: Steps
The following detailed description, when read together with the accompanying drawings, facilitates a better understanding of aspects of the present disclosure. It should be noted that, in accordance with standard practice, the features in the drawings are not necessarily drawn to scale. In fact, the dimensions of the features may be arbitrarily increased or reduced for clarity of discussion.
FIG. 1 is a block diagram of a knowledge entity recognition device according to an embodiment of the present disclosure.
FIG. 2 is a flowchart of a knowledge entity recognition method according to an embodiment of the present disclosure.
FIG. 3 is a flowchart of a knowledge entity recognition method according to an embodiment of the present disclosure.
Domestic deposit information (noted in order of depository institution, date, and number): None. Foreign deposit information (noted in order of depository country, institution, date, and number): None.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW110113786A TWI777496B (en) | 2021-04-16 | 2021-04-16 | Knowledge entity identification method and knowledge entity identification device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW110113786A TWI777496B (en) | 2021-04-16 | 2021-04-16 | Knowledge entity identification method and knowledge entity identification device |
Publications (2)
Publication Number | Publication Date |
---|---|
TWI777496B (en) | 2022-09-11 |
TW202242848A (en) | 2022-11-01 |
Family
ID=84957961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW110113786A TWI777496B (en) | 2021-04-16 | 2021-04-16 | Knowledge entity identification method and knowledge entity identification device |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI777496B (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU1523400A (en) * | 1998-11-12 | 2000-05-29 | Ac Properties B.V. | A system, method and article of manufacture for advanced mobile bargain shopping |
CN101561805B (en) * | 2008-04-18 | 2014-06-25 | 日电(中国)有限公司 | Document classifier generation method and system |
CN101770467B (en) * | 2008-12-31 | 2014-04-09 | 迈克尔·布卢门撒尔 | Method and system for analyzing and ordering data targets capable of visiting web |
WO2013009710A1 (en) * | 2011-07-08 | 2013-01-17 | Steamfunk Labs, Inc. | Automated presentation of information using infographics |
-
2021
- 2021-04-16 TW TW110113786A patent/TWI777496B/en active
Also Published As
Publication number | Publication date |
---|---|
TWI777496B (en) | 2022-09-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
GD4A | Issue of patent certificate for granted invention patent |